
Summary of SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal, by Tinghao Xie et al.


SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

by Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

First submitted to arXiv on: 20 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed SORRY-Bench benchmark evaluates the ability of large language models (LLMs) to recognize and refuse unsafe user requests. Existing evaluation methods face three limitations: coarse-grained taxonomies of unsafe topics, neglect of the linguistic characteristics and formatting of prompts, and reliance on large, computationally expensive LLMs as evaluators. To address these issues, SORRY-Bench uses a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions, compiled with human-in-the-loop methods. It also incorporates 20 diverse linguistic augmentations to examine how different languages and formatting affect prompts. The paper further investigates design choices for building a fast and accurate automated safety evaluator, showing that fine-tuned 7B LLMs achieve accuracy comparable to GPT-4-scale LLMs at much lower computational cost. The authors evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench and analyze their distinctive safety refusal behaviors.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models (LLMs) are designed to understand human language and generate responses, but some users may request unsafe or harmful information. To ensure safe and policy-compliant deployments of LLMs, it is crucial to evaluate their ability to recognize and refuse such requests. The authors propose a new benchmark, SORRY-Bench, that addresses three limitations of existing evaluation methods: coarse-grained taxonomies, neglect of linguistic characteristics, and reliance on computationally expensive LLMs. Using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions compiled through human-in-the-loop methods, the authors show that safety refusal can be evaluated more accurately. They also investigate design choices for building a fast and accurate automated safety evaluator.

Keywords

» Artificial intelligence  » GPT