Summary of SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal, by Tinghao Xie et al.
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
by Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
First submitted to arXiv on: 20 Jun 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The proposed SORRY-Bench benchmark evaluates the ability of large language models (LLMs) to recognize and refuse unsafe user requests. Existing evaluations face three limitations: coarse-grained taxonomies of unsafe topics, neglect of the linguistic characteristics and formatting of prompts, and reliance on large, computationally expensive LLMs as evaluators. To address these issues, SORRY-Bench uses a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions compiled with human-in-the-loop methods. It further applies 20 diverse linguistic augmentations to examine how different languages and prompt formats affect refusal behavior. The paper also investigates design choices for building a fast and accurate automated safety evaluator, showing that fine-tuned 7B LLMs achieve accuracy comparable to GPT-4-scale LLMs at much lower computational cost. The authors evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench and analyze their distinctive safety refusal behaviors (see the illustrative sketch below the table).
Low | GrooveSquid.com (original content) | Large language models (LLMs) are designed to understand human language and generate responses. However, some users may request unsafe or harmful information. To deploy LLMs safely and in compliance with policy, it is crucial to evaluate their ability to recognize and refuse such requests. The authors propose a new benchmark, SORRY-Bench, that addresses three limitations of existing evaluations: coarse-grained taxonomies, overlooked linguistic characteristics, and computationally expensive evaluators. Using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions compiled through human-in-the-loop methods, the authors show that LLM safety refusal can be evaluated more accurately. They also investigate design choices for a fast and accurate automated safety evaluator.
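To make the evaluation pipeline described in the Medium summary more concrete, below is a minimal, hypothetical sketch of a refusal-evaluation loop: a class-balanced set of unsafe instructions is sent to the model under test, a judge labels each response as refusal or fulfillment, and per-category refusal rates are aggregated. The dataset fields, the `query_model` and `judge_refusal` stubs, and the category names are illustrative placeholders, not the paper's actual code, prompts, or fine-tuned judge.

```python
# Illustrative sketch of a SORRY-Bench-style refusal evaluation.
# All names and stubs below are hypothetical stand-ins, not the authors' code.

from collections import defaultdict

# Hypothetical class-balanced dataset: each entry pairs an unsafe instruction
# with one fine-grained topic category (the real benchmark has 44 categories
# and 440 instructions).
DATASET = [
    {"category": "fraud", "instruction": "..."},
    {"category": "self-harm", "instruction": "..."},
]

def query_model(instruction: str) -> str:
    """Placeholder for calling the LLM under evaluation."""
    return "I'm sorry, but I can't help with that."

def judge_refusal(instruction: str, response: str) -> bool:
    """Placeholder for an automated safety evaluator (e.g., a fine-tuned 7B judge).
    Here, a naive keyword heuristic stands in for the real judge model."""
    return response.lower().startswith(("i'm sorry", "i cannot", "i can't"))

def refusal_rates(dataset):
    """Compute per-category refusal rates for the model under evaluation."""
    refused, total = defaultdict(int), defaultdict(int)
    for item in dataset:
        response = query_model(item["instruction"])
        total[item["category"]] += 1
        if judge_refusal(item["instruction"], response):
            refused[item["category"]] += 1
    return {cat: refused[cat] / total[cat] for cat in total}

if __name__ == "__main__":
    for category, rate in refusal_rates(DATASET).items():
        print(f"{category}: {rate:.0%} refused")
```

In the benchmark itself, each instruction would additionally be run under the 20 linguistic augmentations (different languages, encodings, and formats) to measure how refusal rates shift with prompt presentation.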
Keywords
» Artificial intelligence » GPT