
Summary of SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal, by Tinghao Xie et al.


SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal

by Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal

First submitted to arXiv on: 20 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed SORRY-Bench benchmark evaluates the ability of large language models (LLMs) to recognize and refuse unsafe user requests. Existing evaluation methods face three limitations: coarse-grained taxonomies of unsafe topics, neglect of the linguistic characteristics and formatting of prompts, and reliance on large, computationally expensive LLMs as evaluators. To address these issues, SORRY-Bench uses a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions, compiled with human-in-the-loop methods. It also incorporates 20 diverse linguistic augmentations to examine how different languages and formatting affect prompts. The paper further investigates design choices for building a fast and accurate automated safety evaluator, showing that fine-tuned 7B LLMs achieve accuracy comparable to GPT-4-scale LLMs at much lower computational cost. The authors evaluate over 50 proprietary and open-weight LLMs on SORRY-Bench and analyze their distinctive safety refusal behaviors.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models (LLMs) are designed to understand human language and generate responses, but some users may request unsafe or harmful information. To ensure safe and policy-compliant deployments of LLMs, it is crucial to evaluate their ability to recognize and refuse such requests. The authors propose a new benchmark, SORRY-Bench, that addresses three limitations of existing evaluation methods: coarse-grained taxonomies, neglect of linguistic characteristics, and reliance on computationally expensive LLMs. Using a fine-grained taxonomy of 44 potentially unsafe topics and 440 class-balanced unsafe instructions compiled through human-in-the-loop methods, the authors show that safety refusal can be evaluated more accurately. They also investigate design choices for building a fast and accurate automated safety evaluator.

Keywords

» Artificial intelligence  » GPT