Summary of Navigating the OverKill in Large Language Models, by Chenyu Shi et al.
Navigating the OverKill in Large Language Models
by Chenyu Shi, Xiao Wang, Qiming Ge, Songyang Gao, Xianjun Yang, Tao Gui, Qi Zhang, Xuanjing Huang, Xun Zhao, Dahua Lin
First submitted to arXiv on: 31 Jan 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper explores the phenomenon of large language models becoming overly cautious and refusing to answer benign queries, an issue the paper terms overkill. The authors investigate the factors behind this behavior by examining how models process queries and judge their safety. The study reveals that shortcuts inside the models lead them to over-attend to harmful-sounding words such as 'kill', and that system prompts emphasizing safety can exacerbate the overkill. To alleviate the issue, the authors introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy. Self-CD first exposes the over-attention by amplifying the difference between the model's output distributions when it responds with and without a safety-emphasizing system prompt. It then downplays that over-attention via contrastive decoding to produce the final next-token predictions (see the sketch after this table). Empirically, Self-CD achieves an average 20% reduction in the refusal rate while having almost no impact on safety. |
| Low | GrooveSquid.com (original content) | Large language models are super smart and can help us with lots of things! But sometimes they get too worried about saying something wrong and refuse to answer simple questions. This paper figures out why that happens and comes up with a way to make the models less afraid. The main idea is that these models have shortcuts inside them that make them pay too much attention to certain words, which makes them more likely to refuse instead of giving a helpful answer. To fix this, the researchers create a new way for the models to pick their answers, called Self-Contrastive Decoding (Self-CD), which works without any extra training. This method helps the models be less scared and answer questions correctly. |
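The decoding step described in the medium summary can be sketched in code. The snippet below is a minimal illustration, not the authors' implementation: it assumes a causal language model exposed as a callable that returns next-token logits, and the function name `self_cd_step`, the contrast weight `alpha`, and the greedy token choice are illustrative assumptions rather than details taken from the paper.

```python
import torch

def self_cd_step(model, plain_ids, safety_ids, alpha=0.5):
    """Choose the next token by contrasting two runs of the same model.

    Assumptions: `model` is a callable mapping a [batch, seq] tensor of
    token ids to [batch, seq, vocab] logits; `plain_ids` and `safety_ids`
    encode the same user query without and with a safety-emphasizing
    system prompt.
    """
    with torch.no_grad():
        logits_plain = model(plain_ids)[:, -1, :]    # no safety emphasis
        logits_safety = model(safety_ids)[:, -1, :]  # safety-emphasizing system prompt

    # The gap between the two runs approximates the model's over-attention
    # to safety; subtracting a scaled copy of it downplays that attention.
    over_attention = logits_safety - logits_plain
    contrasted = logits_plain - alpha * over_attention

    # Greedy selection for simplicity; sampling from the contrasted
    # distribution would work the same way.
    return torch.argmax(contrasted, dim=-1)
```

In a full generation loop, the selected token would be appended to both input sequences before calling `self_cd_step` again.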
Keywords
* Artificial intelligence
* Attention
* Token