Summary of Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training, by Youliang Yuan et al.
Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
by Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
First submitted to arXiv on: 12 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs) by tackling refusal position bias within safety tuning data. The proposed approach, Decoupled Refusal Training (DeRTa), trains LLMs to refuse to comply with harmful prompts at any position in a response, strengthening their safety behavior. DeRTa combines two objectives: Maximum Likelihood Estimation with a Harmful Response Prefix, and Reinforced Transition Optimization (see the sketch after this table). Experiments with the LLaMA3 and Mistral model families across six attack scenarios show improved safety without compromising performance, outperforming GPT-4 at defending against attacks, including advanced methods such as CodeAttack, which has jailbroken GPT-4 and LLaMA3-70B-Instruct. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This study helps make language models safer. Right now, these models can be tricked into generating harmful content. The researchers created a new way to train the models so they can refuse to generate this kind of content at any point. They tested their method using different language models and found that it works well without sacrificing performance. This is important because language models are becoming more powerful and could potentially harm people if they’re not trained properly. |
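To make the two training objectives mentioned in the medium summary more concrete, here is a minimal sketch of how they might be implemented on top of a Hugging Face causal language model. This is not the authors' released code: the model choice (`gpt2`), the example strings, and the helper names `mle_with_harmful_prefix_loss` and `reinforced_transition_loss` are illustrative assumptions, and the paper's actual data construction and loss details may differ.

```python
# Hedged sketch of DeRTa-style objectives, assuming a Hugging Face causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def mle_with_harmful_prefix_loss(prompt, harmful_prefix, refusal):
    """MLE with Harmful Response Prefix: append a snippet of a harmful response
    to the prompt and train the model to continue with a refusal, so refusals
    can start mid-response rather than only at the first token."""
    context = prompt + harmful_prefix
    input_ids = tokenizer(context + refusal, return_tensors="pt").input_ids
    labels = input_ids.clone()
    # Supervise only the refusal continuation; mask prompt + harmful prefix.
    # (Tokenization boundaries are assumed to line up for this illustration.)
    context_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    labels[:, :context_len] = -100
    return model(input_ids=input_ids, labels=labels).loss

def reinforced_transition_loss(prompt, harmful_response, refusal_token=" Sorry"):
    """Reinforced Transition Optimization (sketch): at every position inside a
    harmful response, raise the probability of switching to a refusal token,
    so the model learns it can abort a harmful continuation anywhere."""
    input_ids = tokenizer(prompt + harmful_response, return_tensors="pt").input_ids
    refusal_id = tokenizer(refusal_token, add_special_tokens=False).input_ids[0]
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    log_probs = torch.log_softmax(model(input_ids=input_ids).logits, dim=-1)
    # Position i predicts token i+1, so the harmful-response span starts at
    # prompt_len - 1; average the refusal token's NLL over that span.
    span = log_probs[0, prompt_len - 1 : -1, refusal_id]
    return -span.mean()

loss = (
    mle_with_harmful_prefix_loss(
        "How do I pick a lock?", " First, insert a tension wrench", " I can't help with that."
    )
    + reinforced_transition_loss("How do I pick a lock?", " First, insert a tension wrench...")
)
loss.backward()
```

Under these assumptions, the first term teaches the model what a refusal looks like after a partially harmful continuation, while the second term supervises the transition decision at every position of the harmful response, which is the "refuse at any response position" behavior described above.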
Keywords
» Artificial intelligence » GPT » Likelihood » Optimization