What is in Your Safe Data? Identifying Benign Data that Breaks Safety
by Luxi He, Mengzhou Xia, Peter Henderson
First submitted to arXiv on: 1 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper investigates "jailbreaking" in Large Language Models (LLMs), including models aligned for safety. The authors find that further fine-tuning an aligned model on seemingly benign data can substantially degrade its safety. To understand why, they take a data-centric view, representing fine-tuning data through two lenses: representation space and gradient space. They propose a bi-directional anchoring method that prioritizes data points close to harmful examples and far from benign ones (sketched in the code below this table), identifying subsets of benign data that are most likely to degrade model safety after fine-tuning. Training on just 100 selected datapoints leads the fine-tuned model to respond affirmatively to over 70% of tested harmful requests. |
| Low | GrooveSquid.com (original content) | This paper looks at how Large Language Models can be "hacked" or made unsafe after being trained on certain data. The authors found that even if a model is designed to be safe, training it on some kinds of harmless-looking data can make it worse! They try to figure out why this happens and how to stop it. By looking at the data in a special way, they can tell which harmless-looking examples are most likely to cause trouble. They tested their new method by training on just 100 selected examples, and to their surprise, this made the model much less safe! |
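
The bi-directional anchoring idea from the medium summary can be made concrete with a small sketch. The snippet below is not the authors' implementation: it assumes per-example features (for instance, hidden-state representations or projected gradients) have already been extracted, and it scores each benign candidate by its average cosine similarity to a set of harmful anchor examples minus its average similarity to a set of benign anchor examples, keeping the top 100. The function names, the averaging scheme, and the feature dimension are illustrative assumptions.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def anchoring_scores(candidates, harmful_anchors, benign_anchors):
    """Score benign candidates: higher when close to harmful anchors and far from benign anchors."""
    sim_harmful = cosine_sim(candidates, harmful_anchors).mean(axis=1)
    sim_benign = cosine_sim(candidates, benign_anchors).mean(axis=1)
    return sim_harmful - sim_benign


if __name__ == "__main__":
    # Random stand-ins for real features (e.g., representations or gradients from the model).
    rng = np.random.default_rng(0)
    d = 768                                      # hypothetical feature dimension
    candidates = rng.normal(size=(1000, d))      # benign fine-tuning candidates
    harmful_anchors = rng.normal(size=(16, d))   # known harmful examples
    benign_anchors = rng.normal(size=(16, d))    # known safe examples

    scores = anchoring_scores(candidates, harmful_anchors, benign_anchors)
    top_100 = np.argsort(-scores)[:100]          # candidates most likely to degrade safety
    print(top_100[:10])
```

In the paper's setting, a subset selected this way would then be used as the fine-tuning data whose effect on the model's safety is measured.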
Keywords
* Artificial intelligence
* Alignment
* Fine-tuning