What is in Your Safe Data? Identifying Benign Data that Breaks Safety
by Luxi He, Mengzhou Xia, Peter Henderson
First submitted to arXiv on: 1 Apr 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper investigates "jailbreaking" in Large Language Models (LLMs), including models aligned for safety. The authors find that further fine-tuning an aligned model on seemingly benign data can substantially degrade its safety. To understand why, they take a data-centric view, representing fine-tuning data through two lenses: representation space and gradient space. They propose a bi-directional anchoring method that prioritizes data points close to harmful examples and far from benign ones (sketched in the code below this table), identifying subsets of benign data that are most likely to degrade model safety after fine-tuning. Training on just 100 selected datapoints leads the fine-tuned model to respond affirmatively to over 70% of tested harmful requests. |
| Low | GrooveSquid.com (original content) | This paper looks at how Large Language Models can be "hacked" or made unsafe after being trained on certain data. The authors found that even if a model is designed to be safe, training it on some kinds of harmless-looking data can make it worse! They try to figure out why this happens and how to stop it. By looking at the data in a special way, they can tell which harmless-looking examples are most likely to cause trouble. They tested their new method by training on just 100 selected examples, and to their surprise, this made the model much less safe! |
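
The bi-directional anchoring idea from the medium summary can be made concrete with a small sketch. The snippet below is not the authors' implementation: it assumes per-example features (for instance, hidden-state representations or projected gradients) have already been extracted, and it scores each benign candidate by its average cosine similarity to a set of harmful anchor examples minus its average similarity to a set of benign anchor examples, keeping the top 100. The function names, the averaging scheme, and the feature dimension are illustrative assumptions.

```python
import numpy as np


def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between rows of a (n, d) and rows of b (m, d)."""
    a_norm = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_norm = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a_norm @ b_norm.T


def anchoring_scores(candidates, harmful_anchors, benign_anchors):
    """Score benign candidates: higher when close to harmful anchors and far from benign anchors."""
    sim_harmful = cosine_sim(candidates, harmful_anchors).mean(axis=1)
    sim_benign = cosine_sim(candidates, benign_anchors).mean(axis=1)
    return sim_harmful - sim_benign


if __name__ == "__main__":
    # Random stand-ins for real features (e.g., representations or gradients from the model).
    rng = np.random.default_rng(0)
    d = 768                                      # hypothetical feature dimension
    candidates = rng.normal(size=(1000, d))      # benign fine-tuning candidates
    harmful_anchors = rng.normal(size=(16, d))   # known harmful examples
    benign_anchors = rng.normal(size=(16, d))    # known safe examples

    scores = anchoring_scores(candidates, harmful_anchors, benign_anchors)
    top_100 = np.argsort(-scores)[:100]          # candidates most likely to degrade safety
    print(top_100[:10])
```

In the paper's setting, a subset selected this way would then be used as the fine-tuning data whose effect on the model's safety is measured.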
Keywords
* Artificial intelligence
* Alignment
* Fine-tuning