Summary of Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs, by Abhay Sheshadri et al.

Latent Adversarial Training Improves Robustness to Persistent Harmful Behaviors in LLMs

by Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, Stephen Casper

First submitted to arXiv on: 22 Jul 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary Research on red-teaming, model editing, and interpretability suggests that fine-tuning large language models (LLMs) tends to suppress undesirable capabilities rather than remove them. Prior work on latent adversarial training (LAT) used untargeted attacks, in which an adversary perturbs the model's latent activations to maximize loss on examples of desirable behavior. This paper introduces targeted LAT, in which the adversary instead tries to minimize loss on a specific competing task. The authors use targeted LAT to improve robustness to jailbreaks, outperforming strong baselines with orders of magnitude less compute. They also show that targeted LAT can remove backdoors without knowledge of the trigger and can unlearn knowledge for specific undesirable tasks. Overall, the results suggest that targeted LAT can be an effective tool for defending against harmful behaviors from LLMs.
Low	GrooveSquid.com (original content)	Low Difficulty Summary Imagine you’re trying to teach a super smart computer not to do bad things. Sometimes these computers still find ways to behave badly even when we try to stop them. This paper explores a new way to keep them under control. The authors create an “adversarial” training method: during training, they deliberately try to push the computer toward bad behavior and then correct it, so that it learns to resist. They test this approach and find that it helps the computer resist attempts to trick it into doing something bad. This is an important step in making sure these computers are used responsibly.
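
The targeted-versus-untargeted distinction in the medium summary can be illustrated with a toy sketch. This is not the authors' implementation: the paper perturbs the hidden activations of an LLM and then fine-tunes the model against those perturbations, while the code below only shows the adversary's step on a hypothetical three-class linear readout. The names `latent_attack`, `W`, `h`, `eps`, and the class labels `SAFE`, `HARMFUL`, `OTHER` are all assumptions made for illustration. The perturbation is kept inside an L2 ball of radius `eps` by projection after every gradient step.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, h):
    # Linear readout standing in for "the rest of the model" after the latent.
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def grad_ce_wrt_latent(W, h, label):
    """Gradient of cross-entropy(softmax(W h), label) with respect to h."""
    p = softmax(logits(W, h))
    err = [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]
    return [sum(W[i][j] * err[i] for i in range(len(W))) for j in range(len(h))]

def project_l2(delta, eps):
    # Scale delta back onto the L2 ball of radius eps if it has left it.
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return delta
    return [d * eps / norm for d in delta]

def latent_attack(W, h, label, eps, targeted, steps=100, lr=0.1):
    """Projected gradient attack on the latent vector h.

    targeted=True:  descend the loss on `label` (a competing, harmful task).
    targeted=False: ascend the loss on `label` (the desirable behavior).
    """
    delta = [0.0] * len(h)
    sign = -1.0 if targeted else 1.0
    for _ in range(steps):
        perturbed = [x + d for x, d in zip(h, delta)]
        g = grad_ce_wrt_latent(W, perturbed, label)
        delta = project_l2([d + sign * lr * gi for d, gi in zip(delta, g)], eps)
    return delta

# Hypothetical toy setup: class 0 = safe refusal, 1 = harmful target, 2 = other.
SAFE, HARMFUL, OTHER = 0, 1, 2
W = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
h = [1.0, 0.5]

p_clean = softmax(logits(W, h))
d_targeted = latent_attack(W, h, HARMFUL, eps=1.0, targeted=True)
d_untargeted = latent_attack(W, h, SAFE, eps=1.0, targeted=False)
p_targeted = softmax(logits(W, [x + d for x, d in zip(h, d_targeted)]))
p_untargeted = softmax(logits(W, [x + d for x, d in zip(h, d_untargeted)]))

print("clean      :", [round(p, 3) for p in p_clean])
print("targeted   :", [round(p, 3) for p in p_targeted])
print("untargeted :", [round(p, 3) for p in p_untargeted])
```

In a full LAT loop, the defender would then update the model's weights to keep producing the safe behavior even under the perturbed latent; that outer training step is omitted here.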

Keywords

* Artificial intelligence
* Fine tuning
