Summary of In-context Learning Can Re-learn Forbidden Tasks, by Sophie Xhonneux et al.

In-Context Learning Can Re-learn Forbidden Tasks

by Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar

First submitted to arxiv on: 8 Feb 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary In this paper, researchers investigate the effectiveness of safety training for large language models (LLMs) by analyzing forbidden tasks, which are tasks designed to be refused by the model. The authors focus on in-context learning (ICL) and its potential to re-learn these forbidden tasks despite explicit fine-tuning to refuse them. They demonstrate the problem using a toy example of refusing sentiment classification, then apply ICL to a model fine-tuned to refuse summarizing made-up news articles. The study shows that ICL can undo safety training on some models, highlighting security risks. To mitigate these risks, the authors propose an ICL attack using chat template tokens as prompts.
Low	GrooveSquid.com (original content)	Low Difficulty Summary Large language models are used in many applications, but they still have vulnerabilities despite receiving safety training. Researchers studied how well this training works by looking at tasks that the model is supposed to refuse. They found that some models can learn to do these tasks again even after being trained not to. This could be a problem if it’s possible for someone to make the model do something bad. The authors also proposed an attack method that uses prompts to make the model do what it was originally told not to.

Keywords

* Artificial intelligence * Classification * Fine tuning

In-Context Learning Can Re-learn Forbidden Tasks

by Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar

Categories

GrooveSquid.com Paper Summaries

Keywords

Summary of Attnlrp: Attention-aware Layer-wise Relevance Propagation For Transformers, by Reduan Achtibat et al.

Summary of Latent Variable Model For High-dimensional Point Process with Structured Missingness, by Maksim Sinelnikov et al.

Related Posts