Loading Now

Summary of In-context Learning Can Re-learn Forbidden Tasks, by Sophie Xhonneux et al.


In-Context Learning Can Re-learn Forbidden Tasks

by Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar

First submitted to arxiv on: 8 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Cryptography and Security (cs.CR)

     Abstract of paper      PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
In this paper, researchers investigate the effectiveness of safety training for large language models (LLMs) by analyzing forbidden tasks, which are tasks designed to be refused by the model. The authors focus on in-context learning (ICL) and its potential to re-learn these forbidden tasks despite explicit fine-tuning to refuse them. They demonstrate the problem using a toy example of refusing sentiment classification, then apply ICL to a model fine-tuned to refuse summarizing made-up news articles. The study shows that ICL can undo safety training on some models, highlighting security risks. To mitigate these risks, the authors propose an ICL attack using chat template tokens as prompts.
Low GrooveSquid.com (original content) Low Difficulty Summary
Large language models are used in many applications, but they still have vulnerabilities despite receiving safety training. Researchers studied how well this training works by looking at tasks that the model is supposed to refuse. They found that some models can learn to do these tasks again even after being trained not to. This could be a problem if it’s possible for someone to make the model do something bad. The authors also proposed an attack method that uses prompts to make the model do what it was originally told not to.

Keywords

* Artificial intelligence  * Classification  * Fine tuning