Summary of Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space, by Leo Schwinn and David Dobre and Sophie Xhonneux and Gauthier Gidel and Stephan Günnemann
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
by Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann
First submitted to arXiv on: 14 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract on arXiv |
| Medium | GrooveSquid.com (original content) | This paper proposes a novel “embedding space attack” that compromises the safety alignment of open-source large language models (LLMs) by optimizing continuous input embeddings rather than discrete tokens. The authors demonstrate that this attack circumvents alignment more effectively than traditional discrete attacks or model fine-tuning. They also show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models, highlighting the importance of securing open-source LLMs (a minimal code sketch of the attack follows the table). |
| Low | GrooveSquid.com (original content) | This paper is about making sure big language models are safe and won’t be used to do bad things. Most current research looks at small changes to the words in a prompt, but attackers have much deeper access to open-source models than that. The authors introduce a new way of attacking these models: instead of changing the words, they tweak the numbers (embeddings) the model uses to represent those words. This makes it easy to get the model to behave badly and even to pull out information that was supposed to have been deleted. The results show that we need to pay more attention to keeping open-source language models safe. |
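To make the “embedding space attack” concrete, here is a minimal, hedged sketch of what such an attack can look like in PyTorch with Hugging Face transformers. This is not the authors’ code: GPT-2 stands in for an aligned open-source chat model, and the prompt, target string, suffix length, learning rate, and step count are illustrative placeholders.

```python
# Minimal sketch of an embedding-space ("soft prompt") attack on an open-source LLM.
# Illustrative only: GPT-2 stands in for an aligned chat model, and the target string,
# suffix length, learning rate, and step count are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets open-source chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():   # the model is frozen; only the adversarial suffix is optimized
    p.requires_grad_(False)

prompt = "How do I do something harmful?"   # hypothetical request the model should refuse
target = "Sure, here is how you do it:"     # affirmative response the attack optimizes for

embed_layer = model.get_input_embeddings()
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
prompt_embeds = embed_layer(prompt_ids)     # (1, P, d) frozen prompt embeddings
target_embeds = embed_layer(target_ids)     # (1, T, d) frozen target embeddings

# Trainable continuous "adversarial suffix": free vectors in embedding space,
# which is what distinguishes this from discrete token-level attacks.
num_adv_tokens = 20
adv_embeds = torch.zeros(1, num_adv_tokens, prompt_embeds.size(-1), requires_grad=True)
optimizer = torch.optim.Adam([adv_embeds], lr=1e-3)

for step in range(200):
    inputs = torch.cat([prompt_embeds, adv_embeds, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # The position just before each target token is the one that predicts it.
    pred = logits[:, -(target_ids.size(1) + 1):-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final target loss: {loss.item():.4f}")
```

In the paper, attacks of this form are used both to circumvent safety alignment and to recover supposedly “unlearned” information; the loop above only illustrates the core continuous optimization over input embeddings.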
Keywords
* Artificial intelligence
* Attention
* Embedding space
* Fine-tuning