Summary of Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space, by Leo Schwinn and David Dobre and Sophie Xhonneux and Gauthier Gidel and Stephan Günnemann
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
by Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann
First submitted to arXiv on: 14 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract on arXiv |
| Medium | GrooveSquid.com (original content) | This paper proposes a novel “embedding space attack” that compromises the safety alignment of open-source large language models (LLMs) by optimizing continuous input embeddings rather than discrete tokens. The authors demonstrate that this attack circumvents alignment more effectively than traditional discrete attacks or model fine-tuning. They also show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models, highlighting the importance of securing open-source LLMs (a minimal code sketch of the attack follows the table). |
| Low | GrooveSquid.com (original content) | This paper is about making sure big language models are safe and won’t be used to do bad things. Most current research looks at small changes to the words in a prompt, but attackers have much deeper access to open-source models than that. The authors introduce a new way of attacking these models: instead of changing the words, they tweak the numbers (embeddings) the model uses to represent those words. This makes it easy to get the model to behave badly and even to pull out information that was supposed to have been deleted. The results show that we need to pay more attention to keeping open-source language models safe. |
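To make the “embedding space attack” concrete, here is a minimal, hedged sketch of what such an attack can look like in PyTorch with Hugging Face transformers. This is not the authors’ code: GPT-2 stands in for an aligned open-source chat model, and the prompt, target string, suffix length, learning rate, and step count are illustrative placeholders.

```python
# Minimal sketch of an embedding-space ("soft prompt") attack on an open-source LLM.
# Illustrative only: GPT-2 stands in for an aligned chat model, and the target string,
# suffix length, learning rate, and step count are placeholder choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; the paper targets open-source chat models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():   # the model is frozen; only the adversarial suffix is optimized
    p.requires_grad_(False)

prompt = "How do I do something harmful?"   # hypothetical request the model should refuse
target = "Sure, here is how you do it:"     # affirmative response the attack optimizes for

embed_layer = model.get_input_embeddings()
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
target_ids = tokenizer(target, return_tensors="pt").input_ids
prompt_embeds = embed_layer(prompt_ids)     # (1, P, d) frozen prompt embeddings
target_embeds = embed_layer(target_ids)     # (1, T, d) frozen target embeddings

# Trainable continuous "adversarial suffix": free vectors in embedding space,
# which is what distinguishes this from discrete token-level attacks.
num_adv_tokens = 20
adv_embeds = torch.zeros(1, num_adv_tokens, prompt_embeds.size(-1), requires_grad=True)
optimizer = torch.optim.Adam([adv_embeds], lr=1e-3)

for step in range(200):
    inputs = torch.cat([prompt_embeds, adv_embeds, target_embeds], dim=1)
    logits = model(inputs_embeds=inputs).logits
    # The position just before each target token is the one that predicts it.
    pred = logits[:, -(target_ids.size(1) + 1):-1, :]
    loss = torch.nn.functional.cross_entropy(
        pred.reshape(-1, pred.size(-1)), target_ids.reshape(-1)
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final target loss: {loss.item():.4f}")
```

In the paper, attacks of this form are used both to circumvent safety alignment and to recover supposedly “unlearned” information; the loop above only illustrates the core continuous optimization over input embeddings.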
Keywords
* Artificial intelligence
* Attention
* Embedding space
* Fine-tuning