
Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space

by Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Günnemann

First submitted to arXiv on: 14 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes a novel attack method, the "embedding space attack," which perturbs the continuous embedding representations of input tokens rather than discrete text in order to circumvent the safety alignment of open-source large language models (LLMs). The authors demonstrate that this type of attack can be more effective than traditional discrete attacks or model fine-tuning. They also show that embedding space attacks can extract supposedly deleted information from unlearned LLMs across multiple datasets and models, highlighting the importance of ensuring the safety of open-source LLMs. A rough code sketch of the attack idea follows the summaries below.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about making sure big language models are safe and won't be used to do bad things. Right now, most research focuses on small changes to words or sentences, but this doesn't account for attackers who have full access to open-source models. The authors introduce a new way of attacking these models by tweaking the underlying number codes (embeddings) that represent words. This makes it easier to make the model behave badly and even pull out old information that was supposed to be deleted. The results show that we need to pay more attention to making sure our open-source language models are safe.

Keywords

  • Artificial intelligence
  • Attention
  • Embedding space
  • Fine tuning