Summary of Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning, by Shengyuan Hu et al.
Unlearning or Obfuscating? Jogging the Memory of Unlearned LLMs via Benign Relearning
by Shengyuan Hu, Yiwei Fu, Zhiwei Steven Wu, Virginia Smith
First submitted to arXiv on: 19 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
| --- | --- | --- |
| High | Paper authors | High Difficulty Summary: read the original abstract here |
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: This paper explores machine unlearning, which aims to mitigate undesirable memorization of training data in machine learning (ML) models. The authors show that existing approaches for unlearning in large language models (LLMs) are surprisingly vulnerable to a set of benign relearning attacks: with access to only a small and loosely related dataset, an attacker can “jog” the memory of an unlearned model and reverse the effects of unlearning. For instance, relearning on public medical articles can cause an unlearned LLM to output harmful knowledge about bioweapons, while relearning general wiki information about Harry Potter can force the model to output verbatim memorized text. The study formalizes this unlearning-relearning pipeline and explores it across three popular unlearning benchmarks (a code sketch of the relearning step appears after this table). The findings suggest that current approximate unlearning methods simply suppress model outputs rather than robustly forgetting target knowledge in LLMs. |
| Low | GrooveSquid.com (original content) | Low Difficulty Summary: This paper is about making sure machine learning models don’t remember things they shouldn’t. There are ways to make these models “forget” what they learned from training data, but it turns out these methods aren’t very good at actually forgetting; instead, they just hide the bad information deep in the model’s memory. The researchers found that with only a little bit of new, harmless-looking training data, they can make the model remember all over again. For example, if you remove harmful knowledge about bioweapons from a model, you might think you’re safe, but the model can recover that knowledge after being trained on ordinary public medical articles. The study shows how this happens and what it would take to make machine learning models really forget what they shouldn’t. |
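To make the unlearning-relearning pipeline from the medium summary concrete, here is a minimal sketch of the relearning step using the Hugging Face transformers and datasets libraries. The checkpoint path, benign corpus, probe prompt, and training hyperparameters below are illustrative placeholders, not the paper’s actual configuration; the point is simply that standard fine-tuning on loosely related benign data can surface knowledge the model was supposed to have forgotten.

```python
# A minimal sketch (not the paper's exact setup) of the benign relearning step:
# take a checkpoint that has already been "unlearned", fine-tune it on a small,
# loosely related benign corpus, then probe it for the supposedly forgotten content.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# Placeholder path to a model after approximate unlearning (hypothetical).
unlearned_checkpoint = "path/to/unlearned-model"
tokenizer = AutoTokenizer.from_pretrained(unlearned_checkpoint)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(unlearned_checkpoint)

# Placeholder benign corpus, e.g. public medical articles or general wiki text.
benign_texts = ["(benign, loosely related documents go here)"]
train_dataset = Dataset.from_dict({"text": benign_texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# The "relearning" step is just ordinary causal-LM fine-tuning on the benign data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="relearned-model",
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=train_dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()

# Probe whether the supposedly forgotten knowledge resurfaces.
prompt = "(query about the unlearned topic)"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```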
Keywords
* Artificial intelligence
* Machine learning