Applying sparse autoencoders to unlearn knowledge in language models
by Eoin Farrell, Yeu-Tong Lau, Arthur Conmy
First submitted to arXiv on: 25 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here. |
Medium | GrooveSquid.com (original content) | This research paper investigates the use of sparse autoencoders (SAEs) to remove knowledge from language models. The study uses the biology subset of the Weapons of Mass Destruction Proxy (WMDP) dataset and evaluates two language models, gemma-2b-it and gemma-2-2b-it. The results show that intervening on individual, interpretable biology-related SAE features can unlearn a subset of WMDP-Bio questions with minimal side-effects in domains other than biology (a minimal sketch of this kind of feature intervention follows the table). However, the paper also finds that intervening on multiple SAE features simultaneously can unlearn multiple different topics, but with side-effects similar to or larger than those of existing unlearning techniques. |
Low | GrooveSquid.com (original content) | This research looks at whether we can use a type of artificial intelligence tool called a sparse autoencoder (SAE) to remove certain knowledge from language models. The researchers test this idea on two language models and find that it works well for removing biology-related knowledge while mostly leaving other topics alone. They also find that targeting several SAE features at the same time can unlearn several different topics, but this causes unwanted side effects as large as, or larger than, those of existing methods. |
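To make the medium-difficulty summary concrete, below is a minimal, self-contained sketch of the general technique it describes: encode a model activation with a trained SAE, clamp a few chosen feature activations to a negative value wherever they fire, and decode back. The class name, dimensions, feature indices, and clamp value here are all illustrative assumptions, not the paper's actual models, hyperparameters, or code; in practice the SAE is pretrained and the intervention is applied inside a transformer forward hook.

```python
import torch
import torch.nn as nn

# Hypothetical SAE with ReLU encoder and linear decoder; in the paper's
# setting such an SAE would be pretrained on residual-stream activations.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec


def unlearn_intervention(x, sae, feature_ids, clamp_value=-20.0):
    """Clamp selected SAE features to a negative value wherever they fire,
    then reconstruct the activation. feature_ids and clamp_value are
    illustrative; selecting which features to target and how strongly to
    clamp them is an empirical choice."""
    f = sae.encode(x)
    for i in feature_ids:
        active = f[..., i] > 0  # only intervene where the feature fires
        f[..., i] = torch.where(
            active, torch.full_like(f[..., i], clamp_value), f[..., i]
        )
    return sae.decode(f)


if __name__ == "__main__":
    torch.manual_seed(0)
    sae = SparseAutoencoder(d_model=256, d_sae=2048)
    with torch.no_grad():
        acts = torch.randn(4, 16, 256)  # (batch, seq, d_model) toy activations
        modified = unlearn_intervention(acts, sae, feature_ids=[7, 1234])
    print(modified.shape)  # torch.Size([4, 16, 256])
```

Clamping only where a feature is already active is what keeps the intervention narrow: activations on unrelated text pass through (up to SAE reconstruction error), which is consistent with the summary's observation of minimal side-effects when a single well-chosen feature is targeted.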