Applying sparse autoencoders to unlearn knowledge in language models
by Eoin Farrell, Yeu-Tong Lau, Arthur Conmy
First submitted to arXiv on: 25 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here. |
Medium | GrooveSquid.com (original content) | This research paper investigates the use of sparse autoencoders (SAEs) to remove knowledge from language models. The study uses the biology subset of the Weapons of Mass Destruction Proxy (WMDP) dataset and evaluates two language models, gemma-2b-it and gemma-2-2b-it. The results show that intervening on individual, interpretable biology-related SAE features can unlearn a subset of WMDP-Bio questions with minimal side-effects in domains other than biology (a minimal sketch of this kind of feature intervention follows the table). However, the paper also finds that intervening on multiple SAE features simultaneously can unlearn multiple different topics, but with side-effects similar to or larger than those of existing unlearning techniques. |
Low | GrooveSquid.com (original content) | This research looks at whether we can use a type of artificial intelligence tool called a sparse autoencoder (SAE) to remove certain knowledge from language models. The researchers test this idea on two language models and find that it works well for removing biology-related knowledge while mostly leaving other topics alone. They also find that targeting several SAE features at the same time can unlearn several different topics, but this causes unwanted side effects as large as, or larger than, those of existing methods. |
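To make the medium-difficulty summary concrete, below is a minimal, self-contained sketch of the general technique it describes: encode a model activation with a trained SAE, clamp a few chosen feature activations to a negative value wherever they fire, and decode back. The class name, dimensions, feature indices, and clamp value here are all illustrative assumptions, not the paper's actual models, hyperparameters, or code; in practice the SAE is pretrained and the intervention is applied inside a transformer forward hook.

```python
import torch
import torch.nn as nn

# Hypothetical SAE with ReLU encoder and linear decoder; in the paper's
# setting such an SAE would be pretrained on residual-stream activations.
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu((x - self.b_dec) @ self.W_enc + self.b_enc)

    def decode(self, f: torch.Tensor) -> torch.Tensor:
        return f @ self.W_dec + self.b_dec


def unlearn_intervention(x, sae, feature_ids, clamp_value=-20.0):
    """Clamp selected SAE features to a negative value wherever they fire,
    then reconstruct the activation. feature_ids and clamp_value are
    illustrative; selecting which features to target and how strongly to
    clamp them is an empirical choice."""
    f = sae.encode(x)
    for i in feature_ids:
        active = f[..., i] > 0  # only intervene where the feature fires
        f[..., i] = torch.where(
            active, torch.full_like(f[..., i], clamp_value), f[..., i]
        )
    return sae.decode(f)


if __name__ == "__main__":
    torch.manual_seed(0)
    sae = SparseAutoencoder(d_model=256, d_sae=2048)
    with torch.no_grad():
        acts = torch.randn(4, 16, 256)  # (batch, seq, d_model) toy activations
        modified = unlearn_intervention(acts, sae, feature_ids=[7, 1234])
    print(modified.shape)  # torch.Size([4, 16, 256])
```

Clamping only where a feature is already active is what keeps the intervention narrow: activations on unrelated text pass through (up to SAE reconstruction error), which is consistent with the summary's observation of minimal side-effects when a single well-chosen feature is targeted.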