Summary of Not All Language Model Features Are One-Dimensionally Linear, by Joshua Engels et al.
Not All Language Model Features Are One-Dimensionally Linear
by Joshua Engels, Eric J. Michaud, Isaac Liao, Wes Gurnee, Max Tegmark
First submitted to arXiv on: 23 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | Recent research suggests that language models perform computation by manipulating one-dimensional representations of concepts (“features”) in activation space. This paper asks whether some representations are instead inherently multi-dimensional. The authors develop a rigorous definition of irreducible multi-dimensional features, based on whether a feature can be decomposed into independent or non-co-occurring lower-dimensional features, and design a scalable method that uses sparse autoencoders to find such features in GPT-2 and Mistral 7B. The method uncovers strikingly interpretable examples, including circular representations of the days of the week and the months of the year, and the authors identify tasks where these exact circles are used to solve modular-arithmetic problems over days and months. Intervention experiments on Mistral 7B and Llama 3 8B provide evidence that the circular features are the fundamental unit of computation in these tasks, and the authors also examine the continuity of Mistral 7B’s days-of-the-week feature. A toy sketch of this kind of circular computation appears below the table. |
Low | GrooveSquid.com (original content) | This paper looks at how language models work. Some researchers think models compute only with one-dimensional concepts, but this study suggests that some concepts are represented in more than one dimension at once. The team develops a way to define and find these multi-dimensional features and locates them in models such as GPT-2 and Mistral 7B. The features include circles representing the days of the week and the months of the year, which the models use to answer questions about day and month arithmetic. Experiments that directly change these circles show that the models really do use them for this kind of computation, and the study also looks at how smooth (continuous) the week circle is in Mistral 7B. |
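
To make the circular representation concrete, here is a minimal toy sketch. It is not the authors' code, and all names in it (DAYS, embed, rotate, decode) are invented for this illustration: it places the seven weekdays on a circle and shows that "move forward k days", i.e. addition modulo 7, becomes a simple rotation of that circle, which is the style of computation the paper's intervention experiments probe.

```python
import numpy as np

# Toy illustration: put the 7 weekdays on a circle so that "k days later"
# (addition mod 7) is a rotation. This is hand-built, not extracted from a model.
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def embed(day: int, n: int = 7) -> np.ndarray:
    """Map day index d to the 2-D point (cos(2*pi*d/n), sin(2*pi*d/n))."""
    angle = 2 * np.pi * day / n
    return np.array([np.cos(angle), np.sin(angle)])

def rotate(point: np.ndarray, steps: int, n: int = 7) -> np.ndarray:
    """Advance a point on the circle by `steps` positions via a rotation matrix."""
    angle = 2 * np.pi * steps / n
    rot = np.array([[np.cos(angle), -np.sin(angle)],
                    [np.sin(angle),  np.cos(angle)]])
    return rot @ point

def decode(point: np.ndarray, n: int = 7) -> int:
    """Read a day back out by finding the nearest of the n circle points."""
    return int(np.argmin([np.linalg.norm(point - embed(d, n)) for d in range(n)]))

# "Two days after Saturday" -> rotate Saturday's point by 2 steps -> Monday.
answer = decode(rotate(embed(DAYS.index("Saturday")), steps=2))
print(DAYS[answer])  # Monday
```

In the paper itself, the analogous circles are discovered inside GPT-2 and Mistral 7B activations with sparse autoencoders rather than constructed by hand; the sketch only shows why a circle is a natural, irreducibly two-dimensional representation for modular arithmetic over days or months.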
Keywords
» Artificial intelligence » GPT » Llama