Summary of Using Degeneracy in the Loss Landscape for Mechanistic Interpretability, by Lucius Bushnaq et al.
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
First submitted to arXiv on: 17 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract, available on arXiv. |
Medium | GrooveSquid.com (original content) | This paper concerns mechanistic interpretability, which aims to reverse engineer neural networks by analyzing their weights and activations. The authors identify three ways in which network parameters can be degenerate: linear dependence between the activations in a layer, linear dependence between the gradients passed back to a layer, and ReLU neurons that fire on the same subset of data points. They also argue that modular networks are likely to be more degenerate, and propose a metric for identifying modules in a network. To address these degeneracies, they introduce the Interaction Basis, a technique for obtaining a representation that is invariant to degeneracies arising from linear dependence of activations or gradients, which should make interactions between layers sparser and the network more interpretable. (A short code sketch illustrating these degeneracy checks follows the table.) |
Low | GrooveSquid.com (original content) | This paper tries to figure out how neural networks work by looking at their internal parts. Right now that is hard, because many of those parts aren’t actually needed for the network’s job. The authors describe three ways such redundant parts can show up: when a layer’s activations are linearly related, when the gradients flowing back into a layer are linearly related, and when several ReLU neurons switch on and off for exactly the same inputs. They also propose a way to spot “modules” inside a network, since modular networks tend to have more of this redundancy. To deal with it, they introduce something called the Interaction Basis, a change of coordinates that can make the network’s internal workings easier to understand. |
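
As a rough illustration of the three kinds of degeneracy described in the medium-difficulty summary, the sketch below inspects the hidden layer of a toy MLP on a random batch of inputs. This is not the paper’s code: the model, the loss, the rank tolerance, and the hook-based bookkeeping are all illustrative assumptions.

```python
# Hedged sketch (not the authors' implementation): probe a toy MLP's hidden
# layer for the three kinds of degeneracy described above.
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, d_in, d_hidden, d_out = 256, 8, 32, 4
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
x = torch.randn(batch, d_in)

# Capture the hidden layer's pre-activations so we can inspect both the
# activations and the gradients passed back into the layer.
pre_acts = {}
def hook(module, inputs, output):
    output.retain_grad()          # keep .grad on this non-leaf tensor
    pre_acts["hidden"] = output
model[0].register_forward_hook(hook)

loss = model(x).pow(2).sum()      # arbitrary scalar loss, just to get gradients
loss.backward()

z = pre_acts["hidden"]            # (batch, d_hidden) pre-activations
acts = torch.relu(z)              # post-ReLU activations
grads = z.grad                    # (batch, d_hidden) gradients passed back

def numerical_rank(m, tol=1e-5):
    """Count singular values above tol * largest, i.e. the effective rank."""
    s = torch.linalg.svdvals(m.detach())
    return int((s > tol * s[0]).sum())

# 1) Linear dependence between activations in a layer:
#    effective rank < d_hidden means some activation directions are redundant.
print("activation rank:", numerical_rank(acts), "of", d_hidden)

# 2) Linear dependence between gradients passed back to a layer.
print("gradient rank:  ", numerical_rank(grads), "of", d_hidden)

# 3) ReLU neurons that fire on the same subset of data points:
#    identical firing patterns across the batch signal interchangeable neurons.
firing = (z > 0)                  # (batch, d_hidden) boolean firing pattern
patterns = {tuple(col.tolist()) for col in firing.T}
print("distinct firing patterns:", len(patterns), "of", d_hidden)
```

In this framing, an effective rank below the layer width, or fewer distinct firing patterns than neurons, would flag directions or neurons that could be merged or removed without changing the network’s behaviour on the sampled inputs.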
Keywords
» Artificial intelligence » Neural network » ReLU