Summary of Using Degeneracy in the Loss Landscape for Mechanistic Interpretability, by Lucius Bushnaq et al.
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability
by Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn
First submitted to arXiv on: 17 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract, available on arXiv. |
Medium | GrooveSquid.com (original content) | This paper concerns mechanistic interpretability, which aims to reverse engineer neural networks by analyzing their weights and activations. The authors identify three ways in which network parameters can be degenerate: linear dependence between the activations in a layer, linear dependence between the gradients passed back to a layer, and ReLU neurons that fire on the same subset of data points. They also argue that modular networks are likely to be more degenerate, and propose a metric for identifying modules in a network. To address these degeneracies, they introduce the Interaction Basis, a technique for obtaining a representation that is invariant to degeneracies arising from linear dependence of activations or gradients, which should make interactions between layers sparser and the network more interpretable. (A short code sketch illustrating these degeneracy checks follows the table.) |
Low | GrooveSquid.com (original content) | This paper tries to figure out how neural networks work by looking at their internal parts. Right now that is hard, because many of those parts aren’t actually needed for the network’s job. The authors describe three ways such redundant parts can show up: when a layer’s activations are linearly related, when the gradients flowing back into a layer are linearly related, and when several ReLU neurons switch on and off for exactly the same inputs. They also propose a way to spot “modules” inside a network, since modular networks tend to have more of this redundancy. To deal with it, they introduce something called the Interaction Basis, a change of coordinates that can make the network’s internal workings easier to understand. |
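
As a rough illustration of the three kinds of degeneracy described in the medium-difficulty summary, the sketch below inspects the hidden layer of a toy MLP on a random batch of inputs. This is not the paper’s code: the model, the loss, the rank tolerance, and the hook-based bookkeeping are all illustrative assumptions.

```python
# Hedged sketch (not the authors' implementation): probe a toy MLP's hidden
# layer for the three kinds of degeneracy described above.
import torch
import torch.nn as nn

torch.manual_seed(0)

batch, d_in, d_hidden, d_out = 256, 8, 32, 4
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, d_out))
x = torch.randn(batch, d_in)

# Capture the hidden layer's pre-activations so we can inspect both the
# activations and the gradients passed back into the layer.
pre_acts = {}
def hook(module, inputs, output):
    output.retain_grad()          # keep .grad on this non-leaf tensor
    pre_acts["hidden"] = output
model[0].register_forward_hook(hook)

loss = model(x).pow(2).sum()      # arbitrary scalar loss, just to get gradients
loss.backward()

z = pre_acts["hidden"]            # (batch, d_hidden) pre-activations
acts = torch.relu(z)              # post-ReLU activations
grads = z.grad                    # (batch, d_hidden) gradients passed back

def numerical_rank(m, tol=1e-5):
    """Count singular values above tol * largest, i.e. the effective rank."""
    s = torch.linalg.svdvals(m.detach())
    return int((s > tol * s[0]).sum())

# 1) Linear dependence between activations in a layer:
#    effective rank < d_hidden means some activation directions are redundant.
print("activation rank:", numerical_rank(acts), "of", d_hidden)

# 2) Linear dependence between gradients passed back to a layer.
print("gradient rank:  ", numerical_rank(grads), "of", d_hidden)

# 3) ReLU neurons that fire on the same subset of data points:
#    identical firing patterns across the batch signal interchangeable neurons.
firing = (z > 0)                  # (batch, d_hidden) boolean firing pattern
patterns = {tuple(col.tolist()) for col in firing.T}
print("distinct firing patterns:", len(patterns), "of", d_hidden)
```

In this framing, an effective rank below the layer width, or fewer distinct firing patterns than neurons, would flag directions or neurons that could be merged or removed without changing the network’s behaviour on the sampled inputs.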
Keywords
» Artificial intelligence » Neural network » ReLU