Summary of Using Degeneracy in the Loss Landscape for Mechanistic Interpretability, by Lucius Bushnaq et al.


Using Degeneracy in the Loss Landscape for Mechanistic Interpretability

by Lucius Bushnaq, Jake Mendel, Stefan Heimersheim, Dan Braun, Nicholas Goldowsky-Dill, Kaarel Hänni, Cindy Wu, Marius Hobbhahn

First submitted to arXiv on: 17 May 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper contributes to mechanistic interpretability, which aims to reverse engineer neural networks by analyzing their weights and activations. The authors identify three sources of parameter degeneracy: linear dependence between the activations in a layer, linear dependence between the gradients passed back to a layer, and ReLU neurons that fire on the same subset of data points. They also argue that modular networks are likely to be more degenerate and propose a metric for identifying modules. To address these degeneracies, the authors introduce the Interaction Basis, a technique that produces a representation invariant to them, which could yield a more interpretable network with sparser interactions. A small code sketch of the first kind of degeneracy follows these summaries.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper tries to figure out how neural networks work by looking at their internal parts. That is hard right now because many of those parts do not actually matter for what the network computes. The authors found three ways this redundancy shows up: when the activations in a layer are linearly related to each other, when the gradients flowing back into a layer are linearly related to each other, and when groups of ReLU neurons always switch on for the same inputs. They also came up with a way to spot when a network is built out of "modules", since modular networks tend to have more of this redundancy. To deal with it, they created something called the Interaction Basis, which rewrites the network's internals in a form that ignores the redundancy and is easier to understand.

Keywords

» Artificial intelligence  » Neural network  » ReLU