
Summary of Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs, by Kola Ayonrinde et al.


Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs

by Kola Ayonrinde, Michael T. Pearce, Lee Sharkey

First submitted to arXiv on: 15 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Information Theory (cs.IT)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper presents an information-theoretic framework that treats Sparse Autoencoders (SAEs) as lossy compression algorithms for explaining neural activations. The authors argue that naively optimizing SAEs for reconstruction loss and sparsity alone leads to a preference for extremely wide and sparse SAEs, which may not yield the best explanations. Instead, they invoke the Minimum Description Length (MDL) principle to motivate explanations of activations that are both concise and accurate (a toy sketch of this description-length trade-off appears after the summaries below). As a worked example, they train SAEs on MNIST handwritten digits and find that features representing significant line segments are optimal under this criterion. The framework also suggests new hierarchical SAE architectures that provide more concise explanations.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper explores how to use Sparse Autoencoders (SAEs) to understand what neural networks are doing. Today, SAEs are usually trained just to reconstruct activations well while keeping their features sparse, but that alone doesn't always produce the best explanations. The authors propose a new way of thinking about SAEs based on how much information an explanation needs: a good explanation should be both short and accurate. They use this idea to produce more concise and accurate explanations of what neural networks are doing. They tested the approach by training SAEs on handwritten-digit data (MNIST) and found that it worked well. This could be useful for understanding why neural networks make certain decisions.
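
To make the description-length idea above concrete, here is a minimal, hypothetical sketch, not the authors' implementation: it scores an SAE explanation by the bits needed to name which features fired, the bits for their coefficients, and the bits for the remaining reconstruction error. The function name, the bit budgets (coeff_bits, error_bits_per_unit), and the dictionary sizes below are illustrative assumptions, not values taken from the paper.

import numpy as np

def description_length_bits(codes, residual, dict_size,
                            coeff_bits=8.0, error_bits_per_unit=4.0):
    """Toy description length (in bits) for a batch of SAE explanations.

    codes:    (batch, dict_size) sparse feature activations.
    residual: (batch, d_model) reconstruction error x - x_hat.
    """
    active = codes != 0
    n_active = active.sum()                     # total active features in the batch
    index_bits = n_active * np.log2(dict_size)  # bits to say WHICH features fired
    value_bits = n_active * coeff_bits          # bits to encode each coefficient
    error_bits = error_bits_per_unit * np.abs(residual).sum()  # crude residual cost
    return float(index_bits + value_bits + error_bits)

# Usage: a very wide, very sparse dictionary pays log2(dict_size) bits per
# active index, so it is not automatically "cheaper" than a narrower SAE
# with slightly denser codes, even at equal reconstruction error.
rng = np.random.default_rng(0)
residual = 0.01 * rng.normal(size=(32, 64))     # pretend both SAEs reconstruct equally well
narrow = rng.normal(size=(32, 512)) * (rng.random((32, 512)) < 0.05)
wide = rng.normal(size=(32, 65536)) * (rng.random((32, 65536)) < 0.001)
print(description_length_bits(narrow, residual, 512))
print(description_length_bits(wide, residual, 65536))

The only point the sketch is meant to illustrate is that description length penalizes dictionary width through the index term, which is the intuition behind preferring concise explanations over maximally wide, maximally sparse ones.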

Keywords

* Artificial intelligence