
Summary of Transformer Normalisation Layers and the Independence of Semantic Subspaces, by Stephen Menary et al.


Transformer Normalisation Layers and the Independence of Semantic Subspaces

by Stephen Menary, Samuel Kaski, Andre Freitas

First submitted to arxiv on: 25 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Recent work has shown that transformer models can solve contextual reasoning tasks by internally executing computational graphs called circuits, which often rely on attention to match information from different subspaces of the representation. The authors of this paper investigate the notion of semantic subspaces: independent subspaces of the latent representation that can fully determine an attention distribution. The study reveals that Pre-Norm, the placement of normalization layers used in state-of-the-art transformers, undermines the independence of these subspaces unless the model learns a strict representation structure of orthogonal spheres. This is because the shared normalization factor causes linear subspaces to interfere with one another, which can lead to a phenomenon called circuit collapse, in which attention arbitrarily shifts to a different token and the circuit breaks. The authors theoretically analyze this issue and predict that it can be triggered by random noise affecting the L2-norms of the query/key/value vectors. Empirical experiments confirm this: models trained for mathematical addition exhibit roughly a 1% rate of circuit collapse when their norms are artificially perturbed by 10% or less. The study also compares Pre-Norm with QKV-Norm, which places normalization after the attention head’s linear operators, finding that QKV-Norm achieves comparable in-distribution performance but worse out-of-distribution results.
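
To make the difference between the two placements concrete, here is a minimal NumPy sketch of a single attention head under each scheme. This is not the authors' implementation: the use of LayerNorm, the dimensions, and the names x, Wq, Wk, Wv are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) contrasting the two
# normalisation placements described above. x is a (seq_len, d_model)
# block input; Wq, Wk, Wv are the attention head's linear operators.
import numpy as np

def layer_norm(z, eps=1e-5):
    # Normalise each row to zero mean and unit variance. The shared
    # normalisation factor is what couples otherwise-independent subspaces.
    mu = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def pre_norm_head(x, Wq, Wk, Wv):
    # Pre-Norm: normalise the residual-stream input *before* the
    # query/key/value projections, so every subspace of x shares
    # one normalisation factor.
    h = layer_norm(x)
    return attention(h @ Wq, h @ Wk, h @ Wv)

def qkv_norm_head(x, Wq, Wk, Wv):
    # QKV-Norm: project first, then normalise q, k and v separately,
    # i.e. after the head's linear operators.
    q, k, v = layer_norm(x @ Wq), layer_norm(x @ Wk), layer_norm(x @ Wv)
    return attention(q, k, v)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 6, 16, 8
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))

out_pre, attn_pre = pre_norm_head(x, Wq, Wk, Wv)
out_qkv, attn_qkv = qkv_norm_head(x, Wq, Wk, Wv)
print(attn_pre.shape, attn_qkv.shape)  # (6, 6) attention maps for each scheme
```

The only difference between the two functions is where layer_norm is applied: before the projections (Pre-Norm) or after them (QKV-Norm), which is the placement contrast the paper studies.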

Low Difficulty Summary (written by GrooveSquid.com, original content)
Transformer models can solve complex problems by using internal “circuits” that rely on attention. Researchers have found that these circuits can be unstable and collapse when small changes make the model shift its attention to the wrong part of the information. They think this happens because of the way normalization is done in the model, which can make it hard for the model to learn a clean structure. In their study, the authors compared two ways of placing this normalization step, called Pre-Norm and QKV-Norm. They found that while both work similarly on familiar tasks, QKV-Norm does worse when the inputs look different from the ones the model was trained on.
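
As a toy illustration of the kind of perturbation test mentioned in the summaries above, the sketch below rescales query and key vectors by random factors of up to ±10% and counts how often the most-attended token changes, a crude stand-in for circuit collapse. The vectors here are random rather than taken from a trained model, and only the query/key norms are perturbed (the attention distribution does not depend on the value vectors), so the measured rate is purely illustrative.

```python
# Toy perturbation probe (not the authors' experiment): rescale each
# query/key vector's norm by up to +/-10% and check whether any query's
# most-attended token changes.
import numpy as np

def softmax(scores):
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

def attended_tokens(q, k):
    # Index of the token each query position attends to most strongly.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores).argmax(axis=-1)

rng = np.random.default_rng(1)
seq_len, d_head, trials = 8, 16, 2000
q = rng.normal(size=(seq_len, d_head))
k = rng.normal(size=(seq_len, d_head))

baseline = attended_tokens(q, k)
collapses = 0
for _ in range(trials):
    # Random per-vector norm scaling of up to +/-10%, matching the
    # perturbation size quoted in the summary.
    sq = 1 + rng.uniform(-0.1, 0.1, size=(seq_len, 1))
    sk = 1 + rng.uniform(-0.1, 0.1, size=(seq_len, 1))
    if not np.array_equal(attended_tokens(q * sq, k * sk), baseline):
        collapses += 1

print(f"argmax attention changed in {collapses / trials:.1%} of trials")
```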

Keywords

* Artificial intelligence  * Attention  * Token  * Transformer