
Summary of “When Attention Sink Emerges in Language Models: An Empirical View,” by Xiangming Gu et al.


When Attention Sink Emerges in Language Models: An Empirical View

by Xiangming Gu, Tianyu Pang, Chao Du, Qian Liu, Fengzhuo Zhang, Cunxiao Du, Ye Wang, Min Lin

First submitted to arXiv on: 14 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates the attention sink phenomenon in language models (LMs), in which a disproportionate share of attention is assigned to the first token even when it is not semantically important. Attention sinks have been widely exploited in applications such as streaming generation and model quantization, yet a deeper understanding of the phenomenon has been lacking. The authors show that attention sinks emerge during pre-training and appear universally in LMs across a range of inputs, including small models. Investigating the factors behind their emergence, they find it is tied to choices such as the loss function and the training data distribution. Notably, the attention sink behaves more like a key bias, storing extra attention scores that do not contribute to the value computation. Finally, the authors explore alternative attention operations, showing that sigmoid attention without normalization prevents attention sinks from emerging in LMs with up to 1B parameters.
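To make the two attention operations mentioned above concrete, here is a minimal, self-contained sketch (not code from the paper; the function names, shapes, and random data are illustrative assumptions). It contrasts standard softmax attention, whose attention rows must sum to one and can therefore pile surplus attention onto a sink token, with the unnormalized sigmoid attention the paper explores, which removes that per-row constraint.

```python
# Illustrative sketch only -- not the authors' implementation.
import numpy as np

def softmax_attention(q, k, v):
    # Scores are normalized across keys, so each query must distribute
    # exactly 1.0 of attention; "leftover" mass can pile up on one token
    # (e.g. the first token), which is the attention-sink behaviour.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def sigmoid_attention(q, k, v):
    # Each query-key score is squashed independently; rows no longer sum
    # to 1, so there is no normalization pressure to dump surplus
    # attention onto a sink token.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = 1.0 / (1.0 + np.exp(-scores))
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query positions, head dimension 8
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))

_, w_soft = softmax_attention(q, k, v)
_, w_sig = sigmoid_attention(q, k, v)
print("softmax rows sum to:", w_soft.sum(axis=-1))  # always 1.0
print("sigmoid rows sum to:", w_sig.sum(axis=-1))   # unconstrained
```

Running the script prints attention rows that sum to exactly 1.0 for softmax and unconstrained sums for sigmoid, which is the intuition behind why removing the normalization removes the pressure to park leftover attention on a sink token.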

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about a problem called “attention sink” in computer language models. It’s a bit like how our brains sometimes focus too much on the first thing we hear or see, even if it isn’t that important. The authors looked at this phenomenon and found that it happens in many different types of language models, even small ones. They wanted to understand why it happens, and discovered that attention sink is related to how the model is trained and the kind of data it learns from. This understanding can help us build language models that are more accurate and efficient.

Keywords

» Artificial intelligence  » Attention  » Loss function  » Quantization  » Sigmoid  » Token