Massive Activations in Large Language Models

by Mingjie Sun, Xinlei Chen, J. Zico Kolter, Zhuang Liu

First submitted to arXiv on: 27 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates an intriguing phenomenon in Large Language Models (LLMs): a tiny number of activations take on extremely large values, sometimes more than 100,000 times larger than the typical activation. Dubbed “massive activations,” these values are shown to be widespread across LLMs of different families and sizes, and the study characterizes where they occur. Surprisingly, massive activations stay largely constant regardless of the input, and they function as essential bias terms inside the model. Their presence causes attention probabilities to concentrate on the corresponding tokens and gives rise to implicit bias terms in the self-attention output (a simple way to probe for such activations is sketched after these summaries). The research also examines similar phenomena in Vision Transformers. This work has implications for understanding the behavior and performance of these models.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This study looks at something surprising inside Large Language Models (like LLaMA or GPT). These models contain a few special “activations” that are far bigger than the rest, sometimes 100,000 times bigger! The researchers worked out where these huge activations show up and what they do. They discovered that these massive activations act like fixed, built-in values that help the model decide which words to pay attention to. This is important because understanding them can help us create even more powerful language models in the future.
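
To make the phenomenon concrete, here is a minimal sketch of how one might scan a model’s hidden states for such outliers. It assumes a Hugging Face Transformers causal LM; the model name ("gpt2") and the 1,000x outlier threshold are illustrative choices, not details taken from the paper.

    # Minimal sketch (not the paper's code): scan each layer's hidden states
    # for activation values that dwarf the typical magnitude in that layer.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # illustrative; the paper studies a range of LLM families
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer("Massive activations are rare but enormous.", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states: one (batch, seq_len, hidden_dim) tensor per layer
    for layer_idx, h in enumerate(out.hidden_states):
        abs_h = h.abs()
        max_val = abs_h.max().item()
        median_val = abs_h.median().item()
        if max_val > 1000 * median_val:  # illustrative outlier threshold
            hidden_dim = abs_h.shape[2]
            flat_idx = abs_h.argmax().item()  # batch size is 1 here
            token_idx = (flat_idx // hidden_dim) % abs_h.shape[1]
            dim_idx = flat_idx % hidden_dim
            print(f"layer {layer_idx}: max |activation| = {max_val:.1f} "
                  f"(median {median_val:.4f}) at token {token_idx}, dim {dim_idx}")

On a model that exhibits the phenomenon, a probe like this flags only a handful of (token, dimension) positions per layer; the threshold may need tuning per model, since the 100,000x figure above refers to the most extreme cases.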

Keywords

* Artificial intelligence
* Attention
* BERT
* Self-attention