Summary of "Transformers need glasses! Information over-squashing in language tasks" by Federico Barbero et al.
Transformers need glasses! Information over-squashing in language tasks
by Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
First submitted to arXiv on: 6 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the paper’s original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | The paper investigates how information propagates within decoder-only Transformers, the architecture behind many large language models (LLMs). It analyzes the representation of the last token in the final layer and shows that certain distinct input sequences can produce arbitrarily close representations there. The effect is exacerbated by the low-precision floating-point formats used in modern LLMs and leads to errors on tasks such as counting and copying (a small numeric sketch of this precision effect follows the table). The paper also shows how decoder-only Transformer language models can lose sensitivity to specific tokens in the input, a phenomenon related to the over-squashing issue seen in graph neural networks. Empirical evidence on contemporary LLMs supports these claims, and simple solutions are proposed to alleviate the problems. |
Low | GrooveSquid.com (original content) | This study looks at how information moves through a special type of computer model called a decoder-only Transformer. These models are important because they help make many large language models work. The researchers analyzed what happens when different input sequences go into the model and found that some inputs can make the model produce very similar results. This is bad news because it means the model can’t tell these different inputs apart, which leads to mistakes in tasks like counting or copying. They also discovered that the model can stop paying attention to certain parts of the input, something that happens with other types of computer models too. The study shows how this works on real-world language models and suggests ways to fix the problem. |
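To make the precision point in the medium summary concrete, here is a minimal, hypothetical sketch (not the paper's code or experiments). It assumes the last token attends roughly uniformly over a prompt of repeated tokens, so its representation reduces to a simple mean of values, and shows that casting that mean to bfloat16, a low-precision format commonly used to run modern LLMs, erases the difference between two prompts that differ by a single token.

```python
import torch

# Hypothetical toy model of the effect (not the paper's code): if the last
# token attends roughly uniformly over the prompt, its representation is
# just the mean of the value vectors. A prompt of n ones and a prompt of
# n ones plus one trailing zero then differ only by 1/(n + 1), a gap that
# shrinks as n grows and disappears entirely once rounded to bfloat16.

def mean_repr(num_ones: int, num_zeros: int) -> torch.Tensor:
    values = torch.cat([torch.ones(num_ones), torch.zeros(num_zeros)])
    return values.mean()  # stand-in for a uniform-attention readout

n = 1000
all_ones = mean_repr(n, 0)    # exactly 1.0
with_zero = mean_repr(n, 1)   # n / (n + 1) ≈ 0.999

print((all_ones - with_zero).item())                # small but nonzero in float32
print(all_ones.bfloat16() == with_zero.bfloat16())  # tensor(True): the gap rounds away
```

In float32 the two prompts remain (barely) distinguishable, but in bfloat16 they map to the same value; this is the kind of representational collapse the paper argues causes counting and copying errors on long inputs.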
Keywords
» Artificial intelligence » Attention » Decoder » Precision » Token » Transformer