Summary of "Transformers need glasses! Information over-squashing in language tasks" by Federico Barbero et al.
Transformers need glasses! Information over-squashing in language tasks
by Federico Barbero, Andrea Banino, Steven Kapturowski, Dharshan Kumaran, João G.M. Araújo, Alex Vitvitskyi, Razvan Pascanu, Petar Veličković
First submitted to arXiv on: 6 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the paper’s original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | The paper investigates how information propagates within decoder-only Transformers, the architecture behind many large language models (LLMs). It analyzes the representation of the last token in the final layer and shows that certain distinct input sequences can produce arbitrarily close representations there. The effect is exacerbated by the low-precision floating-point formats used in modern LLMs and leads to errors on tasks such as counting and copying (a small numeric sketch of this precision effect follows the table). The paper also shows how decoder-only Transformer language models can lose sensitivity to specific tokens in the input, a phenomenon related to the over-squashing issue seen in graph neural networks. Empirical evidence on contemporary LLMs supports these claims, and simple solutions are proposed to alleviate the problems. |
Low | GrooveSquid.com (original content) | This study looks at how information moves through a special type of computer model called a decoder-only Transformer. These models are important because they help make many large language models work. The researchers analyzed what happens when different input sequences go into the model and found that some inputs can make the model produce very similar results. This is bad news because it means the model can’t tell these different inputs apart, which leads to mistakes in tasks like counting or copying. They also discovered that the model can stop paying attention to certain parts of the input, something that happens with other types of computer models too. The study shows how this works on real-world language models and suggests ways to fix the problem. |
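To make the precision point in the medium summary concrete, here is a minimal, hypothetical sketch (not the paper's code or experiments). It assumes the last token attends roughly uniformly over a prompt of repeated tokens, so its representation reduces to a simple mean of values, and shows that casting that mean to bfloat16, a low-precision format commonly used to run modern LLMs, erases the difference between two prompts that differ by a single token.

```python
import torch

# Hypothetical toy model of the effect (not the paper's code): if the last
# token attends roughly uniformly over the prompt, its representation is
# just the mean of the value vectors. A prompt of n ones and a prompt of
# n ones plus one trailing zero then differ only by 1/(n + 1), a gap that
# shrinks as n grows and disappears entirely once rounded to bfloat16.

def mean_repr(num_ones: int, num_zeros: int) -> torch.Tensor:
    values = torch.cat([torch.ones(num_ones), torch.zeros(num_zeros)])
    return values.mean()  # stand-in for a uniform-attention readout

n = 1000
all_ones = mean_repr(n, 0)    # exactly 1.0
with_zero = mean_repr(n, 1)   # n / (n + 1) ≈ 0.999

print((all_ones - with_zero).item())                # small but nonzero in float32
print(all_ones.bfloat16() == with_zero.bfloat16())  # tensor(True): the gap rounds away
```

In float32 the two prompts remain (barely) distinguishable, but in bfloat16 they map to the same value; this is the kind of representational collapse the paper argues causes counting and copying errors on long inputs.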
Keywords
» Artificial intelligence » Attention » Decoder » Precision » Token » Transformer