Self-Attention Limits Working Memory Capacity of Transformer-Based Models
by Dongyu Gong, Hantao Zhang
First submitted to arXiv on: 16 Sep 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The research explores the limitations of Transformer-based large language models (LLMs) on N-back tasks, mirroring human behavioral studies. The study finds that performance drops significantly as N increases. To investigate this phenomenon, the researchers hypothesize that the self-attention mechanism within Transformer-based models is responsible for their working memory capacity limits. By training vanilla decoder-only transformers to perform N-back tasks and analyzing attention scores, they find that attention scores gradually aggregate at the N-back positions over training. This suggests that the model masters the task by learning a strategy of attending to the relationship between the current position and the N-back position. The study also reveals an increase in the total entropy of the attention score matrix as N increases, suggesting that the dispersion of attention scores might be the cause of the capacity limit observed in N-back tasks (a short code sketch of this entropy measure follows the table). This research provides insights into the shared role of attention in both human and artificial intelligence.
Low | GrooveSquid.com (original content) | The paper looks at how well big language models do on certain tasks when they need to remember things from earlier on. It finds that these models get worse at this as they have to remember more things. The researchers think this happens because the model has a limited way of focusing its attention on different parts of what it is reading or writing. They test this idea by training a simpler version of the language model to do these tasks and watching how well it does. They find that the model gets better at the task by learning to focus on the right things. This helps us understand how both humans and computers use attention when they are trying to remember or think about something.
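To make the entropy idea in the medium summary concrete, here is a minimal Python sketch. It is not the authors' code: the Shannon-entropy definition, the toy attention matrices, and the N-back helper are assumptions chosen for illustration. It shows why attention sharply focused on the N-back position has lower total entropy than attention dispersed evenly over all positions.

```python
import numpy as np

def nback_targets(tokens, n):
    # N-back task: at each position i >= n, the correct answer is whether
    # the current token matches the token n steps back.
    return [tokens[i] == tokens[i - n] for i in range(n, len(tokens))]

def attention_entropy(attn):
    # Total Shannon entropy of an attention score matrix.
    # attn: (seq_len, seq_len) row-stochastic matrix (each row sums to 1).
    eps = 1e-12
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=1)
    return row_entropy.sum()

# Toy illustration (hypothetical matrices, not from the paper):
# a peaked pattern, where each query attends mostly to its N-back position,
# versus a fully dispersed (uniform) pattern.
seq_len, n = 8, 2
peaked = np.full((seq_len, seq_len), 1e-3)
for i in range(n, seq_len):
    peaked[i, i - n] = 1.0
peaked /= peaked.sum(axis=1, keepdims=True)

uniform = np.full((seq_len, seq_len), 1.0 / seq_len)

print(attention_entropy(peaked) < attention_entropy(uniform))  # True
```

Under these assumptions, the sketch mirrors the paper's reported trend: as attention scores disperse away from the N-back position, the total entropy of the attention matrix rises.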
Keywords
» Artificial intelligence » Attention » Decoder » Language model » Self attention » Transformer