


Self-Attention Limits Working Memory Capacity of Transformer-Based Models

by Dongyu Gong, Hantao Zhang

First submitted to arXiv on: 16 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Neurons and Cognition (q-bio.NC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
The research examines the limitations of Transformer-based large language models (LLMs) on N-back tasks, a paradigm borrowed from human behavioral studies of working memory. The study finds that performance drops significantly as N increases. To investigate this phenomenon, the researchers hypothesize that the self-attention mechanism within Transformer-based models is responsible for their working memory capacity limits. By training vanilla decoder-only Transformers to perform N-back tasks and analyzing their attention scores, they find that attention scores gradually concentrate on the N-back positions over training. This suggests that the model masters the task by learning a strategy of attending to the relationship between the current position and the position N steps back. The study also reveals an increase in the total entropy of the attention score matrix as N increases, suggesting that the dispersion of attention scores may be the cause of the capacity limit observed in N-back tasks (a small illustrative sketch follows the summaries below). This research provides insights into the shared role of attention in both human and artificial intelligence.
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper looks at how well big language models do on certain tasks when they need to remember things from earlier on. It finds that these models get worse at doing this as they have to remember more things. The researchers think that this is because the way the model focuses its attention on different parts of what it’s reading or writing is limited. They test this idea by training a simpler version of the language model to do these tasks and see how well it does. They find that the model gets better at doing the task by learning to focus on the right things. This helps us understand how both humans and computers use attention when they’re trying to remember or think about something.
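To make the medium-difficulty description more concrete, here is a minimal sketch of the two ingredients it mentions: generating an N-back task and measuring how dispersed an attention score matrix is via its total entropy. This is not the authors' code; the vocabulary size, sequence length, labeling convention, and the row-wise softmax entropy formula are assumptions made purely for illustration.

```python
import numpy as np

def make_nback_batch(batch_size, seq_len, n, vocab_size=8, seed=0):
    """Generate random token sequences with N-back match labels.

    label[t] = 1 if the token at position t equals the token N steps
    earlier, else 0 (positions t < N are labeled 0 by convention).
    """
    rng = np.random.default_rng(seed)
    tokens = rng.integers(0, vocab_size, size=(batch_size, seq_len))
    labels = np.zeros((batch_size, seq_len), dtype=np.int64)
    labels[:, n:] = (tokens[:, n:] == tokens[:, :-n]).astype(np.int64)
    return tokens, labels

def total_attention_entropy(scores):
    """Total entropy of a causal attention score matrix.

    `scores` is a (seq_len, seq_len) matrix of raw attention logits.
    Future positions are masked, each row is softmax-normalized, and the
    Shannon entropies of the rows are summed. More dispersed attention
    yields a higher total entropy.
    """
    seq_len = scores.shape[0]
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    masked = np.where(causal_mask, -np.inf, scores)
    probs = np.exp(masked - masked.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    row_entropy = -(probs * np.log(probs + 1e-12)).sum(axis=-1)
    return row_entropy.sum()

# Example: an attention pattern focused entirely on the 2-back position
# has near-zero total entropy; a uniform pattern does not.
seq_len, n = 12, 2
tokens, labels = make_nback_batch(batch_size=4, seq_len=seq_len, n=n)
print("example tokens:", tokens[0])
print("example labels:", labels[0])

focused = np.full((seq_len, seq_len), -1e9)
np.fill_diagonal(focused[:n], 0.0)        # early rows attend to themselves
for t in range(n, seq_len):
    focused[t, t - n] = 0.0               # all attention on the N-back token

print("total entropy (focused on N-back):", total_attention_entropy(focused))
print("total entropy (uniform):", total_attention_entropy(np.zeros((seq_len, seq_len))))
```

In this toy setup, a pattern that puts all of its attention mass on the N-back position has near-zero total entropy, while a uniform pattern's entropy grows with sequence length, matching the intuition in the summaries that more dispersed attention corresponds to higher entropy.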

Keywords

» Artificial intelligence  » Attention  » Decoder  » Language model  » Self attention  » Transformer