Summary of "What Matters in Transformers? Not All Attention is Needed" by Shwai He et al.
What Matters in Transformers? Not All Attention is Needed
by Shwai He, Guoheng Sun, Zheyu Shen, Ang Li
First submitted to arXiv on: 22 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from whichever version suits you best!
| Summary difficulty | Written by | Summary |
| --- | --- | --- |
| High | Paper authors | The paper’s original abstract, available on its arXiv page |
| Medium | GrooveSquid.com (original content) | This paper investigates redundancy in Transformer-based large language models (LLMs) in order to improve their efficiency for real-world deployment. The authors analyze the similarity of different modules within Transformers, including Blocks, MLP layers, and Attention layers. Surprisingly, they find that a large fraction of attention layers exhibit high similarity and can be pruned without degrading performance: Llama-2-70B, for example, achieved a 48.4% speedup with only a 2.4% performance drop after half of its attention layers were pruned. The authors also propose a method for jointly dropping Attention and MLP layers, which allows more aggressive layer dropping; on the MMLU task, the model retains 90% of its performance when 31 layers are dropped. This work provides valuable insights for future network architecture design. A minimal code sketch of this similarity-based pruning idea follows the table below. |
| Low | GrooveSquid.com (original content) | This paper looks at how to make large language models (LLMs) more efficient and useful in real-life situations. The researchers analyzed different parts of these models to see which ones are redundant. They found that some parts, such as attention layers, can be removed without hurting the model’s performance: by pruning these layers, they sped up the model by about 48% while losing only a little accuracy. The authors also propose a way to drop even more layers at once. This work helps us understand how to design better language models for real-world use. |
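To make the pruning idea above more concrete, here is a minimal sketch of similarity-based redundancy scoring for attention layers, assuming a Hugging Face Llama-style model whose decoder layers expose a `self_attn` submodule. The checkpoint name, the calibration sentence, and the hook helpers are illustrative assumptions, not the authors' actual code, and the paper's exact metric and dropping procedure may differ.

```python
# A minimal sketch (not the authors' code): score each attention sublayer by the
# cosine similarity between the residual stream before and after it. Layers whose
# input and output are nearly identical are candidates for dropping.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

layer_inputs, attn_outputs = {}, {}

def capture_layer_input(idx):
    # Pre-hook: record the hidden states entering decoder layer `idx`.
    def hook(module, args, kwargs):
        hidden = args[0] if args else kwargs["hidden_states"]
        layer_inputs[idx] = hidden.detach()
    return hook

def capture_attn_output(idx):
    # Forward hook: record the attention sublayer's output (before the residual add).
    def hook(module, args, output):
        attn_outputs[idx] = output[0].detach()  # first element is the hidden states
    return hook

handles = []
for i, layer in enumerate(model.model.layers):
    handles.append(layer.register_forward_pre_hook(capture_layer_input(i), with_kwargs=True))
    handles.append(layer.self_attn.register_forward_hook(capture_attn_output(i)))

# A tiny calibration input stands in for a real calibration set.
batch = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    model(**batch)

for h in handles:
    h.remove()

# Cosine similarity between the residual stream before and after each attention
# sublayer; values close to 1 suggest the layer barely changes the representation.
scores = {}
for i in layer_inputs:
    before = layer_inputs[i].float()
    after = (layer_inputs[i] + attn_outputs[i]).float()
    scores[i] = F.cosine_similarity(before, after, dim=-1).mean().item()

# Highest-similarity (most redundant) attention layers come first.
for i, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"attention layer {i}: input/output similarity = {s:.4f}")
```

In this sketch, the attention layers whose inputs and outputs are nearly identical are the natural candidates to drop first; the paper reports that roughly half of the attention layers in Llama-2-70B can be removed along these lines with little performance loss.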
Keywords
» Artificial intelligence » Attention » Llama » Pruning » Transformer