What Matters in Transformers? Not All Attention is Needed

by Shwai He, Guoheng Sun, Zheyu Shen, Ang Li

First submitted to arXiv on: 22 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, written at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates redundancy in Transformer-based large language models (LLMs) with the goal of making them more efficient for real-world deployment. The authors measure the similarity between the input and output of different modules within Transformers, including whole Blocks, MLP layers, and Attention layers. Surprisingly, a large portion of attention layers exhibit high input-output similarity, meaning they barely change their input, and can be pruned without degrading performance. For example, Llama-2-70B achieved a 48.4% speedup with only a 2.4% performance drop after half of its attention layers were pruned. The authors also propose a method that jointly drops Attention and MLP layers, allowing for more aggressive layer dropping; on the MMLU task, 90% of the performance is retained when 31 layers are dropped. This work provides valuable insights for future network architecture design.
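
To make the similarity-based dropping idea concrete, here is a minimal, hypothetical PyTorch sketch. It is not the authors' code: the toy blocks, layer count, random calibration batch, and drop budget (n_drop) are illustrative assumptions; only the core recipe, ranking attention sublayers by the cosine similarity between their input and output on calibration data and skipping the most redundant ones, follows the paper's description.

```python
# Hedged sketch of similarity-based attention-layer dropping (PyTorch).
# Toy model and numbers are assumptions, not the paper's setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyBlock(nn.Module):
    """A simplified pre-norm Transformer block (attention + MLP sublayers)."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.drop_attn = False  # when True, the attention sublayer is skipped

    def forward(self, x):
        if not self.drop_attn:
            h = self.norm1(x)
            x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


def attention_redundancy(blocks, x):
    """Cosine similarity between each attention sublayer's input and output.

    High similarity means the sublayer barely changes its input, so it is a
    candidate for dropping under a similarity-based redundancy metric.
    """
    scores = []
    with torch.no_grad():
        for blk in blocks:
            h = blk.norm1(x)
            attn_out = x + blk.attn(h, h, h, need_weights=False)[0]
            sim = F.cosine_similarity(x.flatten(1), attn_out.flatten(1), dim=-1)
            scores.append(sim.mean().item())
            x = blk(x)  # continue the forward pass with the full block
    return scores


if __name__ == "__main__":
    torch.manual_seed(0)
    blocks = nn.ModuleList(ToyBlock() for _ in range(8))
    calib = torch.randn(4, 16, 64)  # stand-in for a calibration batch
    scores = attention_redundancy(blocks, calib)

    # Drop the attention sublayers with the highest input/output similarity.
    n_drop = 4
    for idx in sorted(range(len(scores)), key=lambda i: -scores[i])[:n_drop]:
        blocks[idx].drop_attn = True
        print(f"dropping attention in block {idx} (similarity={scores[idx]:.3f})")
```

On a real LLM the same scores would be collected from actual calibration batches, and "dropping" a sublayer simply means replacing it with an identity mapping at inference time, as the drop_attn flag does here.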

Low Difficulty Summary (original content by GrooveSquid.com)
This paper looks at how to make large language models (LLMs) more efficient and useful in real-life situations. The researchers analyzed different parts of these models to see which ones are redundant or unnecessary. They found that some parts, such as many of the attention layers, can be removed without hurting the model’s performance. By pruning these layers, they were able to speed up the model by about 48% while losing only a little bit of accuracy. The authors also proposed a way to drop even more layers by removing attention and MLP layers together. This work helps us understand how to design better language models for real-world use.

Keywords

» Artificial intelligence  » Attention  » Llama  » Pruning  » Transformer