Summary of Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations, by Bowen Shen et al.
Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
by Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
First submitted to arXiv on: 8 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract; read it on the arXiv page. |
| Medium | GrooveSquid.com (original content) | This paper presents a novel approach to structured pruning for large language models (LLMs), aiming to reduce computational and memory overhead while preserving model accuracy. The proposed method, TransAct, couples a compact Transformer architecture design with task-agnostic structured pruning to prune the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules. This reduces weights, the KV cache, and attention computation, making end-side (on-device) LLM deployment feasible. Evaluations on LLaMA models demonstrate the optimality of TransAct at high compression ratios with respect to both efficiency and performance. (An illustrative sketch of this kind of intra-module pruning follows the table.) |
| Low | GrooveSquid.com (original content) | This paper is about making big language models smaller so they can run more efficiently on everyday devices. The usual technique, called structured pruning, has limitations when applied directly. The researchers propose a new approach that combines two things: a specially designed compact Transformer architecture and a way to prune the model’s internal activations. This reduces the memory and computation needed, making it possible to use these models in more everyday situations. |
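To give a concrete feel for what pruning a "transitional" activation dimension means, here is a minimal, hypothetical PyTorch sketch that structurally prunes the intermediate width of a Transformer MLP block. The `MLP` class, the L2-norm importance score, and the `keep_ratio` parameter are illustrative assumptions, not taken from the paper; TransAct's actual pruning criterion and its MHA-side pruning are not reproduced here.

```python
# Hypothetical sketch: structured pruning of the intermediate (transitional)
# activation dimension of a Transformer MLP block. Channel importance is
# approximated by the L2 norm of activations on a calibration batch.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_model: int, d_inter: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_inter, bias=False)
        self.down = nn.Linear(d_inter, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))

@torch.no_grad()
def prune_mlp_intermediate(mlp: MLP, calib_x: torch.Tensor, keep_ratio: float) -> MLP:
    """Keep the most active intermediate channels and slice both projections."""
    h = mlp.act(mlp.up(calib_x))                       # (batch, seq, d_inter)
    score = h.flatten(0, -2).norm(dim=0)               # per-channel L2 norm
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices.sort().values  # indices of kept channels

    pruned = MLP(mlp.up.in_features, k)
    pruned.up.weight.copy_(mlp.up.weight[keep, :])     # rows of up-projection
    pruned.down.weight.copy_(mlp.down.weight[:, keep]) # columns of down-projection
    return pruned

# Usage: prune a toy MLP to 50% of its intermediate width.
mlp = MLP(d_model=64, d_inter=256)
x = torch.randn(4, 16, 64)
small = prune_mlp_intermediate(mlp, x, keep_ratio=0.5)
print(small.up.weight.shape, small.down.weight.shape)  # [128, 64] and [64, 128]
```

The point of the sketch is that removing intermediate channels shrinks both the up- and down-projection matrices at once, which is why pruning activations inside a module translates directly into fewer weights and less computation.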
Keywords
» Artificial intelligence » Attention » LLaMA » Multi-head attention » Pruning » Transformer