Summary of Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations, by Bowen Shen et al.
Pruning Large Language Models to Intra-module Low-rank Architecture with Transitional Activations
by Bowen Shen, Zheng Lin, Daren Zha, Wei Liu, Jian Luan, Bin Wang, Weiping Wang
First submitted to arXiv on: 8 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract; read it on the arXiv page. |
| Medium | GrooveSquid.com (original content) | This paper presents a novel approach to structured pruning for large language models (LLMs), aiming to reduce computational and memory overhead while preserving model accuracy. The proposed method, TransAct, couples a compact Transformer architecture design with task-agnostic structured pruning to prune the transitional activations inside the multi-head attention (MHA) and multi-layer perceptron (MLP) modules. This reduces weights, the KV cache, and attention computation, making end-side (on-device) LLM deployment feasible. Evaluations on LLaMA models demonstrate the optimality of TransAct at high compression ratios with respect to both efficiency and performance. (An illustrative sketch of this kind of intra-module pruning follows the table.) |
| Low | GrooveSquid.com (original content) | This paper is about making big language models smaller so they can run more efficiently on everyday devices. The usual technique, called structured pruning, has limitations when applied directly. The researchers propose a new approach that combines two things: a specially designed compact Transformer architecture and a way to prune the model’s internal activations. This reduces the memory and computation needed, making it possible to use these models in more everyday situations. |
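To give a concrete feel for what pruning a "transitional" activation dimension means, here is a minimal, hypothetical PyTorch sketch that structurally prunes the intermediate width of a Transformer MLP block. The `MLP` class, the L2-norm importance score, and the `keep_ratio` parameter are illustrative assumptions, not taken from the paper; TransAct's actual pruning criterion and its MHA-side pruning are not reproduced here.

```python
# Hypothetical sketch: structured pruning of the intermediate (transitional)
# activation dimension of a Transformer MLP block. Channel importance is
# approximated by the L2 norm of activations on a calibration batch.
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, d_model: int, d_inter: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_inter, bias=False)
        self.down = nn.Linear(d_inter, d_model, bias=False)
        self.act = nn.GELU()

    def forward(self, x):
        return self.down(self.act(self.up(x)))

@torch.no_grad()
def prune_mlp_intermediate(mlp: MLP, calib_x: torch.Tensor, keep_ratio: float) -> MLP:
    """Keep the most active intermediate channels and slice both projections."""
    h = mlp.act(mlp.up(calib_x))                       # (batch, seq, d_inter)
    score = h.flatten(0, -2).norm(dim=0)               # per-channel L2 norm
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices.sort().values  # indices of kept channels

    pruned = MLP(mlp.up.in_features, k)
    pruned.up.weight.copy_(mlp.up.weight[keep, :])     # rows of up-projection
    pruned.down.weight.copy_(mlp.down.weight[:, keep]) # columns of down-projection
    return pruned

# Usage: prune a toy MLP to 50% of its intermediate width.
mlp = MLP(d_model=64, d_inter=256)
x = torch.randn(4, 16, 64)
small = prune_mlp_intermediate(mlp, x, keep_ratio=0.5)
print(small.up.weight.shape, small.down.weight.shape)  # [128, 64] and [64, 128]
```

The point of the sketch is that removing intermediate channels shrinks both the up- and down-projection matrices at once, which is why pruning activations inside a module translates directly into fewer weights and less computation.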
Keywords
» Artificial intelligence » Attention » LLaMA » Multi-head attention » Pruning » Transformer