Summary of Efficient LLM Inference Using Dynamic Input Pruning and Cache-Aware Masking, by Marco Federici et al.
Efficient LLM Inference using Dynamic Input Pruning and Cache-Aware Masking
by Marco Federici, Davide Belli, Mart van Baalen, Amir Jalalirad, Andrii Skliar, Bence Major, Markus Nagel, Paul Whatmough
First submitted to arXiv on: 2 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper proposes Dynamic Input Pruning (DIP), a predictor-free dynamic sparsification approach for large language model (LLM) token generation, which is heavily memory-bound because DRAM bandwidth on mobile devices improves much more slowly than compute. Previous work relied on ReLU-activated LLMs, whose activations are inherently sparse, but more recent LLMs use SwiGLU instead, rendering those approaches ineffective. DIP preserves accuracy with minimal fine-tuning and can additionally use lightweight LoRA adapters to regain some of the performance lost to sparsification. The paper also introduces a novel cache-aware masking strategy that considers both the cache state and activation magnitude to increase the cache hit rate, improving LLM token rate on mobile devices (see the illustrative sketch after this table). |
| Low | GrooveSquid.com (original content) | The paper tries to make big language models work better on phones by finding ways to use less memory and run faster. Right now, phone processors are getting more powerful, but memory (RAM) is not keeping up, which makes it hard for these big models to run well on phones. The researchers address this with something called Dynamic Input Pruning (DIP), which helps the language model use less memory without losing its ability to understand and generate text. It also makes smarter use of the phone's cache so the model runs faster. |
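
To make the idea above more concrete, here is a minimal, illustrative sketch of what a predictor-free, cache-aware pruning step could look like: per-token activation magnitudes decide which MLP neurons to keep, with a bonus for neurons whose weights are already cached. This is not the authors' implementation; the function and parameter names (`cache_aware_prune_mask`, `top_k_fraction`, `cache_bonus`) are hypothetical, and PyTorch is assumed.

```python
# Illustrative sketch only -- not the method from the paper, just the general
# idea of magnitude-based dynamic pruning with a cache-aware score adjustment.
import torch

def cache_aware_prune_mask(activations: torch.Tensor,
                           in_cache: torch.Tensor,
                           top_k_fraction: float = 0.25,
                           cache_bonus: float = 1.0) -> torch.Tensor:
    """Return a boolean mask over MLP neurons to keep for the current token.

    activations: (num_neurons,) per-token activation magnitudes (e.g. |gate output|).
    in_cache:    (num_neurons,) bool, True if the neuron's weights are already cached.
    """
    # Base score: how strongly each neuron fires for this token.
    scores = activations.abs()
    # Cache-aware bonus: favour neurons whose weights are already resident in
    # the cache, so fewer weights must be fetched from slow DRAM/flash.
    scores = scores + cache_bonus * scores.mean() * in_cache.float()
    # Keep only the top-scoring fraction of neurons; the rest are skipped.
    k = max(1, int(top_k_fraction * scores.numel()))
    mask = torch.zeros_like(in_cache, dtype=torch.bool)
    mask[torch.topk(scores, k).indices] = True
    return mask

# Hypothetical usage for one token of a SwiGLU-style MLP:
acts = torch.randn(11008)           # per-neuron activation magnitudes
cached = torch.rand(11008) < 0.3    # ~30% of neuron weights currently cached
keep = cache_aware_prune_mask(acts, cached)
```

The `cache_bonus` term captures the trade-off the summary describes: biasing the mask toward already-cached weights raises the cache hit rate (and thus token rate) at the cost of occasionally keeping a slightly less important neuron.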
Keywords
» Artificial intelligence » Fine-tuning » Language model » Large language model » LoRA » Pruning » ReLU » Token