Summary of ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models, by Yash Akhauri et al.
ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
by Yash Akhauri, Ahmed F AbouElhamayed, Jordan Dotzel, Zhiru Zhang, Alexander M Rush, Safeen Huda, Mohamed S Abdelfattah
First submitted to arXiv on: 24 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | ShadowLLM is a novel predictor developed to assess attention head and neuron importance in large language models (LLMs), going beyond magnitude-based pruning criteria (a rough sketch of this idea follows the table). The approach yields over a 15% improvement in end-to-end accuracy compared to prior methods and up to a 20% speed-up over the state-of-the-art DejaVu framework. These gains are validated on Llama-2 and OPT models with up to 30 billion parameters. |
| Low | GrooveSquid.com (original content) | Large language models (LLMs) need efficiency techniques such as quantization and sparsity because of their high power consumption and latency-sensitive deployments. Researchers developed a novel predictor called ShadowLLM that assesses attention head and neuron importance in LLMs, achieving over a 15% improvement in accuracy compared to prior methods. The approach also provides up to a 20% speed-up over the state-of-the-art DejaVu framework. |
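The core idea described in these summaries is a learned predictor that scores attention heads and FFN neurons per input, so that low-importance units can be skipped at inference time (contextual sparsity). The paper's actual predictor architecture and training criterion are not reproduced here; the snippet below is only a minimal, hypothetical PyTorch sketch of that general pattern, with names such as `SparsityPredictor`, `head_scorer`, and `neuron_scorer` invented for illustration.

```python
# Hypothetical sketch of predictor-based contextual sparsity (NOT the paper's code).
# A small predictor looks at the current hidden state and scores each attention
# head / FFN neuron; only the top-scoring units are kept for this input.
import torch
import torch.nn as nn


class SparsityPredictor(nn.Module):
    """Scores attention heads and FFN neurons from the current hidden state."""

    def __init__(self, hidden_dim: int, num_heads: int, ffn_dim: int):
        super().__init__()
        self.head_scorer = nn.Linear(hidden_dim, num_heads)
        self.neuron_scorer = nn.Linear(hidden_dim, ffn_dim)

    def forward(self, hidden_state: torch.Tensor, head_keep: int, neuron_keep: int):
        # Score each unit, then keep only the top-k per input (contextual sparsity).
        head_scores = self.head_scorer(hidden_state)      # (batch, num_heads)
        neuron_scores = self.neuron_scorer(hidden_state)  # (batch, ffn_dim)
        head_mask = torch.zeros_like(head_scores)
        head_mask.scatter_(1, head_scores.topk(head_keep, dim=1).indices, 1.0)
        neuron_mask = torch.zeros_like(neuron_scores)
        neuron_mask.scatter_(1, neuron_scores.topk(neuron_keep, dim=1).indices, 1.0)
        return head_mask, neuron_mask


# Usage: decide which heads/neurons to keep before running a transformer block.
predictor = SparsityPredictor(hidden_dim=512, num_heads=8, ffn_dim=2048)
h = torch.randn(4, 512)  # hidden states for 4 tokens
head_mask, neuron_mask = predictor(h, head_keep=4, neuron_keep=1024)
```

The speed-ups reported in the paper come from skipping the pruned heads and neurons entirely rather than merely zeroing their outputs, so a real implementation would use these masks to gate computation in the attention and FFN kernels.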
Keywords
» Artificial intelligence » Attention » Llama » Pruning » Quantization