
Summary of ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization, by Haoran You et al.


ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

by Haoran You, Yipin Guo, Yichao Fu, Wei Zhou, Huihong Shi, Xiaofan Zhang, Souvik Kundu, Amir Yazdanbakhsh, Yingyan Celine Lin

First submitted to arXiv on: 10 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Large language models (LLMs) perform impressively on language tasks, but deploying them on resource-constrained devices is challenging because of their large parameter counts and reliance on dense multiplications, which lead to high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a solution by replacing costly multiplications with hardware-friendly primitives in both the attention and MLP layers of an LLM. However, existing techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. The proposed method instead accelerates pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models dubbed ShiftAddLLM. Each weight matrix is quantized into binary matrices paired with group-wise scaling factors, and the associated multiplications are reparameterized into shifts between activations and scaling factors, plus queries and adds driven by the binary matrices. To reduce accuracy loss, a multi-objective optimization method minimizes both the weight and the output-activation reparameterization errors. An automated bit allocation strategy further reduces memory usage and latency by assigning bits according to each layer's sensitivity to reparameterization. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements at comparable or lower latency than competitive quantized LLMs.
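To make the reparameterization idea above concrete, here is a minimal NumPy sketch, not the authors' implementation: it approximates a weight matrix with a single ±1 binary matrix plus one scaling factor per row and column group, and rounds those scales to powers of two so that, on hardware, the final scaling would be a bit shift rather than a multiplication. The group size, the sign()-based binarization, and the power-of-two rounding are illustrative assumptions.

```python
# Rough sketch of post-training shift-and-add reparameterization (illustrative only).
import numpy as np

def reparameterize(W, group_size=8):
    """Approximate W (out_dim x in_dim) by a binary matrix B in {-1, +1}
    and a power-of-two scaling factor per row and per column group."""
    out_dim, in_dim = W.shape
    B = np.sign(W).astype(np.int8)   # binary matrix: only adds/subtracts remain
    B[B == 0] = 1
    alphas = np.empty((out_dim, in_dim // group_size))
    for g in range(in_dim // group_size):
        cols = slice(g * group_size, (g + 1) * group_size)
        # Least-squares scale for sign() binarization: mean absolute weight per group
        a = np.abs(W[:, cols]).mean(axis=1)
        # Round each scale to a power of two -> multiplication becomes a bit shift
        alphas[:, g] = 2.0 ** np.round(np.log2(np.maximum(a, 1e-12)))
    return B, alphas

def shiftadd_matvec(x, B, alphas, group_size=8):
    """Approximate y = W @ x using only adds (via B) and shifts (via alphas)."""
    out_dim, in_dim = B.shape
    y = np.zeros(out_dim)
    for g in range(in_dim // group_size):
        cols = slice(g * group_size, (g + 1) * group_size)
        partial = B[:, cols].astype(np.float64) @ x[cols]  # adds/subtracts only
        y += alphas[:, g] * partial  # power-of-two scale == hardware bit shift
    return y

# Toy usage: compare against the dense multiplication it replaces.
rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))
x = rng.normal(size=32)
B, alphas = reparameterize(W)
print(np.max(np.abs(W @ x - shiftadd_matvec(x, B, alphas))))
```

The printed value is the approximation error introduced by the reparameterization; the paper's multi-objective optimization and bit allocation are aimed at keeping exactly this kind of error small while staying multiplication-free.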
Low Difficulty Summary (original content by GrooveSquid.com)
Large language models can do many things well, but they often struggle when run on devices with limited resources. One way to solve this problem is to replace costly calculations with simpler ones that are easier for the device to handle. Each weight matrix is turned into a set of simple “yes” or “no” values, along with a small amount of extra scaling information, and the original complicated calculations are then replaced with cheap operations over these values. This makes the model much faster and uses less memory. To make sure the new way of working doesn’t hurt the model’s ability to do its job well, a special optimization method keeps any mistakes introduced by the replacement as small as possible.

Keywords

» Artificial intelligence  » Attention  » Fine tuning  » Optimization  » Perplexity