

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

by Yaya Sy, Christophe Cerisara, Irina Illina

First submitted to arXiv on: 21 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes Lillama, a compression method for large language models (LLMs) that distills the activations of large weight matrices into low-rank replacement weights. The low-rank weights are initialized with SVD and trained locally with a joint loss that combines teacher and student activations, which accelerates convergence, reduces memory use, and improves compression ratios over existing methods. Lillama compresses the Mixtral-8x7B model within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of the original performance. The method also generalizes to non-transformer architectures, compressing Mamba-3B by 20% while maintaining 99% of its performance.
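
To make the recipe above concrete, here is a minimal PyTorch sketch of its two ingredients: initializing the low-rank factors of a weight matrix with SVD, and a joint loss over teacher and student activations. This is an illustrative sketch, not the authors' implementation; the rank, the weighting `alpha`, and the secondary weight-reconstruction term are assumptions.

```python
import torch
import torch.nn.functional as F

def svd_low_rank_init(weight: torch.Tensor, rank: int):
    """Factor a (d_out, d_in) weight into B @ A with inner dimension `rank`."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    A = Vh[:rank, :]             # (rank, d_in)
    return B, A

def joint_distillation_loss(x, teacher_weight, B, A, alpha=0.5):
    """Match low-rank student activations to the frozen teacher's activations."""
    teacher_act = x @ teacher_weight.T   # teacher features for this layer
    student_act = (x @ A.T) @ B.T        # student features from B @ A
    feature_loss = F.mse_loss(student_act, teacher_act)
    # A term keeping B @ A close to the original weights is one plausible
    # second component; the paper's exact combination may differ.
    weight_loss = F.mse_loss(B @ A, teacher_weight)
    return alpha * feature_loss + (1 - alpha) * weight_loss
```

Because the SVD factors already approximate the original matrix, distillation starts close to the teacher, which is consistent with the fast convergence reported above.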
Low Difficulty Summary (written by GrooveSquid.com, original content)
Lillama is a new way to make big language models smaller and faster. Normally, shrinking these models hurts their ability to understand language, but Lillama keeps them accurate even after they are compressed. It does this with a technique called local distillation: each compressed part of the model learns to copy the behavior of the corresponding part of the original model, so no large amount of new training data is needed. This makes it possible to compress big models like Mixtral-8x7B and Mamba-3B without sacrificing much performance.
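
"Local" here means each compressed layer is fitted against its own teacher layer's outputs on a handful of calibration batches, with no end-to-end fine-tuning. The sketch below illustrates that loop under assumptions; the function names, optimizer, step count, and learning rate are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_io(teacher_layer, calib_batches):
    """Record (input, output) pairs from the frozen teacher layer."""
    return [(x, teacher_layer(x)) for x in calib_batches]

def distill_layer_locally(student_layer, teacher_io, steps=100, lr=1e-4):
    """Fit one compressed layer to mimic its teacher layer in isolation."""
    opt = torch.optim.AdamW(student_layer.parameters(), lr=lr)
    for _ in range(steps):
        for x, y_teacher in teacher_io:
            loss = F.mse_loss(student_layer(x), y_teacher)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_layer
```

Because each layer is trained independently on cached teacher activations, only one layer's weights and optimizer state need to be in memory at a time, which matches the single-GPU, minutes-scale compression described above.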

Keywords

» Artificial intelligence  » Distillation  » Loss function  » Pruning  » Transformer