

Lillama: Large Language Models Compression via Low-Rank Feature Distillation

by Yaya Sy, Christophe Cerisara, Irina Illina

First submitted to arXiv on: 21 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes Lillama, a compression method for large language models (LLMs) that distills the activations of large weight matrices into low-rank replacement weights. The low-rank weights are initialized with SVD and trained locally with a joint loss that combines teacher and student activations, which accelerates convergence, reduces memory use, and improves compression ratios over existing methods. Lillama compresses the Mixtral-8x7B model within minutes on a single A100 GPU, removing 10 billion parameters while retaining over 95% of the original performance. The method also generalizes to non-transformer architectures, compressing Mamba-3B by 20% while maintaining 99% of its performance.
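
To make the recipe above concrete, here is a minimal PyTorch sketch of its two ingredients: initializing the low-rank factors of a weight matrix with SVD, and a joint loss over teacher and student activations. This is an illustrative sketch, not the authors' implementation; the rank, the weighting `alpha`, and the secondary weight-reconstruction term are assumptions.

```python
import torch
import torch.nn.functional as F

def svd_low_rank_init(weight: torch.Tensor, rank: int):
    """Factor a (d_out, d_in) weight into B @ A with inner dimension `rank`."""
    U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
    B = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded in
    A = Vh[:rank, :]             # (rank, d_in)
    return B, A

def joint_distillation_loss(x, teacher_weight, B, A, alpha=0.5):
    """Match low-rank student activations to the frozen teacher's activations."""
    teacher_act = x @ teacher_weight.T   # teacher features for this layer
    student_act = (x @ A.T) @ B.T        # student features from B @ A
    feature_loss = F.mse_loss(student_act, teacher_act)
    # A term keeping B @ A close to the original weights is one plausible
    # second component; the paper's exact combination may differ.
    weight_loss = F.mse_loss(B @ A, teacher_weight)
    return alpha * feature_loss + (1 - alpha) * weight_loss
```

Because the SVD factors already approximate the original matrix, distillation starts close to the teacher, which is consistent with the fast convergence reported above.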
Low Difficulty Summary (written by GrooveSquid.com, original content)
Lillama is a new way to make big language models smaller and faster. Normally, shrinking these models hurts their ability to understand language, but Lillama keeps them accurate even after they are compressed. It does this with a technique called local distillation: each compressed part of the model learns to copy the behavior of the corresponding part of the original model, so no large amount of new training data is needed. This makes it possible to compress big models like Mixtral-8x7B and Mamba-3B without sacrificing much performance.
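
"Local" here means each compressed layer is fitted against its own teacher layer's outputs on a handful of calibration batches, with no end-to-end fine-tuning. The sketch below illustrates that loop under assumptions; the function names, optimizer, step count, and learning rate are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher_io(teacher_layer, calib_batches):
    """Record (input, output) pairs from the frozen teacher layer."""
    return [(x, teacher_layer(x)) for x in calib_batches]

def distill_layer_locally(student_layer, teacher_io, steps=100, lr=1e-4):
    """Fit one compressed layer to mimic its teacher layer in isolation."""
    opt = torch.optim.AdamW(student_layer.parameters(), lr=lr)
    for _ in range(steps):
        for x, y_teacher in teacher_io:
            loss = F.mse_loss(student_layer(x), y_teacher)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student_layer
```

Because each layer is trained independently on cached teacher activations, only one layer's weights and optimizer state need to be in memory at a time, which matches the single-GPU, minutes-scale compression described above.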

Keywords

» Artificial intelligence  » Distillation  » Loss function  » Pruning  » Transformer