Summary of Compact Language Models via Pruning and Knowledge Distillation, by Saurav Muralidharan et al.
Compact Language Models via Pruning and Knowledge Distillation
by Saurav Muralidharan, Sharath Turuvekere Sreenivas, Raviraj Joshi, Marcin Chochowski, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro, Jan Kautz, Pavlo Molchanov
First submitted to arXiv on: 19 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | The paper’s original abstract; read it on arXiv.
Medium | GrooveSquid.com (original content) | This research paper investigates how to produce families of large language models (LLMs) that target different deployment scales and sizes. Currently, each variant is trained from scratch, which is extremely compute-intensive. The authors instead propose pruning an existing LLM to reduce its size by up to 4x and then re-training it with a small fraction (<3%) of the original training data. They develop practical compression best practices for LLMs that combine depth, width, attention, and MLP pruning with knowledge-distillation-based retraining. Using this guide, they compress the Nemotron-4 family of LLMs and compare the results to similarly-sized models on various language modeling tasks. Deriving smaller models from a larger pre-trained model in this way requires up to 40x fewer training tokens, yielding significant compute cost savings. Moreover, the resulting Minitron models exhibit improved performance compared to training from scratch and outperform state-of-the-art compression techniques. A minimal code sketch of this prune-and-distill recipe follows the table.
Low | GrooveSquid.com (original content) | This paper explores ways to make large language models (LLMs) smaller and more efficient without losing their abilities. Right now, every LLM variant is trained separately, which takes a lot of computing power. The researchers instead propose shrinking an existing LLM by up to 4x and then re-training it with only a small amount of data. They develop guidelines for compressing LLMs with different techniques and, following these guidelines, shrink the Nemotron-4 family of LLMs and test how well the smaller models perform on language tasks. The results show that making smaller models from a larger pre-trained model takes far less computing power, and the resulting Minitron models perform better than versions trained from scratch.
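The medium-difficulty summary describes a two-step recipe: structurally prune an existing model (depth, width, attention heads, MLP neurons), then retrain the pruned model with knowledge distillation from the original. The snippet below is a minimal, illustrative PyTorch sketch of two of those ingredients, an activation-based importance score for width (MLP-neuron) pruning and a forward-KL distillation loss. The function names and the specific scoring rule are assumptions chosen for illustration, not the authors' actual implementation.

```python
# Illustrative sketch only: activation-based width pruning + a KD loss.
# Function names and the exact importance metric are assumptions,
# not taken from the paper's code.
import torch
import torch.nn.functional as F


def mlp_neuron_importance(hidden_acts: torch.Tensor) -> torch.Tensor:
    """Score each MLP hidden neuron by its mean absolute activation
    over a small calibration batch; hidden_acts: [batch, seq, hidden]."""
    return hidden_acts.abs().mean(dim=(0, 1))


def prune_linear_rows(layer: torch.nn.Linear, keep: torch.Tensor) -> torch.nn.Linear:
    """Return a narrower Linear that keeps only the selected output rows."""
    pruned = torch.nn.Linear(layer.in_features, keep.numel(),
                             bias=layer.bias is not None)
    pruned.weight.data = layer.weight.data[keep].clone()
    if layer.bias is not None:
        pruned.bias.data = layer.bias.data[keep].clone()
    return pruned


def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 1.0) -> torch.Tensor:
    """Forward KL between the teacher's and student's token distributions,
    used to retrain the pruned (student) model against the original (teacher)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean") * (t * t)


if __name__ == "__main__":
    # Toy example: keep the top half of a 16-neuron layer, then compute a KD loss.
    fc = torch.nn.Linear(8, 16)
    calib_acts = torch.relu(fc(torch.randn(4, 10, 8)))        # [batch, seq, 16]
    keep = mlp_neuron_importance(calib_acts).topk(8).indices  # top 8 neurons
    fc_small = prune_linear_rows(fc, keep)

    student_logits = torch.randn(2, 5, 32000)  # pruned model outputs (dummy)
    teacher_logits = torch.randn(2, 5, 32000)  # original model outputs (dummy)
    print(distillation_loss(student_logits, teacher_logits, temperature=2.0))
```

In the recipe summarized above, the same pruning idea is applied across width, attention, MLP, and depth, and a distillation loss of this kind drives the short retraining run on the small (<3%) data budget.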
Keywords
* Artificial intelligence
* Attention
* Knowledge distillation
* Pruning