Summary of Pre-training Distillation for Large Language Models: A Design Space Exploration, by Hao Peng et al.
Pre-training Distillation for Large Language Models: A Design Space Exploration
by Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li
First submitted to arXiv on: 21 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary: Read the original abstract on the paper's arXiv page. |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: This paper extends knowledge distillation (KD) for large language models (LLMs) from the post-training phase to the pre-training phase. The authors call the approach pre-training distillation (PD): a large teacher LLM transfers knowledge to a smaller student LLM while the student is being pre-trained. The paper explores the design space of PD along four aspects: logits processing, loss selection, scaling law, and offline versus online logits. Through extensive experiments, the authors find that larger student LLMs generally benefit more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. The work aims to inform future practice in pre-training distillation. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary: This paper is about a new way to make smaller language models smarter with help from large language models (LLMs). Instead of only teaching a model after it’s built, the authors teach it while it’s being created. They call this “pre-training distillation”. The idea is that a big LLM shares what it knows with a small one during training, so the smaller one gets better faster. The authors tried different ways to do this and found some interesting things. For example, bigger student models benefit more from this process, but a bigger teacher model doesn’t always mean better results. |
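
To make the idea in the summaries above more concrete, here is a minimal sketch of what a pre-training distillation loss could look like for a single token position. It is an illustration under stated assumptions, not the authors' implementation: the function names, the top-k truncation of teacher logits, the temperature, and the `alpha` weighting between the two losses are all hypothetical choices picked to show "logits processing" and "loss selection" in the simplest possible form.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(logits, k):
    """Indices of the k largest logits (one simple form of logits processing)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def lm_loss(student_logits, target):
    """Ordinary language-modeling loss: cross-entropy on the true next token."""
    return -math.log(softmax(student_logits)[target])


def kd_loss(teacher_logits, student_logits, k, temperature=1.0):
    """KL(teacher || student), with the teacher restricted to its top-k tokens."""
    kept = top_k_indices(teacher_logits, k)
    # Teacher probabilities are renormalized over the kept tokens only.
    p = softmax([teacher_logits[i] for i in kept], temperature)
    # Student probabilities are taken over the full vocabulary.
    q = softmax(student_logits, temperature)
    return sum(p_i * (math.log(p_i) - math.log(q[i])) for p_i, i in zip(p, kept))


def pd_loss(teacher_logits, student_logits, target, alpha=0.5, k=2, temperature=1.0):
    """Hypothetical combined objective: (1 - alpha) * LM loss + alpha * KD loss."""
    return (1 - alpha) * lm_loss(student_logits, target) + alpha * kd_loss(
        teacher_logits, student_logits, k, temperature
    )


# Toy example: a 4-token vocabulary, one position, true next token = index 2.
teacher = [0.1, 1.0, 3.0, -1.0]
student = [0.0, 0.5, 1.0, 0.0]
print(pd_loss(teacher, student, target=2))
```

In this sketch, setting `alpha=0.0` falls back to ordinary pre-training, and raising `alpha` makes the student lean more on the teacher. In an offline setup the teacher logits would be computed once and stored before training, and keeping only the top-k values is one way to keep that storage manageable. In an online setup the teacher would produce them during training.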
Keywords
» Artificial intelligence » Distillation » Knowledge distillation » Logits » Teacher model