
Summary of Pre-training Distillation For Large Language Models: a Design Space Exploration, by Hao Peng et al.


Pre-training Distillation for Large Language Models: A Design Space Exploration

by Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li

First submitted to arXiv on: 21 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper’s original abstract, available on the paper’s arXiv page.

Medium Difficulty Summary (written by GrooveSquid.com, original content)

This paper studies knowledge distillation (KD) for large language models (LLMs), applying it in the pre-training phase rather than the post-training phase. The authors introduce pre-training distillation (PD), in which knowledge is transferred from a large teacher LLM to a smaller student LLM while the student is being pre-trained. The paper explores the design space of PD along four aspects: logits processing, loss selection, scaling law, and offline or online logits. Through extensive experiments, the authors find that larger student LLMs generally benefit more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. This work aims to inform future practices in pre-training distillation.

Low Difficulty Summary (written by GrooveSquid.com, original content)

This paper is about a new way to train large language models (LLMs). Instead of only teaching a model after it has been built, the authors teach it while it is being created. They call this “pre-training distillation”. The idea is that by passing knowledge from a big LLM (the teacher) to a small LLM (the student), the small one can get better, faster. The authors tried different ways to do this and found some interesting things. For example, they discovered that bigger student models get more benefit from this process, but having a bigger teacher model doesn’t always mean better results.
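
To make the idea of pre-training distillation more concrete, here is a minimal Python sketch of how a student's loss on a single token might combine the usual language-modeling loss with a distillation term computed from teacher logits. The summaries above do not say which choices the paper settles on, so everything here is an assumption for illustration: the top-k truncation of teacher logits (standing in for "logits processing"), the KL-divergence term (one possible "loss selection"), the mixing weight `alpha`, and all function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def truncate_top_k(teacher_logits, k):
    """Assumed logits processing: keep only the k largest teacher logits.

    Returns {token_id: logit}, a compact form that could be stored offline.
    """
    order = sorted(range(len(teacher_logits)),
                   key=lambda i: teacher_logits[i], reverse=True)
    return {i: teacher_logits[i] for i in order[:k]}


def distillation_loss(student_logits, teacher_top_k, temperature=1.0):
    """Assumed loss choice: KL-style divergence from teacher to student.

    The teacher distribution is renormalized over the kept token ids;
    the student distribution is taken over the full vocabulary.
    """
    ids = list(teacher_top_k)
    teacher_probs = softmax([teacher_top_k[i] for i in ids], temperature)
    student_probs = softmax(student_logits, temperature)
    return sum(p * (math.log(p) - math.log(student_probs[i]))
               for p, i in zip(teacher_probs, ids) if p > 0.0)


def lm_loss(student_logits, target_id):
    """Standard next-token cross-entropy against the ground-truth token."""
    return -math.log(softmax(student_logits)[target_id])


def combined_loss(student_logits, teacher_top_k, target_id,
                  alpha=0.5, temperature=1.0):
    """Mix the two losses; alpha is a hypothetical weighting parameter."""
    kd = distillation_loss(student_logits, teacher_top_k, temperature)
    lm = lm_loss(student_logits, target_id)
    return (1.0 - alpha) * lm + alpha * kd


# Toy example with a four-token vocabulary (made-up numbers).
teacher_logits = [2.0, 1.0, 0.1, -1.0]
student_logits = [0.5, 1.5, 0.0, -0.5]
stored = truncate_top_k(teacher_logits, k=2)  # what an offline setup might save
loss = combined_loss(student_logits, stored, target_id=0, alpha=0.5)
print(f"combined loss: {loss:.4f}")
```

In this sketch, "offline" logits would correspond to computing and saving `stored` ahead of time, while "online" logits would mean running the teacher during student training. Setting `alpha=0.0` recovers ordinary pre-training without distillation.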

Keywords

» Artificial intelligence  » Distillation  » Knowledge distillation  » Logits  » Teacher model