Summary of Pre-training Distillation for Large Language Models: A Design Space Exploration, by Hao Peng et al.

Pre-training Distillation for Large Language Models: A Design Space Exploration

by Hao Peng, Xin Lv, Yushi Bai, Zijun Yao, Jiajie Zhang, Lei Hou, Juanzi Li

First submitted to arXiv on: 21 Oct 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	The paper's original abstract.
Medium	GrooveSquid.com (original content)	This paper studies knowledge distillation (KD) for large language models (LLMs) in the pre-training phase rather than the post-training phase. The authors introduce pre-training distillation (PD), in which knowledge is transferred from a large teacher LLM to a smaller student LLM while the student is being pre-trained. The paper explores the design space of PD along four aspects: logits processing, loss selection, scaling law, and the use of offline or online logits. Through extensive experiments, the authors find that larger student LLMs generally benefit more from pre-training distillation, while a larger teacher LLM does not necessarily guarantee better results. This work aims to inform future practices in pre-training distillation.
Low	GrooveSquid.com (original content)	This paper is about a new way to make large language models (LLMs) smarter. Instead of only teaching them after they're built, the authors want to teach them while they're being created. They call this "pre-training distillation". The idea is that a big LLM shares what it knows with a small one, so the smaller one improves more quickly. The authors tried different ways to do this and compared the results. For example, they found that bigger student models benefit more from this process, but a bigger teacher model doesn't always lead to better results.
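
The summaries above name "logits processing" and "loss selection" as design choices but do not give the paper's exact formulation. As a rough illustration only, the sketch below shows one common way a distillation objective is assembled: the teacher's logits are optionally truncated to their top-k entries and softened with a temperature, and the student is trained on a weighted mix of the usual language-modeling loss and the KL divergence from the teacher's distribution. The function names and the `alpha`, `temperature`, and `top_k` parameters are assumptions made for this example, not the authors' settings.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; -inf logits get probability 0."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_truncate(logits, k):
    """Hypothetical logits-processing step: keep the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    keep = set(order[:k])
    return [x if i in keep else float("-inf") for i, x in enumerate(logits)]


def kl_divergence(p, q):
    """KL(p || q), skipping entries where p is zero."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(student_logits, target_id):
    """Standard next-token language-modeling loss for one position."""
    return -math.log(softmax(student_logits)[target_id])


def distillation_loss(student_logits, teacher_logits, target_id,
                      alpha=0.5, temperature=1.0, top_k=None):
    """Weighted mix of LM loss and KL to the teacher (illustrative only)."""
    if top_k is not None:
        teacher_logits = top_k_truncate(teacher_logits, top_k)
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd = kl_divergence(p_teacher, p_student)
    lm = cross_entropy(student_logits, target_id)
    return (1.0 - alpha) * lm + alpha * kd


if __name__ == "__main__":
    student = [0.5, 0.1, 0.2, -0.3]
    teacher = [3.0, 1.0, 0.0, -2.0]
    print(distillation_loss(student, teacher, target_id=0, top_k=2))
```

Setting `alpha=0.0` recovers ordinary pre-training on the data alone, while `alpha=1.0` trains the student purely to match the teacher. Which mix, temperature, or truncation works best is exactly the kind of question the design-space exploration is meant to answer.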

Keywords

* Artificial intelligence
* Distillation
* Knowledge distillation
* Logits
* Teacher model
