Summary of Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models, by Taiqiang Wu et al.
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
by Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, Ngai Wong
First submitted to arXiv on: 3 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper revisits the role of Kullback-Leibler (KL) divergence in Knowledge Distillation (KD) for Large Language Models (LLMs). Contrary to prior claims, the study shows that neither reverse nor forward KL divergence exhibits the claimed mode-seeking or mean-seeking behavior; instead, both converge after a sufficient number of training epochs. In practice, however, LLMs are rarely distilled for that long. The paper therefore proposes an Adaptive Kullback-Leibler (AKL) divergence that combines forward and reverse KL (a minimal sketch of the idea appears after this table). Evaluations with automatic metrics and GPT-4-based judgments show that AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses. |
Low | GrooveSquid.com (original content) | This paper is about teaching a smaller language model to imitate a bigger one. Researchers have used two popular ways of measuring how close the student's predictions are to the teacher's, and this work finds that, with enough training, the two end up behaving the same for large language models. With the shorter training used in practice, though, they differ: one focuses on the words the teacher thinks are most likely, while the other focuses on the less likely ones. To get the best of both, the authors combine the two adaptively. Tests show this combined approach generates more helpful and varied responses than previous methods. |
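
The medium summary describes mixing forward and reverse KL divergence for distillation. Below is a minimal PyTorch sketch of that idea; the function names (`forward_kl`, `reverse_kl`, `adaptive_kl`) and the head/tail-gap weighting rule are illustrative assumptions for this summary, not the paper's exact AKL formulation.

```python
import torch
import torch.nn.functional as F


def forward_kl(student_logits, teacher_logits):
    # KL(teacher || student): pushes the student to cover all of the teacher's mass.
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_t.exp() * (log_t - log_s)).sum(dim=-1)


def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher): penalizes student mass where the teacher has little.
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_s.exp() * (log_s - log_t)).sum(dim=-1)


def adaptive_kl(student_logits, teacher_logits, eps=1e-8):
    # Hypothetical adaptive weighting: give forward KL more weight when the
    # teacher/student gap sits mostly in the teacher's high-probability "head",
    # and reverse KL more weight when it sits in the tail. This only loosely
    # mirrors the head/tail motivation and is NOT the paper's exact rule.
    with torch.no_grad():
        t = F.softmax(teacher_logits, dim=-1)
        s = F.softmax(student_logits, dim=-1)
        head = t >= t.mean(dim=-1, keepdim=True)       # crude head/tail split
        gap = (t - s).abs()
        head_gap = (gap * head).sum(dim=-1)
        tail_gap = (gap * (~head)).sum(dim=-1)
        w = head_gap / (head_gap + tail_gap + eps)     # weight on forward KL
    return w * forward_kl(student_logits, teacher_logits) + (1.0 - w) * reverse_kl(
        student_logits, teacher_logits
    )


# Toy usage: a batch of 4 next-token distributions over a 32k vocabulary.
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = adaptive_kl(student_logits, teacher_logits).mean()
loss.backward()
```

In this sketch the mixing weight is computed under `no_grad`, so it only gates the two divergences per example rather than being trained itself.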
Keywords
» Artificial intelligence » GPT » Knowledge distillation