Summary of Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models, by Taiqiang Wu et al.
Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models
by Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, Ngai Wong
First submitted to arXiv on: 3 Apr 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper revisits the role of Kullback-Leibler (KL) divergence in Knowledge Distillation (KD) for Large Language Models (LLMs). Contrary to prior claims, the study shows that neither reverse nor forward KL divergence exhibits the claimed mode-seeking or mean-seeking behavior; instead, both converge after a sufficient number of training epochs. In practice, however, LLMs are rarely distilled for that long. The paper therefore proposes an Adaptive Kullback-Leibler (AKL) divergence that combines forward and reverse KL (a minimal sketch of the idea appears after this table). Evaluations with automatic metrics and GPT-4-based judgments show that AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses. |
Low | GrooveSquid.com (original content) | This paper is about teaching a smaller language model to imitate a bigger one. Researchers have used two popular ways of measuring how close the student's predictions are to the teacher's, and this work finds that, with enough training, the two end up behaving the same for large language models. With the shorter training used in practice, though, they differ: one focuses on the words the teacher thinks are most likely, while the other focuses on the less likely ones. To get the best of both, the authors combine the two adaptively. Tests show this combined approach generates more helpful and varied responses than previous methods. |
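
The medium summary describes mixing forward and reverse KL divergence for distillation. Below is a minimal PyTorch sketch of that idea; the function names (`forward_kl`, `reverse_kl`, `adaptive_kl`) and the head/tail-gap weighting rule are illustrative assumptions for this summary, not the paper's exact AKL formulation.

```python
import torch
import torch.nn.functional as F


def forward_kl(student_logits, teacher_logits):
    # KL(teacher || student): pushes the student to cover all of the teacher's mass.
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_t.exp() * (log_t - log_s)).sum(dim=-1)


def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher): penalizes student mass where the teacher has little.
    log_s = F.log_softmax(student_logits, dim=-1)
    log_t = F.log_softmax(teacher_logits, dim=-1)
    return (log_s.exp() * (log_s - log_t)).sum(dim=-1)


def adaptive_kl(student_logits, teacher_logits, eps=1e-8):
    # Hypothetical adaptive weighting: give forward KL more weight when the
    # teacher/student gap sits mostly in the teacher's high-probability "head",
    # and reverse KL more weight when it sits in the tail. This only loosely
    # mirrors the head/tail motivation and is NOT the paper's exact rule.
    with torch.no_grad():
        t = F.softmax(teacher_logits, dim=-1)
        s = F.softmax(student_logits, dim=-1)
        head = t >= t.mean(dim=-1, keepdim=True)       # crude head/tail split
        gap = (t - s).abs()
        head_gap = (gap * head).sum(dim=-1)
        tail_gap = (gap * (~head)).sum(dim=-1)
        w = head_gap / (head_gap + tail_gap + eps)     # weight on forward KL
    return w * forward_kl(student_logits, teacher_logits) + (1.0 - w) * reverse_kl(
        student_logits, teacher_logits
    )


# Toy usage: a batch of 4 next-token distributions over a 32k vocabulary.
teacher_logits = torch.randn(4, 32000)
student_logits = torch.randn(4, 32000, requires_grad=True)
loss = adaptive_kl(student_logits, teacher_logits).mean()
loss.backward()
```

In this sketch the mixing weight is computed under `no_grad`, so it only gates the two divergences per example rather than being trained itself.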
Keywords
» Artificial intelligence » GPT » Knowledge distillation