
Summary of Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models, by Taiqiang Wu et al.


Rethinking Kullback-Leibler Divergence in Knowledge Distillation for Large Language Models

by Taiqiang Wu, Chaofan Tao, Jiahao Wang, Runming Yang, Zhe Zhao, Ngai Wong

First submitted to arXiv on: 3 Apr 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
Read the original abstract here
Medium Difficulty Summary (GrooveSquid.com, original content)
This paper explores the role of Kullback-Leibler divergence in Knowledge Distillation (KD) for Large Language Models (LLMs). Contrary to prior claims, the study shows that neither reverse nor forward Kullback-Leibler divergence exhibits mode-seeking or mean-seeking behavior in this setting; instead, both converge to the same result after a sufficient number of epochs. In practice, however, LLMs are rarely trained for such extended periods. The paper therefore proposes an Adaptive Kullback-Leibler (AKL) method that combines forward and reverse Kullback-Leibler divergence. Evaluations using both metric-based and GPT-4-based assessments demonstrate that AKL outperforms the baselines across various tasks and improves the diversity and quality of generated responses.
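To make the idea concrete, here is a minimal sketch of combining forward and reverse KL divergence with adaptive weights. The head/tail split by teacher probability and the gap-based weighting below are illustrative assumptions; the exact weighting scheme in the paper may differ.

```python
import math

def forward_kl(p, q):
    # KL(p || q): teacher distribution p, student distribution q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def reverse_kl(p, q):
    # KL(q || p): the same divergence with the arguments swapped.
    return sum(qi * math.log(qi / pi) for pi, qi in zip(p, q) if qi > 0)

def adaptive_kl(p, q, mu=0.5):
    # Hypothetical adaptive combination (not the paper's exact formula):
    # split the vocabulary into a "head" (highest-teacher-probability
    # tokens covering mass mu) and a "tail" (the rest), then weight the
    # forward KL by the head discrepancy and the reverse KL by the tail
    # discrepancy.
    order = sorted(range(len(p)), key=lambda i: p[i], reverse=True)
    head, mass = set(), 0.0
    for i in order:
        head.add(i)
        mass += p[i]
        if mass >= mu:
            break
    g_head = sum(abs(p[i] - q[i]) for i in head)
    g_tail = sum(abs(p[i] - q[i]) for i in range(len(p)) if i not in head)
    total = g_head + g_tail
    w = g_head / total if total > 0 else 0.5
    return w * forward_kl(p, q) + (1 - w) * reverse_kl(p, q)

# Toy next-token distributions over a 3-word vocabulary.
teacher = [0.7, 0.2, 0.1]
student = [0.4, 0.35, 0.25]
loss = adaptive_kl(teacher, student)
```

In a real distillation run, `p` and `q` would be per-token softmax outputs of the teacher and student models, and the loss would be averaged over the sequence.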
Low Difficulty Summary (GrooveSquid.com, original content)
This paper is about how to teach smaller language models to imitate larger ones. Researchers previously believed that two popular teaching methods behave very differently, but this study shows they end up in the same place if training runs long enough. In practice, training is cut short, and during that limited time one method pays more attention to the words the teacher thinks are most likely, while the other pays more attention to the rare ones. To take advantage of both, the authors came up with a new way to combine them adaptively. Tests showed that this new approach generates more varied and higher-quality responses than previous ones.

Keywords

» Artificial intelligence  » Gpt  » Knowledge distillation