Summary of BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation, by Minchong Li et al.
BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
by Minchong Li, Feng Zhou, Xiaohui Song
First submitted to arXiv on: 19 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper investigates task-specific distillation of large language models (LLMs) at the logit level. It observes that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those of vision models, and that hidden “noise” in the long tail hurts distillation performance. To address this, the paper proposes the Bi-directional Logits Difference (BiLD) loss, which filters out long-tail noise by using only the top-k teacher and student logits, and exploits the internal ranking information of the logits by constructing logit differences (see the sketch after this table). Evaluated on 13 datasets with two types of LLMs, the BiLD loss outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods drawn from both the NLP and CV fields. |
Low | GrooveSquid.com (original content) | The paper explores how to make language models smaller without losing their abilities. It identifies a problem with existing ways of transferring knowledge from large teacher models to smaller student models and proposes a new method, BiLD, to solve it. BiLD helps the student model learn by using only the most important information from the teacher model and discarding unnecessary details. Tested on 13 datasets, this approach shows better results than other methods. |
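For readers who prefer code, below is a minimal PyTorch sketch of a BiLD-style loss, based only on the description above: keep the top-k logits, build pairwise logit differences, and match the resulting distributions in both a teacher-led and a student-led direction with a KL term. The function names, the default k, and the temperature are illustrative assumptions, not the authors' reference implementation.

```python
# Minimal sketch of a BiLD-style loss (assumptions: names, k, temperature).
import torch
import torch.nn.functional as F


def _pairwise_differences(logits_topk: torch.Tensor) -> torch.Tensor:
    # (batch, k) -> (batch, k*k): all pairwise differences among the kept logits.
    diff = logits_topk.unsqueeze(-1) - logits_topk.unsqueeze(-2)
    return diff.flatten(start_dim=-2)


def _directed_kl(lead_logits, follow_logits, lead_topk_idx, temperature):
    # Gather both models' logits at the leading model's top-k positions,
    # turn them into difference distributions, and take KL(lead || follow).
    lead_vals = lead_logits.gather(-1, lead_topk_idx)
    follow_vals = follow_logits.gather(-1, lead_topk_idx)
    p = F.softmax(_pairwise_differences(lead_vals) / temperature, dim=-1)
    log_q = F.log_softmax(_pairwise_differences(follow_vals) / temperature, dim=-1)
    return F.kl_div(log_q, p, reduction="batchmean")


def bild_loss(student_logits, teacher_logits, k: int = 8, temperature: float = 2.0):
    """Bi-directional logits-difference loss (sketch).

    student_logits, teacher_logits: (batch, vocab_size) tensors.
    """
    t_idx = teacher_logits.topk(k, dim=-1).indices  # teacher-led direction
    s_idx = student_logits.topk(k, dim=-1).indices  # student-led direction
    t2s = _directed_kl(teacher_logits, student_logits, t_idx, temperature)
    s2t = _directed_kl(student_logits, teacher_logits, s_idx, temperature)
    return t2s + s2t
```

The design intuition, as described in the summary, is that matching differences of the top-k logits rather than the raw logits keeps the ranking information that matters for distillation while discarding the noisy long tail of the vocabulary.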
Keywords
» Artificial intelligence » Distillation » Fine-tuning » Logits » NLP » Student model » Supervised » Teacher model