Summary of An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation, by Md Arafat Sultan et al.
An Empirical Investigation into the Effect of Parameter Choices in Knowledge Distillation
by Md Arafat Sultan, Aashka Trivedi, Parul Awasthy, Avirup Sil
First submitted to arXiv on: 12 Jan 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper presents a large-scale empirical study of how different configuration parameters affect performance in knowledge distillation (KD). Specifically, it investigates the impact of various distance metrics between teacher and student predictions, such as mean squared error (MSE) and KL-divergence. Although previous studies have touched on this topic, a systematic understanding of how these choices generally affect student performance is still lacking. The study takes an empirical approach to this question across 13 datasets from four NLP tasks and three student sizes. It quantifies the cost of making sub-optimal choices and identifies a single configuration that performs well across a wide range of settings. (A minimal code sketch of the two distance metrics appears below the table.) |
Low | GrooveSquid.com (original content) | This research looks at how different settings affect how well machines learn from each other. When we want a machine to learn from a more experienced one, we use a technique called knowledge distillation. One important choice is how to measure the difference between what the teacher and the student predict. The study tries to figure out which options work best by testing many combinations across 13 datasets, four types of language tasks, and three student sizes. It finds that some choices are better than others and can make a big difference in how well the student learns. |
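To make the comparison concrete, here is a minimal, illustrative sketch (not the authors' code) of the two teacher-student distance metrics mentioned above: MSE between logits and KL-divergence between temperature-softened output distributions. The function names and the temperature value are assumptions chosen for illustration.

```python
# Minimal sketch of two common KD distance metrics (illustrative only,
# not the paper's implementation): MSE on logits and KL-divergence on
# temperature-scaled probability distributions.
import torch
import torch.nn.functional as F

def kd_loss_mse(student_logits, teacher_logits):
    # Mean squared error computed directly between raw logits.
    return F.mse_loss(student_logits, teacher_logits)

def kd_loss_kl(student_logits, teacher_logits, temperature=2.0):
    # KL-divergence between temperature-softened distributions; the T**2
    # factor keeps gradient magnitudes comparable across temperatures
    # (standard practice in KD; the temperature value here is arbitrary).
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Example usage: a batch of 4 examples with 3 classes.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
print(kd_loss_mse(student, teacher).item())
print(kd_loss_kl(student, teacher).item())
```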
Keywords
* Artificial intelligence * Knowledge distillation * Machine learning * MSE * NLP