Summary of Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy, by Minsang Kim et al.
Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy
by Minsang Kim, Seungjun Baek
First submitted to arXiv on: 20 Jun 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes an approach to training large language models (LLMs) at reduced computational cost. The authors develop a data pruning method based on information entropy that ranks training corpus samples by their informativeness. The key insight is that less informative samples tend to contain redundant information, making them the first candidates for pruning. Entropy functions such as negative log-likelihood and average inverse word frequency serve as surrogates for sample informativeness (a minimal illustrative sketch of this kind of scoring appears after the table). Experimental results demonstrate that the method improves language modeling performance on various tasks and enhances generalization. |
Low | GrooveSquid.com (original content) | This paper shows how to train big language computers using less data. The authors came up with a new way to pick out which parts of the training data matter most, by measuring how much each piece of data tells us something new. The idea is that if some data isn’t very helpful, it’s probably just repeating what we already know, so we can get rid of it. By dropping less useful data, the computer needs fewer calculations and trains faster, and the trained language model even works better on real-life tasks. |
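
To make the scoring idea concrete, here is a minimal sketch of informativeness-based pruning using average inverse word frequency, one of the surrogate measures mentioned in the medium-difficulty summary. The toy corpus, the `prune_ratio` value, and the exact scoring formula are illustrative assumptions and are not taken from the paper, which applies such surrogates to large pretraining corpora.

```python
from collections import Counter

# Toy stand-in corpus; in practice these would be pretraining documents.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
    "quantum entanglement links distant particles",
    "the cat sat on the rug",
    "gradient descent minimizes a loss function",
]

# Corpus-level relative word frequencies.
counts = Counter(word for doc in corpus for word in doc.split())
total = sum(counts.values())
freq = {w: c / total for w, c in counts.items()}

def avg_inverse_word_frequency(doc: str) -> float:
    """Average inverse word frequency of a sample: higher values mean
    rarer words, i.e. the sample is treated as more informative."""
    words = doc.split()
    return sum(1.0 / freq[w] for w in words) / len(words)

# Rank samples from least to most informative and prune the bottom fraction,
# on the assumption that low-scoring samples are largely redundant.
prune_ratio = 0.4  # illustrative value, not taken from the paper
ranked = sorted(corpus, key=avg_inverse_word_frequency)
kept = ranked[int(len(ranked) * prune_ratio):]

for doc in kept:
    print(f"{avg_inverse_word_frequency(doc):6.2f}  {doc}")
```

Negative log-likelihood under a reference language model could replace the scoring function here in the same way: score each sample, sort, and drop the least informative fraction before training.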
Keywords
» Artificial intelligence » Generalization » Language model » Log likelihood » Pruning