Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

by Minsang Kim, Seungjun Baek

First submitted to arXiv on: 20 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper and is written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes an approach to training large language models (LLMs) at reduced computational cost. The authors develop a data pruning method based on information entropy that ranks training-corpus samples by their informativeness. The key insight is that less informative samples tend to contain redundant information, making them the first candidates for pruning. Entropy functions such as the negative log-likelihood and the average inverse word frequency serve as surrogate measures of sample informativeness. Experiments demonstrate that the method improves language-modeling performance across various tasks and enhances generalization.
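
To make the pruning recipe concrete, here is a minimal sketch of how such entropy-based scoring could look in practice. It is our illustration, not the authors' released code: the GPT-2 scoring model, the toy corpus, and the roughly 50% pruning ratio are all assumptions, and Hugging Face's transformers library stands in for whatever reference model the paper actually uses.

import torch
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Surrogate 1: average per-token negative log-likelihood (NLL) under a
# reference language model. GPT-2 here is an arbitrary stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text):
    # With labels=input_ids, the model returns the mean token cross-entropy,
    # i.e., the average negative log-likelihood of the sample.
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Surrogate 2: average inverse word frequency, with frequencies taken from
# the corpus itself, so samples full of rare words score as more informative.
def avg_inv_word_freq(text, freqs, total):
    words = text.lower().split()
    return sum(total / freqs[w] for w in words) / max(len(words), 1)

corpus = [
    "The cat sat on the mat.",
    "Entropy quantifies the average information content of a message.",
    "the the the the the the",  # redundant, low-information sample
]
freqs = Counter(w for t in corpus for w in t.lower().split())
total = sum(freqs.values())

# Rank samples from most to least informative and keep roughly the top half
# (the pruning ratio is illustrative, not taken from the paper).
ranked = sorted(corpus, key=avg_nll, reverse=True)
kept = ranked[: len(ranked) // 2 + 1]
print(kept)

# A cheaper alternative ranking using the word-frequency surrogate:
ranked_iwf = sorted(corpus, key=lambda t: avg_inv_word_freq(t, freqs, total),
                    reverse=True)

Samples with low average NLL are the ones the reference model already predicts easily, which matches the paper's intuition that such samples carry redundant information and can be pruned first.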
Low Difficulty Summary (original content by GrooveSquid.com)
This paper shows how to train big computers to understand human language while using less data. The researchers came up with a new way to pick out which parts of the training data are most important: they measure how much each piece of data tells us something new. The idea is that if some data isn’t very helpful, it’s probably just repeating what we already know, so we can get rid of it. By getting rid of less useful data, the computer does fewer calculations and trains faster. Pruning the data this way even makes the trained language model work better on real-life tasks.

Keywords

» Artificial intelligence  » Generalization  » Language model  » Log likelihood  » Pruning