Measuring Sample Importance in Data Pruning for Language Models based on Information Entropy

by Minsang Kim, Seungjun Baek

First submitted to arXiv on: 20 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper and is written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes an approach to training large language models (LLMs) at reduced computational cost. The authors develop a data pruning method based on information entropy that ranks training-corpus samples by their informativeness. The key insight is that less informative samples tend to contain redundant information, making them the first candidates for pruning. Entropy functions such as the negative log-likelihood and the average inverse word frequency serve as surrogate measures of sample informativeness. Experiments demonstrate that the method improves language-modeling performance across various tasks and enhances generalization.
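
To make the pruning recipe concrete, here is a minimal sketch of how such entropy-based scoring could look in practice. It is our illustration, not the authors' released code: the GPT-2 scoring model, the toy corpus, and the roughly 50% pruning ratio are all assumptions, and Hugging Face's transformers library stands in for whatever reference model the paper actually uses.

import torch
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

# Surrogate 1: average per-token negative log-likelihood (NLL) under a
# reference language model. GPT-2 here is an arbitrary stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def avg_nll(text):
    # With labels=input_ids, the model returns the mean token cross-entropy,
    # i.e., the average negative log-likelihood of the sample.
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Surrogate 2: average inverse word frequency, with frequencies taken from
# the corpus itself, so samples full of rare words score as more informative.
def avg_inv_word_freq(text, freqs, total):
    words = text.lower().split()
    return sum(total / freqs[w] for w in words) / max(len(words), 1)

corpus = [
    "The cat sat on the mat.",
    "Entropy quantifies the average information content of a message.",
    "the the the the the the",  # redundant, low-information sample
]
freqs = Counter(w for t in corpus for w in t.lower().split())
total = sum(freqs.values())

# Rank samples from most to least informative and keep roughly the top half
# (the pruning ratio is illustrative, not taken from the paper).
ranked = sorted(corpus, key=avg_nll, reverse=True)
kept = ranked[: len(ranked) // 2 + 1]
print(kept)

# A cheaper alternative ranking using the word-frequency surrogate:
ranked_iwf = sorted(corpus, key=lambda t: avg_inv_word_freq(t, freqs, total),
                    reverse=True)

Samples with low average NLL are the ones the reference model already predicts easily, which matches the paper's intuition that such samples carry redundant information and can be pruned first.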
Low Difficulty Summary (original content by GrooveSquid.com)
This paper shows how to train big computers to understand human language while using less data. The researchers came up with a new way to pick out which parts of the training data are most important: they measure how much each piece of data tells us something new. The idea is that if some data isn’t very helpful, it’s probably just repeating what we already know, so we can get rid of it. By getting rid of less useful data, the computer does fewer calculations and trains faster. Pruning the data this way even makes the trained language model work better on real-life tasks.

Keywords

» Artificial intelligence  » Generalization  » Language model  » Log likelihood  » Pruning