
Summary of What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions, by Sang Keun Choe et al.


What is Your Data Worth to GPT? LLM-Scale Data Valuation with Influence Functions

by Sang Keun Choe, Hwijeen Ahn, Juhan Bae, Kewen Zhao, Minsoo Kang, Youngseog Chung, Adithya Pratapa, Willie Neiswanger, Emma Strubell, Teruko Mitamura, Jeff Schneider, Eduard Hovy, Roger Grosse, Eric Xing

First submitted to arxiv on: 22 May 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes an approach to data valuation for large language models, aiming to quantify how much each training data point contributes to a model’s output. It builds on influence functions, a gradient-based method, and develops an efficient strategy called LoGra that exploits the gradient structure in backpropagation, making data valuation scalable to recent LLMs and their vast training datasets. The authors also introduce LogIX, a software package that converts existing training code into data valuation code with minimal effort. Experiments show accuracy competitive with baselines alongside significant improvements in throughput and GPU memory usage. An illustrative sketch of gradient-based influence scoring follows the summaries below.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models are trained on huge amounts of text data, but often don’t give credit to the people who provided that data. One way to fix this is by giving each piece of data a score based on how much it contributed to the model’s output. However, doing this for large models has been tricky due to the massive amount of computing power and memory required. This study makes progress in this area by developing an efficient method called LoGra that can handle big models and datasets. The researchers also create a tool called LogIX that makes it easy to apply data valuation to existing training code.
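
To make the gradient-based idea in the medium difficulty summary concrete, below is a minimal sketch of influence-style data valuation in PyTorch: each training example is scored by the similarity between its loss gradient and the loss gradient of a query example, with gradients compressed through a random low-rank projection in the spirit of exploiting gradient structure rather than storing full gradients. The toy model, data, projection, and function names are illustrative assumptions only; this is not the authors’ LoGra algorithm or the LogIX API, and it omits the inverse-Hessian (or Fisher) preconditioning that full influence functions apply.

    # Illustrative sketch: influence-style data valuation via low-rank-projected
    # per-example gradients. Not the authors' LoGra/LogIX implementation.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Toy regression model and data, standing in for an LLM and its corpus.
    model = nn.Linear(16, 1)
    loss_fn = nn.MSELoss()
    train_x, train_y = torch.randn(8, 16), torch.randn(8, 1)
    query_x, query_y = torch.randn(1, 16), torch.randn(1, 1)

    # Random low-rank projection of the flattened gradient, so scores are
    # computed on compressed gradients instead of full parameter-sized ones.
    n_params = sum(p.numel() for p in model.parameters())
    rank = 4
    proj = torch.randn(n_params, rank) / n_params ** 0.5

    def projected_grad(x, y):
        # Flattened, low-rank-projected gradient of the loss at (x, y).
        model.zero_grad()
        loss_fn(model(x), y).backward()
        flat = torch.cat([p.grad.reshape(-1) for p in model.parameters()])
        return flat @ proj

    query_grad = projected_grad(query_x, query_y)

    # Score each training example by gradient similarity to the query example.
    # A full influence function would also precondition with an inverse
    # Hessian or Fisher matrix; this sketch uses the plain dot product.
    scores = [
        torch.dot(projected_grad(train_x[i:i + 1], train_y[i:i + 1]), query_grad).item()
        for i in range(len(train_x))
    ]
    for i, s in enumerate(scores):
        print(f"train example {i}: influence score {s:+.4f}")

In practice, a system like the one the paper describes would capture and compress per-example gradients during backpropagation over a large corpus rather than recomputing them one at a time, which is the kind of efficiency the paper targets.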

Keywords

  • Artificial intelligence
  • Backpropagation