Summary of QuRating: Selecting High-Quality Data for Training Language Models, by Alexander Wettig et al.
QuRating: Selecting High-Quality Data for Training Language Models
by Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
First submitted to arXiv on: 15 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract, available on arXiv. |
| Medium | GrooveSquid.com (original content) | QuRating is a novel method for selecting high-quality pre-training data that leverages human intuitions about data quality. The researchers investigated four qualities: writing style, required expertise, facts & trivia, and educational value. They found that large language models (LLMs) can discern these qualities, particularly when making pairwise judgments between texts. The QuRater model was trained to learn scalar ratings from pairwise judgments and was used to annotate a 260B-token training corpus with quality ratings for each criterion. Experiments showed the importance of balancing quality and diversity: when sampling with the quality ratings used as logits over documents, models obtained lower perplexity and stronger in-context learning performance than baselines (a minimal sketch of this sampling step follows the table). The best model, which selected data by educational value, performed similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, the quality ratings were used to construct a training curriculum that improved performance without changing the training dataset. |
| Low | GrooveSquid.com (original content) | Scientists created a new way to pick the best data for training language models. They wanted to make sure the data was helpful and not too easy or hard. The method is called QuRating, and it looks at four things: how well something is written, what kind of expertise you need to understand it, how many facts and bits of trivia it contains, and how useful it is for learning. They found that big computer models can tell these differences apart just by comparing texts. Using this new way to pick data, the team trained models that performed better than usual. This is important because language models are used in many areas, such as answering questions and generating text. |
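The sampling step described in the medium-difficulty summary treats each document’s quality rating as a logit, so a document is drawn with probability proportional to exp(rating / temperature), where the temperature trades off quality against diversity. The sketch below illustrates one way to read that procedure; the variable names (`ratings`, `temperature`, `num_selected`) and the concrete values are hypothetical stand-ins, not taken from the paper’s released code.

```python
import numpy as np

# Hypothetical illustration: treat each document's quality rating as a logit
# and sample a training subset via a softmax over the whole corpus.
rng = np.random.default_rng(0)

ratings = rng.normal(size=10_000)   # one quality rating per document (placeholder values)
temperature = 2.0                   # higher -> closer to uniform sampling (more diversity)
num_selected = 1_000                # size of the selected training subset

# Softmax over documents: p_i ∝ exp(rating_i / temperature)
logits = ratings / temperature
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample documents without replacement according to these probabilities,
# so high-rated documents are favoured while lower-rated ones still
# have some chance of being selected.
selected = rng.choice(len(ratings), size=num_selected, replace=False, p=probs)
```

Raising the temperature flattens the distribution toward uniform sampling (more diversity), while lowering it concentrates the selection on the highest-rated documents (more quality), which is the quality–diversity balance the summary refers to.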
Keywords
* Artificial intelligence
* Logits
* Perplexity