Summary of QuRating: Selecting High-Quality Data for Training Language Models, by Alexander Wettig et al.
QuRating: Selecting High-Quality Data for Training Language Models
by Alexander Wettig, Aatmik Gupta, Saumya Malik, Danqi Chen
First submitted to arXiv on: 15 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract, available on arXiv. |
| Medium | GrooveSquid.com (original content) | QuRating is a novel method for selecting high-quality pre-training data that leverages human intuitions about data quality. The researchers investigated four qualities: writing style, required expertise, facts & trivia, and educational value. They found that large language models (LLMs) can discern these qualities, particularly when making pairwise judgments between texts. The QuRater model was trained to learn scalar ratings from pairwise judgments and was used to annotate a 260B-token training corpus with quality ratings for each criterion. Experiments showed the importance of balancing quality and diversity: when sampling with the quality ratings used as logits over documents, models obtained lower perplexity and stronger in-context learning performance than baselines (a minimal sketch of this sampling step follows the table). The best model, which selected data by educational value, performed similarly to a model trained with uniform sampling for 50% more steps. Beyond data selection, the quality ratings were used to construct a training curriculum that improved performance without changing the training dataset. |
| Low | GrooveSquid.com (original content) | Scientists created a new way to pick the best data for training language models. They wanted to make sure the data was helpful and not too easy or hard. The method is called QuRating, and it looks at four things: how well something is written, what kind of expertise you need to understand it, how many facts and bits of trivia it contains, and how useful it is for learning. They found that big computer models can tell these differences apart just by comparing texts. Using this new way to pick data, the team trained models that performed better than usual. This is important because language models are used in many areas, such as answering questions and generating text. |
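The sampling step described in the medium-difficulty summary treats each document’s quality rating as a logit, so a document is drawn with probability proportional to exp(rating / temperature), where the temperature trades off quality against diversity. The sketch below illustrates one way to read that procedure; the variable names (`ratings`, `temperature`, `num_selected`) and the concrete values are hypothetical stand-ins, not taken from the paper’s released code.

```python
import numpy as np

# Hypothetical illustration: treat each document's quality rating as a logit
# and sample a training subset via a softmax over the whole corpus.
rng = np.random.default_rng(0)

ratings = rng.normal(size=10_000)   # one quality rating per document (placeholder values)
temperature = 2.0                   # higher -> closer to uniform sampling (more diversity)
num_selected = 1_000                # size of the selected training subset

# Softmax over documents: p_i ∝ exp(rating_i / temperature)
logits = ratings / temperature
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sample documents without replacement according to these probabilities,
# so high-rated documents are favoured while lower-rated ones still
# have some chance of being selected.
selected = rng.choice(len(ratings), size=num_selected, replace=False, p=probs)
```

Raising the temperature flattens the distribution toward uniform sampling (more diversity), while lowering it concentrates the selection on the highest-rated documents (more quality), which is the quality–diversity balance the summary refers to.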
Keywords
* Artificial intelligence
* Logits
* Perplexity