Summary of The Fair Language Model Paradox, by Andrea Pinto, Tomer Galanti, and Randall Balestriero
The Fair Language Model Paradox
by Andrea Pinto, Tomer Galanti, Randall Balestriero
First submitted to arXiv on: 15 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The abstract presents a study of the training dynamics of Large Language Models (LLMs) at the token level. Current evaluations rely on aggregated training loss measured at the batch level, overlooking per-token biases that arise from varying token-level dynamics and from structural biases introduced by hyperparameters. The study reveals that weight decay, commonly used to stabilize training, silently introduces performance biases detectable only at the token level. The researchers demonstrate empirically, across dataset sizes, model architectures, and model sizes ranging from 270M to 3B parameters, that increasing weight decay disproportionately depreciates low-frequency tokens, which make up the vast majority of the token distribution in most languages. The findings call for novel regularization techniques that ensure fairness across all available tokens (see the sketch after this table). |
Low | GrooveSquid.com (original content) | Large Language Models are widely used in real-world applications, but researchers know little about how they are trained at the level of individual tokens. Currently, models are evaluated by their overall performance on a batch of data, which can hide important biases. The study shows that a common technique called weight decay introduces such biases and affects certain types of words more than others. This is concerning because most languages contain many rare, low-frequency words that are crucial for understanding. The researchers call for new ways of training models so they do not favor one type of word over another. |
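The batch-vs-token distinction described above is easiest to see by keeping per-token losses instead of only the batch average. Below is a minimal sketch, assuming PyTorch and entirely synthetic logits, targets, and token-frequency counts (hypothetical placeholders, not the paper's models or data), of how per-token cross-entropy can be bucketed by token frequency so that losses on rare tokens remain visible instead of being hidden inside the aggregated batch loss.

```python
# Minimal sketch: per-token loss bucketed by token frequency vs. the
# aggregated batch loss. All tensors here are synthetic placeholders.
import torch
import torch.nn.functional as F

vocab_size = 1000
batch, seq_len = 8, 128

# Hypothetical model outputs and targets for one training batch.
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))

# Hypothetical corpus frequency count per token id (e.g. counted offline).
token_counts = torch.randint(1, 10_000, (vocab_size,))

# Aggregated batch-level loss: a single number with no per-token detail.
batch_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)

# Per-token losses: keep one loss value per target position.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
)

# Bucket each loss by the corpus frequency of its target token
# (log-spaced bins: 1, 10, 100, 1k, 10k).
freqs = token_counts[targets.reshape(-1)].float()
bins = torch.logspace(0, 4, steps=5)
bucket = torch.bucketize(freqs, bins)

for b in bucket.unique():
    mask = bucket == b
    print(
        f"frequency bucket {int(b)}: "
        f"mean loss {per_token_loss[mask].mean().item():.3f} "
        f"over {int(mask.sum())} tokens"
    )

print(f"aggregated batch loss: {batch_loss.item():.3f}")
```

Tracking the low-frequency buckets separately across training runs with different weight-decay values is one way to surface the kind of per-token disparity the paper describes, which a single batch-level curve would average away.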
Keywords
- Artificial intelligence
- Regularization
- Token