Summary of The Fair Language Model Paradox, by Andrea Pinto, Tomer Galanti, and Randall Balestriero
The Fair Language Model Paradox
by Andrea Pinto, Tomer Galanti, Randall Balestriero
First submitted to arXiv on: 15 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The abstract presents a study of the training dynamics of Large Language Models (LLMs) at the token level. Current evaluations rely on aggregated training loss measured at the batch level, overlooking per-token biases that arise from varying token-level dynamics and from structural biases introduced by hyperparameters. The study reveals that weight decay, commonly used to stabilize training, silently introduces performance biases detectable only at the token level. The researchers demonstrate empirically, across dataset sizes, model architectures, and model sizes ranging from 270M to 3B parameters, that increasing weight decay disproportionately depreciates low-frequency tokens, which make up the vast majority of the token distribution in most languages. The findings call for novel regularization techniques that ensure fairness across all available tokens (see the sketch after this table). |
Low | GrooveSquid.com (original content) | Large Language Models are widely used in real-world applications, but researchers know little about how they are trained at the level of individual tokens. Currently, models are evaluated by their overall performance on a batch of data, which can hide important biases. The study shows that a common technique called weight decay introduces such biases and affects certain types of words more than others. This is concerning because most languages contain many rare, low-frequency words that are crucial for understanding. The researchers call for new ways of training models so they do not favor one type of word over another. |
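The batch-vs-token distinction described above is easiest to see by keeping per-token losses instead of only the batch average. Below is a minimal sketch, assuming PyTorch and entirely synthetic logits, targets, and token-frequency counts (hypothetical placeholders, not the paper's models or data), of how per-token cross-entropy can be bucketed by token frequency so that losses on rare tokens remain visible instead of being hidden inside the aggregated batch loss.

```python
# Minimal sketch: per-token loss bucketed by token frequency vs. the
# aggregated batch loss. All tensors here are synthetic placeholders.
import torch
import torch.nn.functional as F

vocab_size = 1000
batch, seq_len = 8, 128

# Hypothetical model outputs and targets for one training batch.
logits = torch.randn(batch, seq_len, vocab_size)
targets = torch.randint(0, vocab_size, (batch, seq_len))

# Hypothetical corpus frequency count per token id (e.g. counted offline).
token_counts = torch.randint(1, 10_000, (vocab_size,))

# Aggregated batch-level loss: a single number with no per-token detail.
batch_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1)
)

# Per-token losses: keep one loss value per target position.
per_token_loss = F.cross_entropy(
    logits.reshape(-1, vocab_size), targets.reshape(-1), reduction="none"
)

# Bucket each loss by the corpus frequency of its target token
# (log-spaced bins: 1, 10, 100, 1k, 10k).
freqs = token_counts[targets.reshape(-1)].float()
bins = torch.logspace(0, 4, steps=5)
bucket = torch.bucketize(freqs, bins)

for b in bucket.unique():
    mask = bucket == b
    print(
        f"frequency bucket {int(b)}: "
        f"mean loss {per_token_loss[mask].mean().item():.3f} "
        f"over {int(mask.sum())} tokens"
    )

print(f"aggregated batch loss: {batch_loss.item():.3f}")
```

Tracking the low-frequency buckets separately across training runs with different weight-decay values is one way to surface the kind of per-token disparity the paper describes, which a single batch-level curve would average away.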
Keywords
- Artificial intelligence
- Regularization
- Token