
Summary of The Fair Language Model Paradox, by Andrea Pinto, Tomer Galanti, and Randall Balestriero


The Fair Language Model Paradox

by Andrea Pinto, Tomer Galanti, Randall Balestriero

First submitted to arXiv on: 15 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The abstract presents a study on the training dynamics of Large Language Models (LLMs) at the token level. Current evaluations focus on aggregated training loss measured at the batch level, overlooking per-token biases resulting from varying token-level dynamics and structural biases introduced by hyperparameters. The study reveals that weight decay, commonly used to stabilize training, silently introduces performance biases detectable only at the token level. The researchers empirically demonstrate across different dataset sizes, model architectures, and sizes ranging from 270M to 3B parameters that increasing weight decay disproportionately depreciates low-frequency tokens, which represent a vast majority of the token distribution in most languages. The findings call for novel regularization techniques ensuring fairness across all available tokens.
Low Difficulty Summary (written by GrooveSquid.com, original content)
Large Language Models are widely used in real-world applications, but researchers know little about how they’re trained at the smallest level. Currently, scientists evaluate these models by looking at their overall performance over a batch of data, which can hide some important biases. The study shows that a common technique called weight decay actually introduces these biases and affects certain types of words more than others. This is concerning because most languages have many rare or low-frequency words that are crucial for understanding. The researchers suggest new ways to train models so they don’t favor one type of word over another.
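To make the token-level measurement concrete, here is a minimal sketch of how one might contrast the usual batch-averaged loss with per-token losses grouped by token frequency. This is an illustrative assumption, not the authors' code: the function name, the toy tensors, and the rare-token threshold are placeholders chosen for the example.

import torch
import torch.nn.functional as F

def loss_by_token_frequency(logits, targets, token_counts, rare_threshold=100):
    # Per-token cross-entropy with no reduction: this is the signal that
    # batch-level averaging hides.
    losses = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    # Look up how often each target token occurs in the corpus.
    freqs = token_counts[targets.reshape(-1)]
    rare = freqs < rare_threshold
    return {
        "batch_loss": losses.mean().item(),              # the usual aggregated view
        "rare_token_loss": losses[rare].mean().item(),   # low-frequency tokens
        "frequent_token_loss": losses[~rare].mean().item(),
    }

# Toy usage with random tensors, just to show the intended shapes.
vocab, batch, seq = 1000, 4, 16
logits = torch.randn(batch, seq, vocab)
targets = torch.randint(0, vocab, (batch, seq))
token_counts = torch.randint(1, 200, (vocab,))   # stand-in corpus frequencies
print(loss_by_token_frequency(logits, targets, token_counts))

Running a comparison like this at several weight-decay settings is one way to surface the disparity the paper describes: the rare-token loss growing faster than the frequent-token loss even while the batch-level loss looks healthy.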

Keywords

» Artificial intelligence  » Regularization  » Token