Summary of Infinite Limits of Multi-head Transformer Dynamics, by Blake Bordelon et al.


Infinite Limits of Multi-head Transformer Dynamics

by Blake Bordelon, Hamza Tahir Chaudhry, Cengiz Pehlevan

First submitted to arXiv on: 24 May 2024

Categories

  • Main: Machine Learning (stat.ML)
  • Secondary: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by: Paper authors)
Read the original abstract here

Medium Difficulty Summary (written by: GrooveSquid.com, original content)
The researchers analyze infinite limits of transformer training dynamics in the feature learning regime. They identify the parameterizations that admit well-defined infinite width and depth limits while still allowing the attention layers to update throughout training. Using tools from dynamical mean field theory (DMFT), they analyze several limits: infinite key/query dimension, infinite heads, and infinite depth. These limits have different statistical descriptions depending on which limit is taken and how the attention layers are scaled. Numerical evidence supports convergence to these limits, and the study discusses how the parameterization qualitatively influences the learned features.

Low Difficulty Summary (written by: GrooveSquid.com, original content)
Transformer models are a widely used kind of AI model. The researchers studied how these models learn when parts of them get extremely large. They found specific ways of setting up the model so that it still behaves sensibly at that size and its attention layers keep learning during training. To analyze this, the team used a mathematical tool called dynamical mean field theory (DMFT). They looked at what happens when different parts of the model (the key/query dimension, the number of heads, and the depth) become very large, and how this affects what the model learns.
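
To make the scaling of attention layers mentioned in the medium difficulty summary more concrete, here is a minimal NumPy sketch. It is not taken from the paper's code. It computes attention logits q·k / N^alpha for a key/query dimension N and measures their spread at random initialization. The function names, the Gaussian initialization, and the exponent name `alpha` are assumptions made for illustration; alpha = 0.5 is the familiar 1/sqrt(N) scaling, and alpha = 1.0 is shown only as a contrasting choice, not as a statement of the paper's exact parameterization.

```python
import numpy as np


def attention_logits(q, k, alpha):
    """Return q @ k.T / N**alpha, where N is the key/query dimension."""
    n = q.shape[-1]
    return q @ k.T / n**alpha


def logit_std_at_init(n, alpha, tokens=64, seed=0):
    """Standard deviation of the logits for random unit-variance q and k.

    Illustrative assumption: q and k have i.i.d. standard normal entries,
    which stands in for an untrained attention layer.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((tokens, n))
    k = rng.standard_normal((tokens, n))
    return float(attention_logits(q, k, alpha).std())


if __name__ == "__main__":
    # Compare how the logit scale changes as the key/query dimension grows.
    for n in (16, 256, 4096):
        print(n, logit_std_at_init(n, 0.5), logit_std_at_init(n, 1.0))
```

With independent unit-variance entries, the raw dot product q·k has standard deviation sqrt(N), so dividing by sqrt(N) keeps the logits of order one as N grows, while dividing by N makes them shrink at initialization. Which choice is appropriate during training is the kind of question the paper's analysis addresses; this sketch only shows the effect at initialization.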

Keywords

» Artificial intelligence  » Attention  » Transformer