Summary of Infinite Limits of Multi-head Transformer Dynamics, by Blake Bordelon et al.
Infinite Limits of Multi-head Transformer Dynamics
by Blake Bordelon, Hamza Tahir Chaudhry, Cengiz Pehlevan
First submitted to arXiv on: 24 May 2024
Categories
- Main: Machine Learning (stat.ML)
- Secondary: Disordered Systems and Neural Networks (cond-mat.dis-nn); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary The paper's original abstract, available on its arXiv listing. |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The researchers investigate scaling limits of the training dynamics of transformer models in the feature learning regime, with a focus on the attention layers. They identify the parameterizations that admit well-defined infinite width and depth limits while still allowing the attention layers to update throughout training. Using tools from dynamical mean field theory (DMFT), they analyze several infinite limits: infinite key/query dimension, infinitely many heads, and infinite depth. The statistical description differs depending on which limit is taken and how the attention layers are scaled. Numerical evidence supports convergence to these limits, and the study discusses how the choice of parameterization influences the learned features. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Transformer models learn useful patterns, called features, from data as they train. The researchers studied what happens to this learning when the models get really big. They found specific ways of setting up the model so that it keeps learning features, including in its attention layers, no matter how big it gets. The team used a mathematical toolkit called dynamical mean field theory (DMFT) to describe these very large models. They looked at what happens when different parts of the model (the size of each attention head, the number of heads, and the number of layers) get really big, and how this affects what the model learns. |
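
To give a feel for why the scaling of attention layers matters, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the paper's method: the function name `attention_logit_std` and the exponent `alpha` are hypothetical, and the queries and keys are independent random vectors, which only resembles a network at random initialization and says nothing about training dynamics. It measures how the spread of the attention logit `q . k / n**alpha` changes as the key/query dimension `n` grows.

```python
import numpy as np


def attention_logit_std(n, alpha, num_samples=2000, seed=0):
    """Empirical std of q . k / n**alpha for independent standard normal q, k.

    n     : key/query dimension (illustrative)
    alpha : hypothetical scaling exponent applied to the dot product
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((num_samples, n))
    k = rng.standard_normal((num_samples, n))
    # Row-wise dot products, rescaled by n**alpha.
    logits = np.sum(q * k, axis=1) / n**alpha
    return float(logits.std())


if __name__ == "__main__":
    for n in (64, 256, 1024):
        std_sqrt = attention_logit_std(n, alpha=0.5)
        std_lin = attention_logit_std(n, alpha=1.0)
        print(f"n={n:5d}  alpha=0.5: {std_sqrt:.3f}  alpha=1.0: {std_lin:.3f}")
```

For independent vectors, the dot product has standard deviation `sqrt(n)`, so dividing by `sqrt(n)` keeps the logits at a roughly constant size while dividing by `n` makes them shrink as `n` grows. During training, queries and keys stop being independent, which is one reason the choice of scaling needs the more careful analysis the paper carries out.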
Keywords
» Artificial intelligence » Attention » Transformer