Breaking Symmetry When Training Transformers

by Chunsheng Zuo, Michael Guerzhoy

First submitted to arXiv on: 6 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The authors investigate the relationship between the Transformer architecture and its ability to model input sequences in which order matters. They show that when neither positional encodings nor causal attention is used, the prediction of the next output token is invariant to permutations of the previous tokens. When causal attention is present, this symmetry can be broken: the causal attention mechanism encourages “slices” of the Transformer to represent the same location in the sequence. The authors hypothesize that residual connections contribute to this phenomenon and provide evidence for it. (A short illustrative code sketch of the permutation symmetry follows these summaries.)

Low Difficulty Summary (original content by GrooveSquid.com)
Transformers are powerful language models that can process sequences of words. But did you know that, under the right conditions, they can completely ignore the order of those words? Researchers found that when Transformers leave out certain mechanisms, their predictions become immune to the order in which the input tokens appear. This matters because many real-world tasks depend on the order of the sequence. The authors think that something called residual connections helps Transformers keep track of order, and they provide evidence for this in their research.

Keywords

  • Artificial intelligence
  • Attention
  • Token
  • Transformer