Summary of On the Optimal Memorization Capacity of Transformers, by Tokio Kajitsuka et al.
On the Optimal Memorization Capacity of Transformers
by Tokio Kajitsuka, Issei Sato
First submitted to arXiv on: 26 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Prior work has explored the memorization capabilities of Transformers, but how efficiently they can memorize has remained unclear. The paper shows that Transformers can memorize labels with roughly √N parameters in a next-token prediction setting for N input sequences of length n. This bound is optimal and, thanks to parameter sharing, is essentially unaffected by the input sequence length. The paper also investigates memorization capacity in sequence-to-sequence tasks and finds that at least on the order of √(nN) parameters are required for Transformers with hardmax, suggesting that self-attention mechanisms excel at identifying input sequences, while feed-forward networks become the bottleneck when associating labels with tokens (the bounds are restated in the sketch after this table). |
| Low | GrooveSquid.com (original content) | Transformers have been shown to be effective at memorizing labels, but how efficient they are at it has not been well understood. The study found that Transformers can memorize labels using roughly the square root of N parameters in a next-token prediction setting, where N is the number of input sequences and n is the length of each sequence. This means they can memorize efficiently, almost regardless of how long the input sequences are. |
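For readers who want the headline results in symbols, here is a minimal sketch of the bounds described in the medium-difficulty summary, assuming the notation used there: N is the number of input sequences, n is the length of each sequence, and constants and logarithmic factors are left to the paper itself.

```latex
% Sketch of the memorization bounds summarized above (assumed notation:
% N = number of input sequences, n = length of each sequence;
% constants and logarithmic factors are omitted).
\begin{align*}
  \text{Next-token prediction:}\quad
    & \text{about } \sqrt{N} \text{ parameters suffice to memorize the labels,}\\
  \text{Sequence-to-sequence (hardmax):}\quad
    & \text{at least on the order of } \sqrt{nN} \text{ parameters are required.}
\end{align*}
```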
Keywords
» Artificial intelligence » Machine learning » Self attention » Token