Learning to grok: Emergence of in-context learning and skill composition in modular arithmetic tasks

by Tianyu He, Darshil Doshi, Aritra Das, Andrey Gromov

First submitted to arXiv on: 4 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Disordered Systems and Neural Networks (cond-mat.dis-nn); High Energy Physics – Theory (hep-th); Machine Learning (stat.ML)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper explores the emergence of in-context learning and skill composition in modular arithmetic tasks. A GPT-style transformer is pre-trained on a subset of these tasks and tested on out-of-distribution ones. As the number of pre-training tasks grows, the model transitions from in-distribution to out-of-distribution generalization. The smallest model capable of out-of-distribution generalization has two transformer blocks, while in deeper models out-of-distribution generalization is transient, so training must be stopped early. An interpretability study reveals highly structured representations in both attention heads and MLPs, and an algorithmic shift as the number of in-context examples increases.
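To make the experimental setup concrete, here is a minimal sketch of how an in-context modular arithmetic task might be constructed. The exact task family is not given in this summary, so the linear form f(x, y) = (a·x + b·y) mod p, and all function and variable names, are assumptions for illustration only.

```python
import random

def make_task(p, seed):
    """One hypothetical modular arithmetic 'skill': a linear map
    f(x, y) = (a*x + b*y) mod p, with coefficients (a, b) drawn per task.
    (This task family is an assumption, not taken from the paper summary.)"""
    rng = random.Random(seed)
    a, b = rng.randrange(1, p), rng.randrange(1, p)
    return lambda x, y: (a * x + b * y) % p

def make_prompt(task, p, n_examples, seed):
    """Build an in-context sequence: n_examples triples (x, y, f(x, y)),
    followed by a query pair (qx, qy) whose answer the model must predict."""
    rng = random.Random(seed)
    tokens = []
    for _ in range(n_examples):
        x, y = rng.randrange(p), rng.randrange(p)
        tokens += [x, y, task(x, y)]
    qx, qy = rng.randrange(p), rng.randrange(p)
    tokens += [qx, qy]
    return tokens, task(qx, qy)

# A pre-training set would sample many such tasks; held-out coefficient
# pairs then serve as out-of-distribution test tasks.
p = 29
task = make_task(p, seed=0)
prompt, answer = make_prompt(task, p, n_examples=4, seed=1)
```

Under this setup, varying how many distinct tasks appear in pre-training, and how many in-context examples appear in each prompt, matches the two axes the summary describes.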
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper studies how large language models learn new skills and apply them to problems they haven’t seen before. The researchers use a special type of math problem, modular arithmetic, to test the model’s ability to generalize. They find that as the model is trained on more of these problems, it becomes better at solving new, unseen ones. However, they also discover that this ability can be temporary: deeper models must be stopped early, before their performance on new problems degrades. By looking inside the model’s representations, they see that it builds a highly structured understanding of the math problems.

Keywords

» Artificial intelligence  » Attention  » Early stopping  » Generalization  » Gpt  » Transformer