What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective
by Ming Li, Yanhong Li, Tianyi Zhou
First submitted to arXiv on: 31 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper investigates how the different layers of large language models (LLMs) are trained by analyzing layer-wise gradient dynamics under various responses and initial models. Building on recent advances in training LLMs with chain-of-thought (CoT) and process rewards, the study asks how fast vs. slow thinking shapes these gradients. The results show that fast thinking without CoT yields larger gradients and larger differences across layers than slow thinking with CoT, indicating the training stability the latter brings; pre-trained LLMs are less affected by this instability than instruction-tuned LLMs. The study also examines whether the gradient patterns of different LLMs can reflect response correctness when training on slow- vs. fast-thinking paths, finding that the gradients of slow thinking can distinguish correct from irrelevant reasoning paths (a minimal sketch of the layer-wise gradient measurement appears after this table). For comparison, the paper analyzes gradient dynamics on non-reasoning knowledge-learning tasks, where simply increasing response length does not produce similar behavior. The study contributes a fundamental understanding of LLM training, offers novel insights into its efficiency and stability, and paves the way toward building a generalizable System-2 agent. |
| Low | GrooveSquid.com (original content) | Large language models (LLMs) can be trained with different kinds of responses and starting models. This paper investigates how fast vs. slow thinking affects the gradients of each layer during training. The results show that slow thinking (step-by-step reasoning) makes training more stable than fast thinking (answering directly). Pre-trained LLMs are less affected by this instability than instruction-tuned LLMs. The study also looks at whether gradient patterns can show whether a response is correct. |
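The paper's core measurement is how gradient magnitudes vary across layers when an LLM is trained on a fast-thinking (answer-only) vs. a slow-thinking (CoT) response. Below is a minimal sketch of one way to take such a measurement in PyTorch with Hugging Face Transformers. It is an illustration under assumptions, not the authors' code: the model name (gpt2), the prompt, and the example responses are all placeholders.

```python
# Minimal sketch (not the authors' code): compare per-layer gradient norms
# for a fast-thinking (answer-only) vs. slow-thinking (CoT) training target.
# Model name, prompt, and responses below are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper studies several pre-trained and instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def layerwise_grad_norms(prompt: str, response: str) -> dict[int, float]:
    """One forward/backward pass; return the gradient L2 norm of each transformer block."""
    model.zero_grad()
    ids = tok(prompt + response, return_tensors="pt").input_ids
    # Loss over the full sequence for simplicity; a real SFT setup would mask the prompt tokens.
    loss = model(ids, labels=ids).loss
    loss.backward()
    norms = {}
    for i, block in enumerate(model.transformer.h):  # GPT-2's stack of transformer blocks
        sq = [p.grad.norm() ** 2 for p in block.parameters() if p.grad is not None]
        norms[i] = torch.stack(sq).sum().sqrt().item()
    return norms

prompt = "Q: What is 17 * 24? A: "
fast = "408."                                          # fast thinking: answer only
slow = "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408."  # slow thinking: step-by-step CoT
for label, resp in [("fast", fast), ("slow", slow)]:
    print(label, {i: round(v, 4) for i, v in layerwise_grad_norms(prompt, resp).items()})
```

Per the paper's findings, the fast-thinking target would be expected to produce larger gradient norms and larger differences across layers than the CoT target, with the effect more pronounced for instruction-tuned models.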