What Happened in LLMs Layers when Trained for Fast vs. Slow Thinking: A Gradient Perspective

by Ming Li, Yanhong Li, Tianyi Zhou

First submitted to arXiv on: 31 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary

Written by the paper authors. This version is the paper's original abstract, available on the arXiv page.

Medium Difficulty Summary

Written by GrooveSquid.com (original content). The paper investigates the training patterns of different layers in large language models (LLMs) by analyzing gradient dynamics during training with various responses and initial models. The study focuses on how fast vs. slow thinking affects layer-wise gradients, building on recent advances in training LLMs with chain-of-thought (CoT) reasoning and process rewards. Results show that fast thinking without CoT leads to larger gradients and larger gradient differences across layers than slow thinking with CoT, indicating the training stability the latter brings. Pre-trained LLMs are less affected by this instability than instruction-tuned LLMs. The study also explores whether gradient patterns can reflect response correctness when training different LLMs on slow vs. fast thinking paths; findings suggest that the gradients of slow thinking can distinguish correct from irrelevant reasoning paths. For comparison, the paper analyzes gradient dynamics on non-reasoning knowledge-learning tasks, where simply increasing response length does not produce similar behaviors. The study contributes a fundamental understanding of LLM training, offers novel insights into its efficiency and stability, and paves the way toward building a generalizable System-2 agent.
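To make the setup concrete, here is a minimal sketch of this kind of layer-wise gradient analysis, assuming a Hugging Face causal LM. The model name, the prompt/response strings, and the plain L2 gradient norm are illustrative assumptions rather than the paper's exact configuration: the sketch backpropagates the language-modeling loss once for a fast-thinking (direct answer) response and once for a slow-thinking (CoT) response, then compares per-layer gradient norms.

```python
# Sketch (not the authors' code): compare per-layer gradient norms when
# fine-tuning on a fast-thinking (direct answer) vs. slow-thinking (CoT)
# response. Model name, prompts, and the L2 norm are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-0.5B"  # placeholder; any causal LM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def layerwise_grad_norms(prompt: str, response: str) -> dict[int, float]:
    """Backprop the LM loss on `prompt + response` and return the combined
    gradient norm of each transformer layer's parameters."""
    model.zero_grad()
    ids = tok(prompt + response, return_tensors="pt").input_ids
    loss = model(ids, labels=ids).loss  # standard causal-LM loss
    loss.backward()
    norms: dict[int, float] = {}
    for name, p in model.named_parameters():
        # Layer parameters are named like "model.layers.12.mlp.up_proj.weight"
        if p.grad is not None and ".layers." in name:
            layer = int(name.split(".layers.")[1].split(".")[0])
            norms[layer] = norms.get(layer, 0.0) + p.grad.norm().item() ** 2
    return {k: v ** 0.5 for k, v in sorted(norms.items())}

prompt = "Q: What is 17 * 24? A: "
fast = layerwise_grad_norms(prompt, "408")                       # no CoT
slow = layerwise_grad_norms(prompt, "17 * 24 = 17 * 20 + 17 * 4 "
                                    "= 340 + 68 = 408")          # with CoT
for layer in fast:
    print(f"layer {layer:2d}: fast={fast[layer]:.3f}  slow={slow[layer]:.3f}")
```

In this framing, the finding above would show up as the fast-thinking norms being larger and varying more from layer to layer than the slow-thinking ones; the paper's actual metric and training setup may differ from this simplified single-step probe.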
Low Difficulty Summary

Written by GrooveSquid.com (original content). Large language models (LLMs) can be trained on different kinds of responses, starting from different initial models. This paper investigates how fast vs. slow thinking affects the gradients of each layer during that training. The results show that training on slow thinking (step-by-step reasoning) is more stable than training on fast thinking (direct answers). Pre-trained LLMs are less affected by this instability than instruction-tuned LLMs. The study also looks at whether gradient patterns can reveal whether a response is correct.

Keywords

* Artificial intelligence