

Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling

by Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi

First submitted to arXiv on: 29 Aug 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Training on high-quality synthetic data generated by strong language models (LMs) is a common strategy for improving LMs' reasoning performance. This paper asks whether that strategy is compute-optimal under a fixed inference budget, exploring the trade-offs between using stronger but more expensive (SE) models versus weaker but cheaper (WC) models to generate synthetic data. Evaluating the generated data on coverage, diversity, and false positive rate, the study finds that WC-generated data can have higher coverage and diversity but also a higher false positive rate. Finetuning LMs on data from SE and WC models in several settings (knowledge distillation, self-improvement, and weak-to-strong improvement) shows that models trained on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and model choices. These findings challenge the prevailing practice of relying on SE models for synthetic data generation, suggesting that WC sampling may be the compute-optimal approach.
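The compute-matched comparison described above can be sketched with a small, purely illustrative calculation. Assuming generation cost scales roughly with a model's parameter count, a fixed budget buys proportionally more samples from the cheaper WC model; more samples per question can in turn raise coverage (the fraction of questions with at least one correct sampled solution). The model sizes and correctness grids below are hypothetical examples, not the paper's actual experimental setup.

```python
# Minimal sketch of compute-matched sampling, assuming generation cost
# scales with parameter count (so the sample ratio is P_SE / P_WC).

def compute_matched_samples(samples_se: int, params_se: float, params_wc: float) -> int:
    """How many WC samples fit in the same budget as `samples_se` SE samples."""
    return int(samples_se * params_se / params_wc)

def coverage(correct: list[list[bool]]) -> float:
    """Fraction of questions with at least one correct sampled solution."""
    return sum(any(q) for q in correct) / len(correct)

# Hypothetical sizes: a 27B SE model vs. a 9B WC model -> 3x the samples.
n_wc = compute_matched_samples(1, params_se=27e9, params_wc=9e9)

# Toy correctness grids (one inner list per question, one bool per sample):
se_correct = [[True], [False], [False]]            # 1 sample per question
wc_correct = [[True, False, True],                 # 3 samples per question
              [False, True, False],
              [False, False, False]]

print(n_wc, coverage(se_correct), coverage(wc_correct))
```

In this toy example the extra WC samples double coverage, which mirrors the coverage/diversity advantage the paper reports for WC data, at the cost of a higher false positive rate (not modeled here, since it requires judging the reasoning, not just the final answer).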
Low Difficulty Summary (GrooveSquid.com, original content)
This paper explores whether a common way of improving language models is the best use of computing resources. Language models are often trained on fake (synthetic) data created by other models, which can be stronger or weaker. The study compares data from stronger, more expensive models against data from weaker, cheaper ones, and finds that the weaker models can create more varied fake data that covers more problems, even though it contains more mistakes. When language models are trained on this fake data, the ones trained with the weaker models' data perform better than those trained with the stronger models' data. This suggests we might not need the strongest models to train our language models.

Keywords

» Artificial intelligence  » Inference  » Knowledge distillation  » Synthetic data