Summary of Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling, by Hritik Bansal et al.
Smaller, Weaker, Yet Better: Training LLM Reasoners via Compute-Optimal Sampling
by Hritik Bansal, Arian Hosseini, Rishabh Agarwal, Vinh Q. Tran, Mehran Kazemi
First submitted to arXiv on: 29 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from whichever version suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Training on high-quality synthetic data generated by strong language models (LMs) is a common strategy for improving LMs’ reasoning performance. This paper asks whether that approach is compute-optimal under a fixed inference budget, examining the trade-off between generating synthetic data with a stronger but more expensive (SE) model and with a weaker but cheaper (WC) model. Evaluating the generated data on coverage, diversity, and false positive rate, the authors find that WC-generated data achieves higher coverage and diversity but also a higher false positive rate. Finetuning LMs on SE- and WC-generated data in three settings (knowledge distillation, self-improvement, and weak-to-strong improvement) shows that models trained on WC-generated data consistently outperform those trained on SE-generated data across multiple benchmarks and model choices. These results challenge the prevailing practice of relying on SE models for synthetic data generation and suggest that WC sampling may be the compute-optimal approach. |
Low | GrooveSquid.com (original content) | This paper asks whether a common way of improving language models is the best use of compute. Language models are often trained on synthetic ("fake") data created by other, stronger or weaker models. Comparing the two approaches, the study finds that weaker models can create fake data that is more varied and covers more problems, even if it is less accurate. When language models are trained on this fake data, the ones using the weaker model's data perform better than those using the stronger model's data. This suggests we may not need the strongest models to generate training data for our language models. |
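The "fixed inference budget" comparison rests on a simple piece of arithmetic: generating one solution of a given length costs FLOPs roughly proportional to the model's parameter count, so the same budget buys proportionally more samples from the smaller model. The sketch below illustrates that ratio; the function name is our own, and the Gemma2-27B/9B pairing is one of the model pairs the paper compares.

```python
# Illustrative sketch of compute-matched sampling (names are ours, not the
# paper's code). Decoding T tokens with a P-parameter model costs roughly
# 2*P*T FLOPs, so a fixed budget buys P_se / P_wc times more samples from
# the weaker-but-cheaper (WC) model than from the stronger (SE) one.

def compute_matched_samples(p_se: float, p_wc: float, samples_se: int) -> int:
    """Number of WC samples affordable under the FLOPs budget that buys
    `samples_se` samples from the SE model (same solution length assumed)."""
    return int(samples_se * (p_se / p_wc))

# Example pairing from the paper: Gemma2-27B (SE) vs Gemma2-9B (WC).
# At a matched budget, every 1 SE sample trades for 3 WC samples.
print(compute_matched_samples(p_se=27e9, p_wc=9e9, samples_se=1))
```

This 3x sampling advantage is what drives the WC model's higher coverage and diversity in the paper's evaluation, at the cost of a higher false positive rate per sample.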
Keywords
» Artificial intelligence » Inference » Knowledge distillation » Synthetic data