
Summary of Teaching Large Language Models to Reason with Reinforcement Learning, by Alex Havrilla et al.


Teaching Large Language Models to Reason with Reinforcement Learning

by Alex Havrilla, Yuqing Du, Sharath Chandra Raparthy, Christoforos Nalmpantis, Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, Roberta Raileanu

First submitted to arXiv on: 7 Mar 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper compares several reinforcement learning algorithms, including Expert Iteration, Proximal Policy Optimization (PPO), and Return-Conditioned RL, at improving the reasoning capabilities of language models. The algorithms learn from feedback given to the language model either heuristically or via a learned reward model. The study examines the impact of sparse versus dense rewards, as well as the effect of different model sizes and of initializing with or without supervised fine-tuning (SFT) data. The results show that all algorithms perform comparably, with Expert Iteration performing best in most cases. Notably, the sample complexity of Expert Iteration is comparable to that of PPO, requiring around 10^6 samples to converge from a pre-trained checkpoint.
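To make the Expert Iteration idea concrete, here is a minimal, self-contained sketch of one data-collection round, not the paper's actual implementation. All names (`sample_candidates`, `heuristic_reward`, `noisy_adder`) are hypothetical, and a toy arithmetic task with a stubbed "policy" stands in for a real language model and fine-tuning loop.

```python
import random

# Illustrative sketch of one round of Expert Iteration (all names hypothetical):
# 1) sample several candidate solutions per problem from the current policy,
# 2) keep only candidates a reward check accepts (a sparse, heuristic reward),
# 3) fine-tune the policy on the accepted samples, then repeat.

def sample_candidates(policy, prompt, k):
    """Draw k candidate answers from the policy (stubbed as a callable)."""
    return [policy(prompt) for _ in range(k)]

def heuristic_reward(prompt, candidate):
    """Sparse reward: 1 if the candidate's final answer is correct, else 0."""
    expected = sum(prompt)  # toy task: add the numbers in the prompt
    return 1 if candidate == expected else 0

def expert_iteration_round(policy, prompts, k=4):
    """Collect the accepted (prompt, candidate) pairs for later fine-tuning."""
    dataset = []
    for prompt in prompts:
        for cand in sample_candidates(policy, prompt, k):
            if heuristic_reward(prompt, cand) == 1:
                dataset.append((prompt, cand))
    return dataset

# Toy "policy": answers the addition task correctly most of the time.
def noisy_adder(prompt):
    return sum(prompt) + random.choice([0, 0, 0, 1])

random.seed(0)
data = expert_iteration_round(noisy_adder, [(1, 2), (3, 4)], k=8)
# `data` now holds only the correct samples; a real run would fine-tune
# the language model on them and start the next round.
```

The filtering step is what makes the reward sparse: a candidate either passes or it does not, and only passing samples enter the fine-tuning set.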
Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper studies how different reinforcement learning (RL) algorithms can improve language models’ reasoning abilities by learning from feedback on their answers. The researchers tested several algorithms and found that they all perform similarly well, with one called Expert Iteration doing best in most cases. They also discovered that these RL algorithms need a lot of training data, roughly a million samples, to work well.

Keywords

  • Artificial intelligence
  • Fine tuning
  • Language model
  • Optimization
  • Reinforcement learning
  • Supervised