
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

by Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, Aviral Kumar

First submitted to arXiv on: 10 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper explores ways to improve reasoning in large language models by using process reward models (PRMs) that provide feedback at each step of a multi-step reasoning trace. The authors argue that PRMs can enable better credit assignment than outcome reward models (ORMs), which only provide feedback at the final step. However, collecting dense per-step human labels is not scalable, and training PRMs on automatically labeled data has so far shown limited gains. To address this, the paper proposes designing process rewards that measure progress: the change in the likelihood of eventually producing a correct response, before and after a step, evaluated under a prover policy distinct from the base policy being trained. The authors theoretically characterize the set of good provers and demonstrate that optimizing process rewards from such provers improves exploration during test-time search and online reinforcement learning (RL). Empirically, using these process advantage verifiers (PAVs) as dense rewards for RL yields significant gains in sample efficiency, accuracy, and compute efficiency compared to ORMs.
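
To make "progress" concrete, here is a minimal sketch (in Python, not the authors' code) of how a step-level progress reward could be estimated with Monte-Carlo rollouts: the reward for a step is the change in a prover policy's estimated probability of eventually reaching a correct final answer, measured before and after the step. The names `prover_rollout` and `is_correct` are hypothetical stand-ins for a separate prover policy and an answer checker.

```python
def estimate_success_prob(prover_rollout, prefix, is_correct, n_rollouts=8):
    """Monte-Carlo estimate of the prover policy's probability of completing
    `prefix` with a correct final answer."""
    wins = 0
    for _ in range(n_rollouts):
        completion = prover_rollout(prefix)   # prover policy continues the partial trace
        if is_correct(completion):            # check the final answer of the rollout
            wins += 1
    return wins / n_rollouts


def progress_reward(prover_rollout, prefix, step, is_correct, n_rollouts=8):
    """Dense process reward for `step`: the change in the prover's estimated
    chance of eventually producing a correct response, before vs. after the step."""
    before = estimate_success_prob(prover_rollout, prefix, is_correct, n_rollouts)
    after = estimate_success_prob(prover_rollout, prefix + step, is_correct, n_rollouts)
    return after - before  # positive when the step makes success more likely
```

Note that, per the paper's framing, this progress is measured under a prover policy distinct from the base policy that is being trained or searched over.
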
Low Difficulty Summary (original content by GrooveSquid.com)
This paper helps us make better language models by giving them feedback at every step of their reasoning. Right now, language models usually only get feedback once they are done, which makes it hard to tell which steps actually helped. The authors suggest that giving feedback at each step helps models learn faster and become more accurate. But how do we make this work? They propose a new way to measure progress at each step and show that it improves both test-time search and online reinforcement learning.
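
As a rough illustration (a sketch under assumptions, not the paper's exact procedure) of how such step-level scores might guide test-time search, a beam search over partial solutions can keep the candidates whose steps score highest under a progress reward like the one sketched above. Here `generate_next_steps` is a hypothetical stand-in for sampling candidate next steps from the base policy, and `progress_fn(trace, step)` could be `progress_reward` from the earlier sketch with the prover and answer checker bound in.

```python
def beam_search_with_progress(generate_next_steps, progress_fn,
                              prompt, beam_width=4, max_steps=10):
    """Beam search over partial reasoning traces, ranked by cumulative
    step-level progress rather than only a final outcome check."""
    beams = [(prompt, 0.0)]  # (partial trace, cumulative progress score)
    for _ in range(max_steps):
        candidates = []
        for trace, score in beams:
            for step in generate_next_steps(trace):  # sample candidate next steps
                candidates.append((trace + step, score + progress_fn(trace, step)))
        if not candidates:
            break
        # keep the beam_width partial traces with the highest cumulative progress
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]  # return the best-scoring trace
```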

Keywords

  • Artificial intelligence
  • Online learning
  • Reinforcement learning