
Summary of Entropy-regularized Process Reward Model, by Hanning Zhang et al.


Entropy-Regularized Process Reward Model

by Hanning Zhang, Pengcheng Wang, Shizhe Diao, Yong Lin, Rui Pan, Hanze Dong, Dylan Zhang, Pavlo Molchanov, Tong Zhang

First submitted to arXiv on: 15 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors): the paper’s original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes a novel approach to improving the complex mathematical reasoning of large language models (LLMs) by integrating reinforcement learning (RL) guided by process rewards, which score each intermediate reasoning step rather than only the final answer. The authors introduce an entropy-regularized process reward model (ER-PRM), which balances policy optimization against the need to keep the policy from drifting too far from its initial distribution. The proposed ER-PRM consistently outperforms existing process reward models on two benchmark datasets, achieving a 1% improvement on GSM8K and a 2-3% improvement on MATH under best-of-N evaluation.
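To make the entropy-regularized idea concrete, here is a minimal sketch of one common way such regularization shapes step-level reward labels: pooling Monte Carlo rollout rewards for a reasoning step with a log-mean-exp instead of a plain average. The function name, the temperature `eta`, and the exact pooling form are illustrative assumptions, not the paper's verified formulas.

```python
import math

def entropy_regularized_label(rollout_rewards, eta=1.0):
    """Pool Monte Carlo rollout rewards for one reasoning step into a
    soft label via log-mean-exp (an entropy-regularized aggregation).

    eta > 0 weights the pool toward the best rollout (optimistic);
    eta < 0 weights it toward the worst (pessimistic); as eta -> 0
    the pool approaches the plain mean. (Illustrative sketch only.)
    """
    if eta == 0:
        return sum(rollout_rewards) / len(rollout_rewards)
    n = len(rollout_rewards)
    # log-mean-exp, computed stably by shifting by the max exponent
    m = max(eta * r for r in rollout_rewards)
    lse = m + math.log(sum(math.exp(eta * r - m) for r in rollout_rewards) / n)
    return lse / eta

# Example: three rollouts continue from the same step; two succeed.
rewards = [0.0, 1.0, 1.0]
print(entropy_regularized_label(rewards, eta=2.0))   # optimistic pooling (> mean)
print(entropy_regularized_label(rewards, eta=-2.0))  # pessimistic pooling (< mean)
```

With positive `eta` the step gets credit if any rollout succeeds, while negative `eta` penalizes it if any rollout fails; the plain mean of 0.667 sits between the two pooled labels.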
Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models (LLMs) are very smart computers that can understand human language. These models are great at answering questions, but they struggle with math problems. A new way to help them is by giving them rewards for each step of the calculation. This helps them learn to think more logically and avoid making mistakes. The researchers in this paper propose a new reward system called ER-PRM that helps LLMs do better on math problems. They tested it on two big datasets and found that it worked really well, even better than other approaches.

Keywords

  • Artificial intelligence
  • Optimization
  • Reinforcement learning