Free Process Rewards without Process Labels

by Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kaiyan Zhang, Bowen Zhou, Zhiyuan Liu, Hao Peng

First submitted to arXiv on: 2 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)

Links: Abstract of paper · PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper addresses the challenge of training process reward models (PRMs), which score reasoning trajectories step by step, by proposing an implicit PRM that can be obtained at no additional cost. The authors show theoretically and empirically that an outcome reward model (ORM) trained only on cheaper response-level labels yields an implicit PRM, under the single assumption that the outcome reward is parameterized as the log-likelihood ratio between the policy and a reference model. The paper instantiates the implicit PRM with various training objectives and evaluates it on MATH, where it outperforms a strong baseline while using less than 1/38 of the training data. The authors also study scaling up instructions versus responses and find that the latter brings larger gains. Additionally, they find that the approach is more data-efficient when instantiated with the cross-entropy loss, allowing training with only a handful of responses per instruction.
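To make the core idea concrete, below is a brief Python sketch of how step-level rewards can emerge from an outcome reward parameterized as a log-likelihood ratio: the reward of a step is the change in β·log(π(y≤t|x)/π_ref(y≤t|x)) between consecutive prefixes, which reduces to a sum of per-token log-ratios inside the step. This is an illustration rather than the authors' implementation; the function name, the toy numbers, and the assumption that per-token log-probabilities from the policy and reference models are already available are ours.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             step_boundaries: list[int],
                             beta: float = 1.0) -> list[float]:
    """Per-step implicit process rewards from per-token log-probabilities.

    policy_logprobs / ref_logprobs: log-probabilities of each response token
        under the policy and the reference model (1-D tensors, equal length).
    step_boundaries: token indices where each reasoning step ends (exclusive).
    """
    # Prefix reward r(y_<=t) = beta * sum of per-token log-ratios up to t,
    # so the reward of step k, r(prefix_k) - r(prefix_{k-1}), is the sum of
    # per-token log-ratios over the tokens belonging to step k.
    per_token_ratio = beta * (policy_logprobs - ref_logprobs)
    rewards, start = [], 0
    for end in step_boundaries:
        rewards.append(per_token_ratio[start:end].sum().item())
        start = end
    return rewards

# Toy usage: a 6-token response split into two reasoning steps.
policy_lp = torch.tensor([-0.2, -0.5, -0.1, -0.9, -0.3, -0.4])
ref_lp = torch.tensor([-0.4, -0.6, -0.3, -0.7, -0.5, -0.6])
print(implicit_process_rewards(policy_lp, ref_lp, step_boundaries=[3, 6]))
```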

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper helps us better understand how to train machines to reason step by step. Right now it is hard to teach machines this way because every reasoning step needs its own label, but the authors found a clever trick that provides step-by-step rewards without that extra labeling work. They showed that by setting up the reward model in a slightly different way, they can get step-level feedback from labels that only say whether the final answer is right. This is important because step-by-step labels are expensive to collect, and we often have very little of that data. The authors tested their idea on math problems and found that it worked really well. They also learned that gathering more answers for each question helps even more than adding new questions.

Keywords

» Artificial intelligence  » Cross entropy  » Log likelihood