Summary of On Designing Effective RL Reward at Training Time for LLM Reasoning, by Jiaxuan Gao et al.
On Designing Effective RL Reward at Training Time for LLM Reasoning
by Jiaxuan Gao, Shusheng Xu, Wenjie Ye, Weilin Liu, Chuyi He, Wei Fu, Zhiyu Mei, Guangju Wang, Yi Wu
First submitted to arXiv on: 19 Oct 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract (available on arXiv). |
Medium | GrooveSquid.com (original content) | This paper investigates the role of reward models in improving the reasoning capabilities of Large Language Models (LLMs) during Reinforcement Learning (RL) training. The authors evaluate popular reward models, including the Outcome-supervised Reward Model (ORM) and the Process-supervised Reward Model (PRM), for RL training with sparse success rewards. Surprisingly, they find that these learned reward models may fail to improve, and can even hinder, RL training, yielding worse performance than LLMs trained solely with success rewards. The authors identify a “reward hacking” issue in which an LLM can collect high rewards by repeating correct but unnecessary reasoning steps. To address this, they introduce two novel reward refinement techniques, Clipping and Delta, which ensure that the cumulative reward of any reasoning trajectory is upper-bounded, so the learned reward models stay effective without being exploited (an illustrative sketch follows the table). The authors demonstrate that, with a carefully designed reward function, RL training can improve LLMs on the MATH and GSM8K benchmarks. |
Low | GrooveSquid.com (original content) | This paper looks at how to make Large Language Models (LLMs) better at solving math problems. The authors try different ways of rewarding the model for its answers, but surprisingly, they find that this doesn’t always work well. In fact, it can even make things worse! They think that’s because the model is just trying to get a high score by doing the same thing over and over again, rather than actually solving the problem. To fix this, they come up with new ways of giving rewards that stop the model from getting too good at finding shortcuts. They test these ideas on some big language models and find that the new rewards make the models better at math problems! |
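
The Clipping and Delta mechanisms described in the medium summary both aim to keep the cumulative process reward of a trajectory bounded, so the model cannot inflate its return by padding the solution with redundant steps. The sketch below is an illustrative Python rendering of that idea, not the authors’ exact formulation: it assumes a list of per-step PRM scores, a hypothetical clipping threshold, and a difference-of-consecutive-scores form for Delta.

```python
# Illustrative sketch only -- not the paper's exact formulation.
# Assumes each reasoning step i has a learned PRM score prm_scores[i],
# on top of the sparse success reward given at the end of the trajectory.

def clipped_rewards(prm_scores, threshold=0.0):
    """Clipping (assumed form): cap every per-step reward at `threshold`,
    so padding a trajectory with redundant correct steps cannot keep
    adding positive reward."""
    return [min(score, threshold) for score in prm_scores]

def delta_rewards(prm_scores):
    """Delta (assumed form): reward the change in PRM score between
    consecutive steps. The sum telescopes to last - first, so the
    cumulative reward stays bounded no matter how long the trajectory is."""
    return [curr - prev for prev, curr in zip(prm_scores, prm_scores[1:])]

if __name__ == "__main__":
    # A trajectory that pads itself with repeated, similarly scored steps.
    scores = [0.2, 0.5, 0.6, 0.6, 0.6, 0.6]
    print(sum(scores))                   # ~3.1: raw PRM return grows with padding
    print(sum(clipped_rewards(scores)))  # 0.0: capped, padding adds nothing
    print(sum(delta_rewards(scores)))    # ~0.4: telescopes to 0.6 - 0.2
```

Either way, the property highlighted in the summary holds: however many extra steps the model appends, the trajectory’s total shaped reward cannot grow without bound.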
Keywords
» Artificial intelligence » Reinforcement learning » Supervised