Summary of Self-Generated Critiques Boost Reward Modeling for Language Models, by Yue Yu et al.
Self-Generated Critiques Boost Reward Modeling for Language Models
by Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou
First submitted to arXiv on 25 Nov 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper's original abstract (available on arXiv). |
| Medium | GrooveSquid.com (original content) | This paper proposes a framework called Critic-RM that improves large language model (LLM) reward modeling by incorporating self-generated natural-language critiques. The authors hypothesize that predicting critiques alongside scalar rewards improves reward-model accuracy, which matters for reinforcement learning from human feedback (RLHF). Critic-RM uses a two-stage process: it first generates high-quality critiques without additional supervision, then jointly fine-tunes on reward prediction and critique generation (see the illustrative sketch after this table). Experiments show that Critic-RM outperforms standard reward models and LLM judges across several benchmarks, with accuracy improvements of 3.7% to 7.3%. The generated critiques are also effective at rectifying flawed reasoning steps, achieving gains of up to 3.2%. |
| Low | GrooveSquid.com (original content) | Large language models (LLMs) need a way to learn what people want them to do. One way to do this is with reward models that score answers according to human preferences. But today's reward models only give a single number and can't explain why an answer is good or bad. The authors think that if the model also writes a short critique pointing out an answer's mistakes, it will be better at assigning rewards. To make this happen, they created a new training method called Critic-RM. It works in two steps: first the model generates good critiques without needing extra human labels, and then it is fine-tuned to get both the rewards and the critiques right. The authors tested the method on several benchmarks and found that it works well. |
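
For readers who want a concrete picture of the joint fine-tuning step described above, here is a minimal sketch of what combining a reward-prediction loss with a critique-generation loss can look like. This is an illustration under stated assumptions, not the paper's implementation: the toy backbone (`ToyCriticRM`), the pairwise Bradley-Terry style reward loss, the loss weight `lam`, and all tensor shapes are placeholders chosen for readability.

```python
# Illustrative sketch of joint reward-prediction + critique-generation training,
# in the spirit of Critic-RM's second stage. Architecture, loss weighting, and
# shapes are assumptions for demonstration, not the paper's actual setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 1000, 64  # toy sizes

class ToyCriticRM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.backbone = nn.GRU(HIDDEN, HIDDEN, batch_first=True)  # stand-in for an LLM
        self.lm_head = nn.Linear(HIDDEN, VOCAB)   # generates critique tokens
        self.reward_head = nn.Linear(HIDDEN, 1)   # predicts a scalar reward

    def forward(self, tokens):
        h, _ = self.backbone(self.embed(tokens))
        return self.lm_head(h), self.reward_head(h[:, -1])  # token logits, scalar reward

def joint_loss(model, chosen, rejected, critique, lam=0.5):
    # Pairwise reward loss: the preferred response should outscore the rejected one.
    _, r_chosen = model(chosen)
    _, r_rejected = model(rejected)
    reward_loss = -F.logsigmoid(r_chosen - r_rejected).mean()
    # Next-token language-modeling loss on the critique text.
    logits, _ = model(critique[:, :-1])
    lm_loss = F.cross_entropy(logits.reshape(-1, VOCAB), critique[:, 1:].reshape(-1))
    return reward_loss + lam * lm_loss  # lam trades off the two objectives

model = ToyCriticRM()
chosen = torch.randint(0, VOCAB, (2, 16))    # preferred responses (token ids)
rejected = torch.randint(0, VOCAB, (2, 16))  # dispreferred responses
critique = torch.randint(0, VOCAB, (2, 24))  # critique text to learn to generate
print(joint_loss(model, chosen, rejected, critique))
```

In an actual Critic-RM-style setup, the backbone would be the LLM being fine-tuned, and the critique tokens would come from the first stage's self-generated critiques rather than random placeholders.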
Keywords
» Artificial intelligence » Fine-tuning » Large language model » Reinforcement learning from human feedback » RLHF