Summary of Improving Reinforcement Learning From Human Feedback Using Contrastive Rewards, by Wei Shen et al.
Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards
by Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu
First submitted to arXiv on: 12 Mar 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper proposes an approach to improving reinforcement learning from human feedback (RLHF) for large language models (LLMs). Standard RLHF relies heavily on accurate and informative reward models, which are prone to errors. To address this limitation, the authors introduce a penalty term called contrastive rewards. The method involves two steps: offline sampling to obtain baseline responses, and computing a contrastive reward from those baselines for use in Proximal Policy Optimization (PPO) training (a rough code sketch appears after the table). The contrastive reward enables the LLM to penalize reward uncertainty, improve robustness, and reduce variance in PPO. Experimental results show that contrastive rewards substantially improve RLHF, outperforming strong baselines as evaluated by both GPTs and humans. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper improves how large language models (LLMs) learn from human feedback. Right now, this process is fragile because it relies on good reward models. Reward models are like a set of instructions that tell the LLM what to do. But sometimes these instructions can be wrong or unclear. The authors came up with a new idea called contrastive rewards. This idea makes the LLM better at ignoring bad instructions and focusing on good ones. They tested their method and found it works really well, even better than some strong competitors. |
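To make the contrastive-reward idea above more concrete, here is a minimal sketch in Python. It assumes a scalar reward model exposed as a callable `reward_model(prompt, response)` and a small set of offline-sampled baseline responses per prompt; these names and the simple averaging choice are illustrative assumptions, not details taken verbatim from the paper.

```python
from typing import Callable, List


def contrastive_reward(
    prompt: str,
    response: str,
    baseline_responses: List[str],
    reward_model: Callable[[str, str], float],
) -> float:
    """Score a policy response relative to offline-sampled baseline responses."""
    # Raw reward-model score for the policy's response.
    raw = reward_model(prompt, response)
    # Average reward of the baseline responses sampled offline for this prompt.
    baseline = sum(reward_model(prompt, b) for b in baseline_responses) / len(
        baseline_responses
    )
    # The contrastive reward is the raw score minus the baseline: responses that
    # do not beat the offline baselines are effectively penalized. This value
    # would then serve as the reward signal during PPO training in place of the
    # raw reward-model score.
    return raw - baseline
```

In this sketch, the baseline acts as a per-prompt calibration of the reward model, which is what lets the contrastive term absorb some of the reward model's noise before PPO sees it.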
Keywords
» Artificial intelligence » Optimization » RLHF