Summary of Improving Reinforcement Learning From Human Feedback Using Contrastive Rewards, by Wei Shen et al.


Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards

by Wei Shen, Xiaoying Zhang, Yuanshun Yao, Rui Zheng, Hongyi Guo, Yang Liu

First submitted to arXiv on: 12 Mar 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes a new approach to reinforcement learning from human feedback (RLHF) for large language models (LLMs). Standard RLHF relies heavily on an accurate and informative reward model, which is prone to errors. To address this limitation, the authors introduce a penalty term called the contrastive reward. The method involves two steps: offline sampling to obtain baseline responses, and using those baselines to compute the contrastive reward that is then optimized with Proximal Policy Optimization (PPO). The contrastive reward lets the LLM penalize reward uncertainty, improve robustness, and reduce variance in PPO (a minimal code sketch of this idea follows the summaries below). Experiments show that contrastive rewards significantly improve RLHF, outperforming strong baselines as evaluated by both GPT models and human annotators.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper improves how large language models (LLMs) learn from human feedback. Right now, that process is fragile because it depends on good reward models. A reward model is like a set of instructions that tells the LLM what to do, but sometimes those instructions are wrong or unclear. The authors propose a new idea called contrastive rewards, which helps the LLM ignore bad instructions and focus on good ones. They tested their method and found that it works well, even better than some strong competitors.

Keywords

» Artificial intelligence  » Optimization  » RLHF