
Summary of Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference, by Qining Zhang et al.


Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference

by Qining Zhang, Lei Ying

First submitted to arXiv on: 25 Sep 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper tackles challenges in Reinforcement Learning from Human Feedback (RLHF) by developing two novel algorithms that eliminate the need for reward inference. The authors address fundamental issues in RLHF pipelines, such as distribution shift, overfitting, and problem misspecification. To achieve this, they estimate local value functions from human preferences and apply zeroth-order gradient approximation to estimate policy gradients. The resulting methods have polynomial convergence-rate guarantees and outperform popular baselines such as Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO). This work highlights the potential for efficient RLHF without reward inference. A minimal code sketch of this idea appears after the summaries below.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about helping computers learn from people's preferences without first building a model of what makes people happy. Teaching computers with human feedback is tricky because that intermediate model might not capture what we actually want. The authors found a way to skip that step and directly improve the computer's behavior. They did this by estimating how much better one choice is than another purely from people's preferences. The new approach worked well in simulations and even outperformed methods that first try to learn a reward model.

Keywords

» Artificial intelligence  » Inference  » Optimization  » Overfitting  » Reinforcement learning from human feedback  » RLHF