
Summary of Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, by Rafael Rafailov et al.


Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms

by Rafael Rafailov, Yaswanth Chittepu, Ryan Park, Harshit Sikchi, Joey Hejna, Bradley Knox, Chelsea Finn, Scott Niekum

First submitted to arXiv on: 5 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
The paper investigates the limitations of Direct Alignment Algorithms (DAAs), alternatives to classical Reinforcement Learning from Human Feedback (RLHF), a crucial component of Large Language Model (LLM) development. Although DAAs bypass the separate reward modeling phase, they still exhibit degradation patterns similar to those of classical RLHF methods, including over-optimization and reward hacking. The study formalizes and explores these issues across a range of objectives, training regimes, and model scales (a sketch of a representative DAA objective follows the summaries below).
Low Difficulty Summary (original content by GrooveSquid.com)
The paper looks into why some language models trained with human feedback don’t always get better over time. Researchers found that even when they skip the part where a reward model is created, these models can still have problems optimizing their performance. The study shows that this “reward hacking” issue happens in both old and new ways of training language models.
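For readers who want a concrete picture of what a DAA optimizes, the snippet below sketches the Direct Preference Optimization (DPO) loss, one widely used DAA, in PyTorch. It is an illustrative sketch rather than the paper's implementation; the function name, argument names, and the default beta value are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Illustrative Direct Preference Optimization (DPO) loss, a representative DAA.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) for the chosen / rejected responses under the policy being trained
    and under a frozen reference model. `beta` (an assumed default here)
    controls the strength of the implicit KL constraint to the reference.
    """
    # Implicit rewards: beta-scaled log-probability ratios against the reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference loss on the implicit reward margin. Driving this
    # margin ever higher is the over-optimization behavior the paper studies.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that no explicit reward model is trained: the log-probability ratio against the frozen reference acts as an implicit reward, and the paper's point is that this implicit reward can still be over-optimized much like a learned reward model.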

Keywords

» Artificial intelligence  » Alignment  » Large language model  » Optimization  » Reinforcement learning from human feedback  » RLHF