Summary of Provably Mitigating Overoptimization in RLHF: Your SFT Loss Is Implicitly an Adversarial Regularizer, by Zhihan Liu et al.
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
First submitted to arXiv on: 26 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page. |
Medium | GrooveSquid.com (original content) | This paper investigates overoptimization in generative models aligned with human preferences via Reinforcement Learning from Human Feedback (RLHF). The authors trace the misalignment to distributional shift and uncertainty in learning human preferences. They propose a theoretical algorithm that simultaneously minimizes a maximum-likelihood estimation loss and a reward penalty term, and show it is provably sample-efficient under a partial-coverage condition. A practical reformulation, Regularized Preference Optimization (RPO), combines a preference optimization loss with a supervised fine-tuning (SFT) loss to align large language models (LLMs) with human preferences while discouraging undesired responses; a minimal sketch of this combined loss follows the table. Experiments on LLM alignment demonstrate RPO's improved performance. |
Low | GrooveSquid.com (original content) | This paper helps us understand how computers can learn to generate text that people like. Sometimes these programs get stuck making bad text because they don't really know what people want. The authors figured out why this happens and came up with a new way to improve the programs by combining two approaches: one that tries to make good text directly, and another that helps the program learn from examples of good text. The new approach, called Regularized Preference Optimization (RPO), worked better than previous methods in tests. |
Keywords
» Artificial intelligence » Fine tuning » Likelihood » Optimization » Reinforcement learning from human feedback » RLHF » Supervised