Summary of Provably Mitigating Overoptimization in RLHF: Your SFT Loss Is Implicitly an Adversarial Regularizer, by Zhihan Liu et al.
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
by Zhihan Liu, Miao Lu, Shenao Zhang, Boyi Liu, Hongyi Guo, Yingxiang Yang, Jose Blanchet, Zhaoran Wang
First submitted to arXiv on: 26 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page. |
Medium | GrooveSquid.com (original content) | This paper investigates overoptimization in generative models aligned with human preferences via Reinforcement Learning from Human Feedback (RLHF). The authors trace the misalignment to distributional shift and uncertainty in learning human preferences. They propose a theoretical algorithm that simultaneously minimizes a maximum-likelihood estimation loss and a reward penalty term, and show it is provably sample-efficient under a partial-coverage condition. A practical reformulation, Regularized Preference Optimization (RPO), combines a preference optimization loss with a supervised fine-tuning (SFT) loss to align large language models (LLMs) with human preferences while discouraging undesired responses; a minimal sketch of this combined loss follows the table. Experiments on LLM alignment demonstrate RPO's improved performance. |
Low | GrooveSquid.com (original content) | This paper helps us understand how computers can learn to generate text that people like. Sometimes these programs get stuck making bad text because they don't really know what people want. The authors figured out why this happens and came up with a new way to improve the programs by combining two approaches: one that tries to make good text directly, and another that helps the program learn from examples of good text. The new approach, called Regularized Preference Optimization (RPO), worked better than previous methods in tests. |
Keywords
» Artificial intelligence » Fine tuning » Likelihood » Optimization » Reinforcement learning from human feedback » RLHF » Supervised