Summary of Catastrophic Goodhart: Regularizing RLHF with KL Divergence Does Not Mitigate Heavy-Tailed Reward Misspecification, by Thomas Kwa et al.
Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
by Thomas Kwa, Drake Thomas, Adrià Garriga-Alonso
First submitted to arXiv on: 19 Jul 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper asks whether KL-divergence regularization protects reinforcement learning from human feedback (RLHF) against errors in the learned reward. Because the reward is learned from data, it always contains some error, and practitioners commonly mitigate this by penalizing the policy's KL divergence from a base model. The study shows that when the reward error is light-tailed, optimal policies achieve high true utility even under weaker KL penalties. However, if the error is heavy-tailed, some policies obtain arbitrarily high proxy reward while providing no more utility than the base model, a phenomenon the authors call catastrophic Goodhart (see the sketch after this table). The authors adapt a discrete optimization method to measure the tails of existing reward models and find that they are consistent with light-tailed error. Nevertheless, because heavy-tailed distributions are pervasive in real-world applications, future sources of RL reward could have heavy-tailed error, increasing the likelihood of reward hacking even with KL regularization. |
Low | GrooveSquid.com (original content) | This paper looks at how AI systems learn from human feedback (reinforcement learning). Because the reward signal is learned from people's ratings, it is never perfect, so researchers add a penalty that keeps the AI close to its original behavior. The authors found that if the reward's mistakes are mild, this works well. But if the mistakes can occasionally be huge, the AI can score very high on the learned reward without actually getting any better, which they call "catastrophic Goodhart". They also measured today's reward models, and those look okay, but because rare, huge errors are common elsewhere in the real world, this problem could still show up in the future. |
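To make the regularized objective described in the medium summary concrete, here is a minimal LaTeX sketch of the setup; the notation (π₀ for the base policy, β for the KL coefficient, U for true utility, ε for reward error) is chosen for illustration and is not quoted from the paper.

```latex
% Minimal sketch of the KL-regularized RLHF objective discussed above.
% The symbols pi, pi_0, beta, U, and epsilon are illustrative assumptions,
% not notation taken verbatim from the paper.
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
The learned (proxy) reward is the true utility plus an error term,
$\hat{R}(x) = U(x) + \epsilon(x)$, and RLHF maximizes the proxy reward
subject to a KL penalty toward the base policy $\pi_0$:
\[
  \pi^{\star}
    = \arg\max_{\pi}\;
      \mathbb{E}_{x \sim \pi}\bigl[\hat{R}(x)\bigr]
      - \beta\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right).
\]
Catastrophic Goodhart is the regime where, for heavy-tailed $\epsilon$,
policies exist whose expected proxy reward is arbitrarily large while their
expected true utility $\mathbb{E}_{x \sim \pi}[U(x)]$ is no higher than under
$\pi_0$, even at small KL divergence from $\pi_0$.
\end{document}
```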
Keywords
» Artificial intelligence » Likelihood » Optimization » Regularization » Reinforcement learning » Reinforcement learning from human feedback » RLHF