Summary of Corruption Robust Offline Reinforcement Learning with Human Feedback, by Debmalya Mandal et al.
Corruption Robust Offline Reinforcement Learning with Human Feedback
by Debmalya Mandal, Andi Nika, Parameswaran Kamalaruban, Adish Singla, Goran Radanović
First submitted to arXiv on: 9 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | The paper's original abstract (available on arXiv) |
| Medium | GrooveSquid.com (original content) | We study how to design algorithms for reinforcement learning with human feedback (RLHF) in an offline setting where a fraction of the data is corrupted. This matters because real-world data may be subject to adversarial attacks or contain noisy human preferences. Our goal is to identify a near-optimal policy from the corrupted data, with provable guarantees. Prior work has addressed either corruption-robust RL (learning from scalar rewards) or offline RLHF (learning from human feedback), but our problem requires combining the two. We design novel methods for corruption-robust offline RLHF under different assumptions about the coverage of the data-generating distributions. Our approach first learns a reward model together with confidence sets, then uses an offline corruption-robust RL oracle to learn a pessimistic optimal policy over the confidence set. To our knowledge, this is the first work to provide provably corruption-robust offline RLHF methods. |
| Low | GrooveSquid.com (original content) | Imagine trying to teach a robot or computer program what's good and bad by giving it feedback from humans. But what if some of that feedback is wrong or misleading? That's the problem we're tackling in this paper. We want to design algorithms that can learn from human feedback even when some of it is corrupted. This is important because in real life, we might get fake or biased feedback from humans. Our solution involves learning a model of what makes something good or bad and then using that model to find the best actions despite the corrupted data. To our knowledge, this is the first time someone has come up with a way to do this that also gives us provable guarantees. |
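The medium-difficulty summary describes a pipeline of three parts: fit a reward model from preferences, build a confidence set around it, and choose a policy pessimistically over that set. The Python sketch below illustrates only that generic "reward model + confidence set + pessimism" pattern in a one-step (bandit) setting. It rests on assumptions that are not taken from the paper: a linear Bradley-Terry preference model, 2-D features, an ellipsoidal confidence set with a hand-picked radius `beta`, and random label flips as a crude stand-in for corruption (real adversarial corruption can be far worse). All function names are hypothetical, and the sketch does not implement the paper's corruption-robust offline RL oracle or reproduce its guarantees.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def fit_bradley_terry(pairs, dim, lam=1.0, lr=1.0, steps=200):
    """Regularized MLE for an assumed linear reward r(x) = theta . phi(x).

    pairs: list of (phi_preferred, phi_rejected) feature vectors.
    Gradient descent on (1/n) * [sum_k -log sigmoid(theta . d_k) + lam/2 * ||theta||^2],
    where d_k = phi_preferred - phi_rejected.
    """
    n = len(pairs)
    diffs = [[a - b for a, b in zip(w, l)] for w, l in pairs]
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [lam * t for t in theta]  # gradient of the ridge term
        for d in diffs:
            p = sigmoid(dot(theta, d))
            for i in range(dim):
                grad[i] -= (1.0 - p) * d[i]  # gradient of -log sigmoid(theta . d)
        theta = [t - lr * g / n for t, g in zip(theta, grad)]
    return theta


def design_matrix_2d(pairs, lam=1.0):
    """Sigma = lam * I + sum_k d_k d_k^T for 2-D feature differences d_k."""
    s = [[lam, 0.0], [0.0, lam]]
    for w, l in pairs:
        d = [w[0] - l[0], w[1] - l[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def inv_norm_2d(sigma, phi):
    """||phi||_{Sigma^{-1}} = sqrt(phi^T Sigma^{-1} phi) for a 2x2 Sigma."""
    (s11, s12), (s21, s22) = sigma
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    q = sum(phi[i] * inv[i][j] * phi[j] for i in range(2) for j in range(2))
    return math.sqrt(q)


def pessimistic_reward(theta_hat, sigma, phi, beta):
    """Worst case of theta . phi over {theta : ||theta - theta_hat||_Sigma <= beta}."""
    return dot(theta_hat, phi) - beta * inv_norm_2d(sigma, phi)


def choose_pessimistic(theta_hat, sigma, candidates, beta):
    """Pick the candidate (name -> feature vector) with the highest pessimistic reward."""
    return max(candidates, key=lambda k: pessimistic_reward(theta_hat, sigma, candidates[k], beta))


if __name__ == "__main__":
    rng = random.Random(0)
    theta_star = [1.0, 0.5]  # hypothetical ground-truth reward parameters
    eps = 0.1                # hypothetical fraction of flipped (corrupted) labels
    pairs = []
    for _ in range(500):
        # The second feature has a small spread, so the offline data covers it poorly.
        x = [rng.uniform(-1, 1), 0.1 * rng.uniform(-1, 1)]
        y = [rng.uniform(-1, 1), 0.1 * rng.uniform(-1, 1)]
        x_wins = rng.random() < sigmoid(dot(theta_star, [a - b for a, b in zip(x, y)]))
        if rng.random() < eps:
            x_wins = not x_wins  # corruption: the preference label is flipped
        pairs.append((x, y) if x_wins else (y, x))
    theta_hat = fit_bradley_terry(pairs, dim=2)
    sigma = design_matrix_2d(pairs)
    candidates = {"well_covered": [1.0, 0.0], "poorly_covered": [0.0, 3.0]}
    print("theta_hat:", theta_hat)
    print("pessimistic choice:", choose_pessimistic(theta_hat, sigma, candidates, beta=1.0))
```

For an ellipsoidal set, the worst-case reward has the closed form θ̂ᵀφ − β‖φ‖_{Σ⁻¹}. Feature directions that the offline data rarely explored (small entries of Σ) are therefore penalized more heavily, so a well-covered option can be preferred even when a poorly covered one has a higher point estimate.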
Keywords
- Artificial intelligence
- Reinforcement learning
- RLHF




