Summary of A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback, by Kihyun Kim et al.
A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback
by Kihyun Kim, Jiawei Zhang, Asuman Ozdaglar, Pablo A. Parrilo
First submitted to arXiv on: 20 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | A novel linear programming (LP) framework is introduced for offline reward learning in sequential decision-making problems. Tailored to both inverse reinforcement learning (IRL) and reinforcement learning from human feedback (RLHF), the framework estimates a feasible reward set from pre-collected data without online exploration, and it comes with an optimality guarantee and provable sample efficiency. This makes it possible to align reward functions with human feedback while remaining computationally tractable and sample-efficient. Analytical examples and numerical experiments illustrate settings in which the LP framework can outperform conventional maximum likelihood estimation (MLE); an illustrative code sketch of the LP idea appears below the table. |
| Low | GrooveSquid.com (original content) | This paper introduces a new way to learn rewards in decision-making problems. It uses a special type of math problem called linear programming to figure out what rewards are best. This approach doesn’t need to try things out online, which can be hard or slow. Instead, it looks at past experiences and makes sure the rewards match what people think is good or bad. This new method works well and might even do better than other ways of doing reward learning. |
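To make the feasible-reward-set idea mentioned in the medium summary more concrete, here is a minimal sketch in the spirit of classical LP-based inverse reinforcement learning on a small tabular MDP. It is an illustration under stated assumptions, not the paper’s actual algorithm: it assumes the transition matrices are known and that the expert policy has already been recovered from demonstrations, and the function name `feasible_reward_lp` and the toy MDP are made up for this example.

```python
# Illustrative sketch (not the paper's method): characterize rewards under which
# a given expert policy is optimal via linear constraints, and pick one by LP.
import numpy as np
from scipy.optimize import linprog


def feasible_reward_lp(P, expert_policy, gamma=0.9, r_max=1.0):
    """Find a state reward vector under which `expert_policy` is optimal.

    P: array of shape (A, S, S), one transition matrix per action (assumed known).
    expert_policy: length-S integer array of expert actions (assumed recovered
    from demonstrations). Returns a length-S reward vector, or None on failure.
    """
    A, S, _ = P.shape
    # Transition matrix induced by the expert policy.
    P_pi = np.array([P[expert_policy[s], s] for s in range(S)])
    inv = np.linalg.inv(np.eye(S) - gamma * P_pi)  # discounted occupancy operator

    # Variables: [r (S entries), t (S entries)], where t[s] lower-bounds the
    # optimality margin of the expert action in state s.
    # Feasibility constraint: (P_pi[s] - P[a, s]) @ inv @ r >= t[s] for a != expert action.
    A_ub, b_ub = [], []
    for s in range(S):
        for a in range(A):
            if a == expert_policy[s]:
                continue
            row = np.zeros(2 * S)
            row[:S] = -(P_pi[s] - P[a, s]) @ inv  # move margin term to the left-hand side
            row[S + s] = 1.0                      # + t[s] <= 0 after negation
            A_ub.append(row)
            b_ub.append(0.0)

    # Maximize the total margin sum(t) subject to bounded rewards |r| <= r_max.
    c = np.concatenate([np.zeros(S), -np.ones(S)])
    bounds = [(-r_max, r_max)] * S + [(None, None)] * S
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=bounds, method="highs")
    return res.x[:S] if res.success else None


# Toy usage: 2 states, 2 actions; the expert steers toward state 0.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # action 0
              [[0.1, 0.9], [0.8, 0.2]]])  # action 1
print(feasible_reward_lp(P, expert_policy=np.array([0, 1])))
```

The margin variables `t` keep the LP from settling for the trivial reward `r = 0`. The paper’s framework goes further by working purely from offline demonstration and preference data, without assuming known dynamics, which this toy sketch does not attempt.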
Keywords
» Artificial intelligence » Likelihood » Reinforcement learning » Reinforcement learning from human feedback » RLHF