Summary of OMPO: A Unified Framework for RL under Policy and Dynamics Shifts, by Yu Luo et al.
OMPO: A Unified Framework for RL under Policy and Dynamics Shifts
by Yu Luo, Tianying Ji, Fuchun Sun, Jianwei Zhang, Huazhe Xu, Xianyuan Zhan
First submitted to arXiv on: 29 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper addresses a fundamental challenge in reinforcement learning (RL): how to train policies using environment interaction data collected under varying policies or dynamics. Existing works often overlook the distribution discrepancies induced by policy or dynamics shifts, leading to suboptimal policy performance and high learning variance. The authors propose a unified strategy called transition occupancy matching, which introduces a surrogate policy learning objective that accounts for transition occupancy discrepancies and reformulates it as a tractable min-max optimization problem. This approach is implemented as the Occupancy-Matching Policy Optimization (OMPO) method, which features an actor-critic structure with a distribution discriminator and a small-size local buffer (an illustrative sketch of this structure appears after the table). The authors conduct extensive experiments on environments including OpenAI Gym, Meta-World, and Panda Robots, showcasing OMPO's effectiveness under policy shifts with stationary and non-stationary dynamics, as well as in domain adaptation. Notably, OMPO outperforms specialized baselines from different categories in all settings. |
Low | GrooveSquid.com (original content) | Imagine you’re trying to teach a robot new skills by giving it lots of different tasks to do. The problem is that the robot might learn to do these tasks in different ways depending on what task it’s doing, or how it’s being controlled. This paper figures out a way to make the robot learn more quickly and accurately by paying attention to how the tasks change over time. They call this method “transition occupancy matching,” and it helps the robot learn new skills even when the tasks are changing all the time. |
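The medium-difficulty summary describes OMPO's actor-critic structure with a distribution discriminator and a small local buffer. The sketch below is a minimal, hypothetical illustration of how such an occupancy-matching loop could be organized: a discriminator is trained to tell recent (local-buffer) transitions from older (global-buffer) ones, and its log-ratio is folded into the critic target as a reward correction, while the actor maximizes the corrected value. Names such as `disc`, `local_batch`, and the `alpha` weight are assumptions made for illustration; this is not the authors' implementation.

```python
# Illustrative sketch (not the paper's code): an occupancy-matching style
# actor-critic update with a transition discriminator, using PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, ACTION_DIM = 4, 2


def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))


# Discriminator over transitions (s, a, s'): distinguishes local-buffer samples
# (current policy/dynamics) from global-buffer samples (mixed history).
disc = mlp(2 * STATE_DIM + ACTION_DIM, 1)
critic = mlp(STATE_DIM + ACTION_DIM, 1)
actor = mlp(STATE_DIM, ACTION_DIM)
opt_disc = torch.optim.Adam(disc.parameters(), lr=3e-4)
opt_critic = torch.optim.Adam(critic.parameters(), lr=3e-4)
opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)


def transition(batch):
    """Concatenate (s, a, s') into one discriminator input."""
    return torch.cat([batch["s"], batch["a"], batch["s2"]], dim=-1)


def update(local_batch, global_batch, gamma=0.99, alpha=0.1):
    """One hypothetical update step; each batch is a dict of tensors s, a, r, s2."""
    # 1) Discriminator: binary classification of where a transition came from.
    logits_local = disc(transition(local_batch))
    logits_global = disc(transition(global_batch))
    disc_loss = (
        F.binary_cross_entropy_with_logits(logits_local, torch.ones_like(logits_local))
        + F.binary_cross_entropy_with_logits(logits_global, torch.zeros_like(logits_global))
    )
    opt_disc.zero_grad()
    disc_loss.backward()
    opt_disc.step()

    # 2) Critic: the reward is augmented with the discriminator log-ratio, which
    #    down-weights transitions unlike those the current policy would generate.
    with torch.no_grad():
        log_ratio = disc(transition(global_batch)).squeeze(-1)
        next_a = torch.tanh(actor(global_batch["s2"]))
        next_q = critic(torch.cat([global_batch["s2"], next_a], dim=-1)).squeeze(-1)
        target = global_batch["r"] + alpha * log_ratio + gamma * next_q
    q = critic(torch.cat([global_batch["s"], global_batch["a"]], dim=-1)).squeeze(-1)
    critic_loss = F.mse_loss(q, target)
    opt_critic.zero_grad()
    critic_loss.backward()
    opt_critic.step()

    # 3) Actor: maximize the corrected Q-value (the "max" side of the min-max game).
    a = torch.tanh(actor(global_batch["s"]))
    actor_loss = -critic(torch.cat([global_batch["s"], a], dim=-1)).mean()
    opt_actor.zero_grad()
    actor_loss.backward()
    opt_actor.step()
```

In this reading, the discriminator plays the "min" role and the actor-critic the "max" role of the min-max objective described in the summary; the small local buffer exists only to provide fresh samples for the discriminator's positive class.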
Keywords
» Artificial intelligence » Attention » Domain adaptation » Optimization » Reinforcement learning