Summary of Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone, by Max Sobol Mark et al.


Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

by Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, Aviral Kumar

First submitted to arxiv on: 9 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each of the summaries below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.
Medium Difficulty Summary (original content by GrooveSquid.com)
Recent advances in decision-making policies have largely been attributed to training expressive policy models via imitation learning. While imitation learning discards non-expert data, reinforcement learning (RL) can still learn from suboptimal data. However, instantiating RL training for a new policy class often presents a challenge: most deep RL machinery is co-developed with assumptions about the policy class and backbone, resulting in poor performance when the policy class changes. To address this issue, the researchers develop an offline RL and online fine-tuning approach called policy-agnostic RL (PA-RL) that can effectively train multiple policy classes with varying architectures and sizes. PA-RL builds on the basic idea that a universal supervised learning loss can replace the policy improvement step in RL, as long as it is applied to “optimized” actions. To obtain these optimized actions, the method first samples multiple actions from a base policy, then runs global optimization (re-ranking the action samples using the Q-function) and local optimization (taking gradient steps on individual action samples) to maximize the critic over these candidates. PA-RL enables fine-tuning diffusion and transformer policies, with either autoregressive token or continuous action outputs and at different sizes, entirely via actor-critic RL. Moreover, PA-RL improves performance and sample efficiency by up to 2x over existing offline RL and online fine-tuning methods. The approach is demonstrated by autonomously fine-tuning OpenVLA, a 7B generalist robot policy, with Cal-QL, an online RL fine-tuning algorithm, improving real-world performance from 40% to 70% within 40 minutes of fine-tuning.
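The summary above describes PA-RL’s core recipe: sample candidate actions from the base policy, re-rank them with the Q-function (global optimization), refine the best candidates with gradient ascent on the critic (local optimization), and then fit the policy to the resulting actions with a supervised loss. Below is a minimal Python/PyTorch sketch of that recipe, not the authors’ code: the policy.sample and policy.log_prob interfaces, the q_function signature, and all hyperparameter values are assumptions made for illustration.

    import torch

    def optimize_actions(policy, q_function, state,
                         num_samples=16, top_k=4, num_grad_steps=5, step_size=1e-2):
        """Return critic-optimized actions for one state (illustrative sketch only)."""
        # Global optimization: sample candidate actions from the base policy
        # and re-rank them by their Q-values.
        states = state.unsqueeze(0).expand(num_samples, -1)        # [num_samples, state_dim]
        candidates = policy.sample(states)                          # [num_samples, action_dim]
        q_values = q_function(states, candidates).squeeze(-1)       # [num_samples]
        best = torch.topk(q_values, k=top_k).indices
        actions = candidates[best].detach().clone().requires_grad_(True)

        # Local optimization: a few gradient-ascent steps on the critic
        # with respect to the selected actions.
        for _ in range(num_grad_steps):
            q_sum = q_function(states[:top_k], actions).sum()
            (grad,) = torch.autograd.grad(q_sum, actions)
            actions = (actions + step_size * grad).detach().requires_grad_(True)

        return actions.detach()

    def policy_improvement_loss(policy, q_function, state):
        """Supervised-style policy update: maximize likelihood of the optimized actions."""
        optimized = optimize_actions(policy, q_function, state)
        states = state.unsqueeze(0).expand(optimized.shape[0], -1)
        return -policy.log_prob(states, optimized).mean()

Because the policy only needs to support sampling and a (log-)likelihood-style training objective, this policy improvement step is agnostic to whether the underlying model is a diffusion policy, an autoregressive transformer, or a continuous-output network, which is the sense in which the method is “policy agnostic.”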
Low Difficulty Summary (original content by GrooveSquid.com)
Recent progress in decision-making policies has come from training expressive models through imitation learning. This paper proposes a new approach called PA-RL that allows training multiple policy classes with different architectures and sizes. The idea is to replace the usual RL policy update with a universal supervised learning loss applied to actions that have been improved using the learned critic. This makes it possible to fine-tune diffusion and transformer policies, which is otherwise challenging. The authors demonstrate the effectiveness of their approach by successfully fine-tuning OpenVLA, a large generalist robot policy, leading to improved performance and efficiency in real-world applications.

Keywords

» Artificial intelligence  » Autoregressive  » Diffusion  » Fine tuning  » Optimization  » Reinforcement learning  » Supervised  » Transformer