
Summary of In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning, by Songjun Tu et al.


In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

by Songjun Tu, Jingbo Sun, Qichao Zhang, Yaocheng Zhang, Jia Liu, Ke Chen, Dongbin Zhao

First submitted to arXiv on: 12 Dec 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The proposed In-Dataset Trajectory Return Regularization (DTR) method for offline preference-based reinforcement learning (PbRL) tackles the challenge of accurately modeling step-wise rewards from trajectory-level preference feedback. By leveraging conditional sequence modeling, DTR mitigates the risk of optimistic trajectory stitching and reward overestimation, which can undermine the pessimism mechanism in offline RL. To achieve this, DTR combines a Decision Transformer with TD-Learning, balancing fidelity to behavior-policy segments that have high in-dataset trajectory returns against selecting optimal actions based on high reward labels. Additionally, an ensemble normalization technique is introduced to integrate multiple reward models, balancing reward differentiation and accuracy. Experimental results demonstrate the superiority of DTR over state-of-the-art baselines.
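
For intuition, the sketch below shows what an ensemble normalization step could look like; it is not the authors' implementation, and the function name ensemble_normalized_reward, the reward_models list, and the choice of per-model min-max normalization are illustrative assumptions. The idea it demonstrates is the one named in the summary: normalize each reward model's step-wise predictions over a batch of in-dataset transitions so the models are comparable, then average them.

```python
# Minimal sketch (not the paper's code): combining an ensemble of learned
# reward models by normalizing each model's outputs before averaging.
import numpy as np

def ensemble_normalized_reward(reward_models, states, actions, eps=1e-8):
    """Average per-model rewards after normalizing each model over the batch.

    reward_models: list of callables r(states, actions) -> np.ndarray of shape (N,)
    states, actions: arrays describing N in-dataset transitions.
    """
    normalized = []
    for r in reward_models:
        raw = r(states, actions)                         # step-wise reward estimates
        lo, hi = raw.min(), raw.max()
        normalized.append((raw - lo) / (hi - lo + eps))  # rescale to [0, 1]
    # Averaging the normalized heads keeps rewards comparable across models,
    # trading off differentiation (each model's ranking) and accuracy (consensus).
    return np.mean(np.stack(normalized, axis=0), axis=0)
```

Under these assumptions, the normalized ensemble reward would then label the offline dataset before the Decision Transformer and TD-Learning components are trained on it.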
Low Difficulty Summary (original content by GrooveSquid.com)
Offline preference-based reinforcement learning (PbRL) helps machines learn from human feedback. This paper tackles a problem called optimistic trajectory stitching, which can appear when human preferences over whole trajectories are used to guess step-by-step rewards and those guesses turn out too optimistic. The new method, In-Dataset Trajectory Return Regularization (DTR), keeps what the machine learns grounded in the data it already has by combining two techniques: Decision Transformer and TD-Learning. DTR also uses a normalization trick to combine several reward models so they work well together. This makes offline PbRL more reliable.

Keywords

» Artificial intelligence  » Regularization  » Reinforcement learning  » Transformer