Summary of Learning Reward and Policy Jointly From Demonstration and Preference Improves Alignment, by Chenliang Li et al.


Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

by Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong

First submitted to arXiv on: 11 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Human-Computer Interaction (cs.HC); Robotics (cs.RO)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed Alignment with Integrated Human Feedback (AIHF) approach integrates human preference data and demonstration data to train reward models and policies in a single stage. This addresses shortcomings of popular approaches such as RLHF, which break alignment into separate stages and thereby underutilize the available data and suffer from distribution mismatch. AIHF admits efficient algorithms that can reduce to, or leverage, existing alignment pipelines such as RLHF and Direct Preference Optimization (DPO). Extensive experiments on language models and robotic control problems show significant performance improvements over existing methods when high-quality preference data is limited. (A minimal sketch of such a combined objective follows the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
AIHF is a new way to align AI with human preferences and values. It combines two kinds of feedback: what humans prefer and how they actually behave. This makes it better than approaches that handle these two things separately. The result is more accurate alignment, which matters for building good foundation models and embodied AI. The method is tested on language models and robotic control problems and works well even with limited preference data.

Keywords

» Artificial intelligence  » Alignment  » Optimization  » RLHF