Summary of Offline Regularised Reinforcement Learning for Large Language Models Alignment, by Pierre Harvey Richemond et al.
Offline Regularised Reinforcement Learning for Large Language Models Alignment
by Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot
First submitted to arXiv on: 29 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract, available via its arXiv listing. |
| Medium | GrooveSquid.com (original content) | This paper proposes a new framework called Direct Reward Optimisation (DRO) for aligning large language models (LLMs) without requiring pairwise preference data. Instead of learning from quadruplet preference datasets (a prompt, two candidate responses, and a judgement of which response is better), DRO uses single-trajectory datasets consisting of prompts, responses, and user feedback, which are more abundant and cheaper to collect. The method relies on a simple mean-squared objective that can be implemented in various ways (an illustrative sketch follows the table). The authors validate their findings using T5 encoder-decoder language models and show that DRO outperforms selected baselines such as Kahneman-Tversky Optimisation (KTO). This work demonstrates the effectiveness of DRO for single-trajectory policy optimisation. |
| Low | GrooveSquid.com (original content) | This paper is about a new way to make large language models better. Normally, people train them with special data that has four parts: a question, two answers, and a label saying which answer is better. This data can be hard and expensive to collect. The authors propose a new method called Direct Reward Optimisation (DRO) that uses simpler data with only three parts: a question, an answer, and whether that answer was good or bad. They test DRO with T5 language models and show it beats other methods such as KTO. This research can help make it easier to improve language models. |
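
The summaries above only say that DRO optimises a simple mean-squared objective over (prompt, response, feedback) triplets. The PyTorch snippet below is a minimal, illustrative sketch of what such an objective could look like, assuming a KL-regularised setup with a frozen reference policy, a learned per-prompt baseline `values`, and a regularisation weight `beta`; the function name `dro_style_loss` and these specifics are assumptions made for illustration, not the paper’s actual implementation.

```python
import torch

def dro_style_loss(policy_logprobs, ref_logprobs, values, rewards, beta=0.1):
    """Illustrative mean-squared single-trajectory objective (a sketch, not the paper's code).

    policy_logprobs: log pi(y|x) for each (prompt, response) pair, shape [B]
    ref_logprobs:    log pi_ref(y|x) under a frozen reference model, shape [B]
    values:          learned per-prompt baseline V(x), shape [B]
    rewards:         scalar feedback r(x, y), e.g. thumbs up/down mapped to +1/-1, shape [B]
    beta:            assumed KL-regularisation strength
    """
    # Residual between the observed reward and the baseline plus the
    # scaled log-ratio of the trained policy to the reference policy.
    residual = rewards - values - beta * (policy_logprobs - ref_logprobs)
    # Mean-squared objective over single trajectories.
    return 0.5 * (residual ** 2).mean()


# Toy usage with random tensors standing in for model outputs.
if __name__ == "__main__":
    batch = 4
    policy_logprobs = torch.randn(batch, requires_grad=True)
    ref_logprobs = torch.randn(batch)
    values = torch.randn(batch, requires_grad=True)
    rewards = torch.tensor([1.0, -1.0, 1.0, 1.0])
    loss = dro_style_loss(policy_logprobs, ref_logprobs, values, rewards)
    loss.backward()
    print(float(loss))
```

In practice, `policy_logprobs` and `ref_logprobs` would come from summing per-token log-likelihoods of each response under the trained and reference models; those details go beyond what the summaries state.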
Keywords
» Artificial intelligence » Encoder decoder » Optimization » T5