Summary of Offline Regularised Reinforcement Learning for Large Language Models Alignment, by Pierre Harvey Richemond et al.


Offline Regularised Reinforcement Learning for Large Language Models Alignment

by Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, Bilal Piot

First submitted to arXiv on: 29 May 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes a new framework called Direct Reward Optimisation (DRO) to align large language models (LLMs) without requiring pairwise preference data. Instead of learning from quadruplet datasets (a prompt, two independent responses, and a preference between them), DRO uses single-trajectory datasets consisting of prompts, responses, and user feedback, which are more abundant and cheaper to collect. The method relies on a simple mean-squared objective that can be implemented in various ways (a sketch of one possible form appears after the summaries below). The authors validate their findings using T5 encoder-decoder language models and show that DRO outperforms selected baselines such as Kahneman-Tversky Optimization (KTO). This work demonstrates the effectiveness of DRO for single-trajectory policy optimisation.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about a new way to make large language models better. Normally, people use special data with four parts: a question, two answers, and which answer is best. But this data can be hard and expensive to get. The authors propose a new method called Direct Reward Optimisation (DRO) that uses simpler data with only three parts: a question, an answer, and whether the answer is good or not. They show that DRO works well using special language models and compare it to other methods. This research can help improve how we make language models better.
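
Illustrative sketch of the objective (not from the paper): under one plausible reading, the mean-squared objective mentioned in the medium summary penalises the gap between the observed reward and the sum of a learned value baseline and a KL-scaled log-ratio between the trained policy and a frozen reference policy. The function name dro_loss, the tensor-valued inputs, and this exact parameterisation are assumptions made for illustration only, not the authors' implementation.

    # Minimal PyTorch sketch of a mean-squared, single-trajectory alignment loss.
    # Assumed form: 0.5 * (r(x, y) - V(x) - beta * [log pi(y|x) - log pi_ref(y|x)])^2
    import torch

    def dro_loss(policy_logprobs, ref_logprobs, rewards, values, beta=0.1):
        # Log-ratio between the trained policy and the frozen reference policy.
        log_ratio = policy_logprobs - ref_logprobs
        # Residual between the observed reward and the value baseline plus scaled log-ratio.
        residual = rewards - values - beta * log_ratio
        # Mean squared residual over the batch.
        return 0.5 * (residual ** 2).mean()

    # Usage with a dummy batch of single-trajectory triplets (prompt, response, feedback).
    batch = 4
    loss = dro_loss(
        policy_logprobs=torch.randn(batch),
        ref_logprobs=torch.randn(batch),
        rewards=torch.tensor([1.0, -1.0, 1.0, 1.0]),  # e.g. thumbs-up/down feedback
        values=torch.zeros(batch, requires_grad=True),
        beta=0.1,
    )
    loss.backward()

In practice the value baseline and the policy would be neural networks trained jointly; plain tensors stand in for their outputs here to keep the sketch self-contained.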

Keywords

» Artificial intelligence  » Encoder decoder  » Optimization  » T5