
Summary of Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing, by Biqing Qi et al.


Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

by Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

First submitted to arXiv on: 8 Jun 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper but are written at different levels of difficulty: the medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes a novel approach called Online Fast-Slow chasing DPO (OFS-DPO) for improving the alignment of large language models (LLMs) with human values. Unlike traditional methods that rely on reward models, OFS-DPO trains directly on human preference datasets, eliminating the need for intermediate rewards. However, because human preferences span multiple domains, direct continual training can lead to catastrophic forgetting, limiting DPO’s performance and efficiency. To address this, the authors simulate intraspecific competition among models through fast-slow chasing, which facilitates rapid adaptation: two identical modules built with Low-Rank Adaptation (LoRA) are optimized at different speeds, and a new regularization term guides their learning (a rough code sketch of this setup follows the summaries below). The paper also proposes an extension, Cross-domain Online Fast-Slow chasing DPO (COFS-DPO), which leverages linear combinations of the fast modules’ parameters from different task domains, achieving continual value alignment by fully utilizing historical information. Experimental results show that OFS-DPO outperforms standard DPO in in-domain alignment, while COFS-DPO excels in cross-domain continual learning scenarios.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper helps large language models (LLMs) understand what humans want them to do. Normally, LLMs are trained using rewards given by humans, but this can be tricky because rewards might not always match what humans really want. The authors came up with a new way called OFS-DPO that trains directly on human preferences without needing intermediate rewards. This makes it more efficient and better at learning from feedback. The big problem is when LLMs have to learn from different sources of human feedback, like different tasks or domains. They can forget what they learned before, which is bad news! To solve this issue, the authors introduced a new method that simulates competition among models, making them adapt faster and better.
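
To make the fast-slow chasing idea from the medium-difficulty summary more concrete, here is a minimal, self-contained sketch in PyTorch. It is not the authors' implementation: the toy LoRALinear layer, the stand-in for sequence log-probabilities, the learning rates, and the margin-style chasing regularizer are all illustrative assumptions. Only the overall shape follows the description above: two identical LoRA modules trained at different speeds on a DPO-style preference loss, with a regularization term tying the fast module to the slow one.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B) * self.scale


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO objective on (chosen, rejected) log-probabilities."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()


def policy_logps(module, x):
    """Stand-in for the chosen/rejected sequence log-probabilities of a policy."""
    h = module(x)
    return h[:, 0], h[:, 1]


# Two identical LoRA modules on top of one frozen base layer; only their
# optimization speeds differ, giving the "fast" and "slow" learners.
base = nn.Linear(16, 16)
fast, slow = LoRALinear(base), LoRALinear(base)
opt_fast = torch.optim.AdamW([fast.A, fast.B], lr=5e-4)  # fast module: larger lr
opt_slow = torch.optim.AdamW([slow.A, slow.B], lr=5e-5)  # slow module: smaller lr

lam = 0.1  # weight of the (hypothetical) fast-slow chasing regularizer
for step in range(100):
    x = torch.randn(32, 16)          # dummy batch of preference pairs
    ref_c = ref_r = torch.zeros(32)  # frozen reference-policy log-probs

    fc, fr = policy_logps(fast, x)
    sc, sr = policy_logps(slow, x)
    loss_fast = dpo_loss(fc, fr, ref_c, ref_r)
    loss_slow = dpo_loss(sc, sr, ref_c, ref_r)

    # Chasing regularizer (illustrative): penalize the fast module whenever it
    # falls behind the slow one, keeping the pair in constant competition.
    reg = F.relu(loss_fast - loss_slow.detach())

    opt_fast.zero_grad()
    (loss_fast + lam * reg).backward()
    opt_fast.step()

    opt_slow.zero_grad()
    loss_slow.backward()
    opt_slow.step()
```

In this sketch the slow module acts as a moving target that the fast module must stay ahead of. The cross-domain extension described in the summary, COFS-DPO, would correspond to linearly combining the fast module's LoRA parameters learned on different task domains; that step is not shown here.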

Keywords

» Artificial intelligence  » Alignment  » Continual learning  » LoRA  » Optimization  » Regularization