
Summary of Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive, by Arka Pal et al.


Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

by Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

First submitted to arXiv on: 20 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (written by GrooveSquid.com, original content)
Direct Preference Optimisation (DPO) is a technique that can significantly improve the performance of large language models (LLMs) on downstream tasks such as reasoning, summarization, and alignment. Using pairs of preferred and dispreferred responses, DPO models the relative probability of picking one response over the other and trains the model to increase it. However, the authors show that the standard DPO loss can actually reduce the model’s likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. They observed this phenomenon when fine-tuning LLMs on common datasets, especially those in which the edit distance between pairs of completions is low. To avoid this failure mode, they designed DPO-Positive (DPOP), a new loss function and training procedure. Surprisingly, DPOP outperformed DPO and other fine-tuning procedures across a variety of datasets and downstream tasks, including datasets with high edit distances between completions. Moreover, the DPOP-tuned model outperformed the DPO-tuned model on benchmarks independent of the fine-tuning data. Using DPOP, the authors created and open-sourced Smaug-34B and Smaug-72B, with the latter surpassing 80% average accuracy on the HuggingFace Open LLM Leaderboard.
Low Difficulty Summary (written by GrooveSquid.com, original content)
DPO is a way to make language models better. It trains them on pairs of answers, one that people preferred and one that they did not, so the model learns to favor the preferred one. But sometimes this training makes the model less likely to give the preferred answer at all, because DPO only looks at the gap between the two answers, and that gap can grow even while both answers become less likely. This happens when fine-tuning the model on certain datasets, especially those where the preferred and dispreferred answers differ by only a few words. To fix this, researchers created DPO-Positive (DPOP), a new way to train the model that avoids this problem. Surprisingly, it worked better than other methods! The best part is that it didn’t just work for one type of task or dataset; it improved performance across many different areas.
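
The failure mode described in the summaries can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the values of beta and lam, and the log-probabilities are all assumptions chosen for the example, not values from the paper. The standard DPO loss is written in its usual form. The DPOP-style penalty is one way to express "penalize the model when its likelihood of the preferred completion drops below the reference model's"; consult the paper for the exact objective.

```python
import math


def log_sigmoid(z):
    # Numerically stable log(sigmoid(z)).
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    All arguments are sequence log-probabilities: pol_* from the model
    being trained, ref_* from the frozen reference model.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return -log_sigmoid(beta * margin)


def dpop_style_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=5.0):
    """DPO margin minus a penalty on the preferred completion.

    Assumption: the penalty is active only when the trained model assigns
    the preferred completion a lower log-probability than the reference.
    lam is a hypothetical penalty weight.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    penalty = max(0.0, ref_w - pol_w)
    return -log_sigmoid(beta * (margin - lam * penalty))


# Illustrative log-probabilities under the reference model.
ref_w, ref_l = -10.0, -10.0
# Hypothetical model after training: BOTH completions became less likely,
# but the dispreferred one dropped much further, so the gap grew.
pol_w, pol_l = -12.0, -20.0

dpo_start = dpo_loss(ref_w, ref_l, ref_w, ref_l)
dpo_after = dpo_loss(pol_w, pol_l, ref_w, ref_l)
dpop_start = dpop_style_loss(ref_w, ref_l, ref_w, ref_l)
dpop_after = dpop_style_loss(pol_w, pol_l, ref_w, ref_l)

print("DPO loss went down:", dpo_after < dpo_start)
print("DPOP-style loss went up:", dpop_after > dpop_start)
```

In this toy case the DPO loss rewards the update even though the preferred completion became less likely, while the penalized loss treats the same update as a step in the wrong direction.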

Keywords

* Artificial intelligence
* Alignment
* Fine tuning
* Likelihood
* Loss function
* Probability
* Summarization