Summary of Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive, by Arka Pal et al.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

by Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

First submitted to arXiv on: 20 Feb 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary Direct Preference Optimisation (DPO) is a technique that can significantly improve the performance of large language models (LLMs) on downstream tasks such as reasoning, summarization, and alignment. Using pairs of preferred and dispreferred responses, DPO models the relative probability of picking one response over the other and trains the model to widen that gap. The authors show, however, that the standard DPO loss can reduce the model’s likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred responses still increases. They observe this failure when fine-tuning LLMs on common datasets, especially those where the two completions in a pair have a low edit distance. To avoid it, they design DPO-Positive (DPOP), a new loss function and training procedure. DPOP outperforms DPO and other fine-tuning procedures across a variety of datasets and downstream tasks, including, surprisingly, datasets with high edit distances between completions. The DPOP-tuned model also outperforms the DPO-tuned model on benchmarks independent of the fine-tuning data, such as MT-Bench. Using DPOP, the authors created and open-sourced Smaug-34B and Smaug-72B; the latter was the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.
Low	GrooveSquid.com (original content)	Low Difficulty Summary DPO is a way to make language models better. It trains them on pairs of answers, one that people preferred and one they did not, so the model learns to favor the preferred answer. But the training only rewards the gap between the two answers, so the model can become less likely to give even the preferred answer, as long as the other answer becomes less likely still. This happens when fine-tuning on certain datasets, especially those where the two answers in a pair differ only slightly. To fix this, researchers created DPO-Positive (DPOP), a new way to train the model that avoids this problem. Surprisingly, it worked better than other methods! It also did not just work for one type of task or dataset; it improved performance across many different datasets and tasks.
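
To make the failure mode in the summaries above concrete, here is a minimal sketch in plain Python. The `dpo_loss` function follows the standard published DPO objective for a single preference pair. The `dpop_loss` function adds a penalty that activates only when the preferred completion becomes less likely than under the reference model; this mirrors our reading of the DPOP idea, but the function names, the `lam` weight, and the toy log-probabilities are illustrative assumptions, not the authors' code or settings.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, dispreferred) pair.

    logp_w / logp_l: policy log-probabilities of the preferred (w) and
    dispreferred (l) completions; ref_* are the same quantities under
    the frozen reference model.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """DPO loss plus a DPOP-style penalty (illustrative sketch).

    The penalty is zero unless the policy assigns the preferred completion
    a lower log-probability than the reference model does. `lam` is an
    assumed weight chosen only for this example.
    """
    penalty = max(0.0, ref_logp_w - logp_w)
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l) - lam * penalty
    return -log_sigmoid(beta * margin)


# Toy numbers (assumed): the reference model scores both completions at -10.
start = dict(logp_w=-10.0, logp_l=-10.0, ref_logp_w=-10.0, ref_logp_l=-10.0)
# After some training, the preferred completion got LESS likely (-12),
# but the dispreferred one dropped even further (-20).
after = dict(logp_w=-12.0, logp_l=-20.0, ref_logp_w=-10.0, ref_logp_l=-10.0)

print("DPO  start/after:", dpo_loss(**start), dpo_loss(**after))
print("DPOP start/after:", dpop_loss(**start), dpop_loss(**after))
```

With these toy numbers, the DPO loss goes down even though the preferred completion has become less likely, because only the gap between the two completions matters. The DPOP-style loss goes up instead, so the same update would be discouraged.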

Keywords

* Artificial intelligence
* Alignment
* Fine tuning
* Likelihood
* Loss function
* Probability
* Summarization
