2D-DPO: Scaling Direct Preference Optimization with 2-Dimensional Supervision

by Shilong Li, Yancheng He, Hui Huang, Xingyuan Bu, Jiaheng Liu, Hangyu Guo, Weixun Wang, Jihao Gu, Wenbo Su, Bo Zheng

First submitted to arXiv on: 25 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Recent advances in Direct Preference Optimization (DPO), owing to its simplicity and effectiveness, have led to significant improvements in aligning Large Language Models (LLMs) with human preferences. However, existing methods optimize a single scalar score or ranking reward, neglecting the multi-dimensional nature of human preferences. To address this limitation, the authors extend preference optimization from one dimension to two: segments and aspects. They introduce HelpSteer-2D, a 2D supervision dataset in which each segment of a response is scored along several aspects drawn from response-quality rubrics. Their 2D-DPO framework uses these two-dimensional feedback signals to decompose the overall objective into multi-segment and multi-aspect objectives (a rough sketch of such a decomposition appears after these summaries). Experimental results on popular benchmarks show that 2D-DPO outperforms methods that optimize scalar or one-dimensional preferences.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper improves how computers learn to follow human preferences. Current methods only capture simple feedback, like ranking whole answers from best to worst. But people judge answers in many different ways. The researchers created a new way to teach computers by scoring feedback along two dimensions: each part (segment) of an answer, and several different quality criteria (aspects). They tested this approach on several benchmarks and found that it works much better than older methods.

Keywords

» Artificial intelligence  » Optimization