Summary of Preference Learning Algorithms Do Not Learn Preference Rankings, by Angelica Chen et al.


Preference Learning Algorithms Do Not Learn Preference Rankings

by Angelica Chen, Sadhika Malladi, Lily H. Zhang, Xinyi Chen, Qiuyi Zhang, Rajesh Ranganath, Kyunghyun Cho

First submitted to arXiv on: 29 May 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, which you can read on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
This research investigates how effectively preference learning algorithms steer language models toward preferred outputs. Specifically, it examines whether these algorithms train models to assign higher likelihoods to more preferred outputs than to less preferred ones, a quantity called ranking accuracy (a minimal computation sketch follows these summaries). Surprisingly, the study finds that most state-of-the-art preference-tuned models achieve a ranking accuracy of less than 60% on common preference datasets. The researchers also derive the idealized ranking accuracy that a perfectly optimized model would attain and demonstrate a significant alignment gap between observed and idealized ranking accuracies. They attribute this gap to limitations of the DPO objective, which struggles to fix even mild ranking errors. Additionally, they propose a simple formula for quantifying how difficult a given preference datapoint is to learn.

Low Difficulty Summary (original content by GrooveSquid.com)
This study looks at how well certain algorithms can teach language models to produce the outputs people prefer. It found that even strong models don’t actually get very good at this, ranking the preferred answer higher less than 60% of the time on typical tests. The researchers also worked out how accurate these models would be if training were perfect, and it turns out current models miss that mark by a lot. They think this is because of how one of those training algorithms works. Overall, this study helps us understand what’s going on when we try to teach language models what humans like.

Keywords

  • Artificial intelligence
  • Alignment
  • Optimization