Summary of MaxMin-RLHF: Alignment with Diverse Human Preferences, by Souradip Chakraborty et al.


MaxMin-RLHF: Alignment with Diverse Human Preferences

by Souradip Chakraborty, Jiahao Qiu, Hui Yuan, Alec Koppel, Furong Huang, Dinesh Manocha, Amrit Singh Bedi, Mengdi Wang

First submitted to arXiv on: 14 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Robotics (cs.RO)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com; original content)
This paper addresses a key limitation of Reinforcement Learning from Human Feedback (RLHF): its reliance on a single reward model to represent diverse human preferences. The authors prove an impossibility result showing that single-reward RLHF cannot capture this diversity. To overcome the limitation, they propose a MaxMin alignment objective and learn a mixture of preference distributions via an expectation-maximization (EM) algorithm. The approach is connected to distributionally robust optimization and general-utility RL, underscoring its generality and robustness. Experiments on a small-scale language model (GPT-2) and a large-scale one (Tulu2-7B) show win-rate improvements of over 16% compared to conventional RLHF algorithms, along with better performance for minority preference groups without compromising majority-group performance.
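To make the mechanics described above more concrete, here is a rough sketch of the general recipe: fit one reward model per latent preference group with an EM-style loop over pairwise preference data, then reward the policy with the minimum reward across groups so that no group's preferences are sacrificed. This is an illustrative reconstruction under a Bradley-Terry preference model, not the authors' code; the reward-model callables, the optimizer setup, and the PPO hookup mentioned in the comments are assumptions made for the sake of the example.

```python
import torch
import torch.nn.functional as F

def e_step(reward_models, prompts, chosen, rejected):
    """E-step: soft-assign each preference pair to latent user groups.
    Group k's responsibility for pair i is proportional to the Bradley-Terry
    likelihood of the observed preference under reward model k."""
    with torch.no_grad():
        logliks = []
        for rm in reward_models:
            margin = rm(prompts, chosen) - rm(prompts, rejected)  # shape (N,)
            logliks.append(F.logsigmoid(margin))                  # log P(chosen preferred)
        logliks = torch.stack(logliks, dim=0)                     # (K, N)
        return torch.softmax(logliks, dim=0)                      # responsibilities, (K, N)

def m_step(reward_models, optimizers, prompts, chosen, rejected, resp):
    """M-step: refit each reward model on pairs weighted by its responsibilities."""
    for k, (rm, opt) in enumerate(zip(reward_models, optimizers)):
        margin = rm(prompts, chosen) - rm(prompts, rejected)
        loss = -(resp[k] * F.logsigmoid(margin)).mean()           # weighted Bradley-Terry loss
        opt.zero_grad()
        loss.backward()
        opt.step()

def maxmin_reward(reward_models, prompts, responses):
    """MaxMin objective: score each response by the worst-off group's reward."""
    with torch.no_grad():
        group_rewards = torch.stack([rm(prompts, responses) for rm in reward_models], dim=0)
        return group_rewards.min(dim=0).values
```

In a full pipeline, the scalar returned by maxmin_reward would simply stand in for the single reward signal in a standard RLHF policy-optimization step (e.g., PPO).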

Low Difficulty Summary (written by GrooveSquid.com; original content)
This paper is about teaching computers to learn from people's preferences. Humans have different tastes and opinions, so it is not fair to expect one approach to work for everyone. The researchers found that the current way of doing this (called RLHF) does not take these differences into account. To fix the problem, they came up with a new method that combines many people's preferences. The new method works better and is fairer: it even improves results for groups whose preferences usually carry less weight, without making things worse for everyone else.

Keywords

  • Artificial intelligence
  • Alignment
  • GPT
  • Optimization
  • Reinforcement learning from human feedback
  • RLHF