Summary of Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning, by Yuheng Zhang et al.
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning
by Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu
First submitted to arXiv on: 30 Jun 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Computer Science and Game Theory (cs.GT)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper studies Reinforcement Learning from Human Feedback (RLHF) for large language models (LLMs), moving away from traditional reward-based approaches that rely on the Bradley-Terry (BT) model assumption. Instead, it formulates RLHF under general preferences as a two-player game and proposes Iterative Nash Policy Optimization (INPO), an online algorithm that avoids estimating the expected win rate of individual responses, and with it the associated computational or annotation costs. INPO instead directly minimizes a new loss objective over a preference dataset (a rough illustrative sketch follows this table). The approach is supported by theoretical analysis, and with an LLaMA-3-8B-based SFT model it outperforms state-of-the-art online RLHF algorithms on the AlpacaEval 2.0 and Arena-Hard benchmarks. |
Low | GrooveSquid.com (original content) | The paper is about using artificial intelligence to get better at understanding what humans want. Right now, big language models are really good at producing text, but they don’t always make sense or say what we mean. This paper tries a new way of teaching these models to be more helpful by playing a game with them. The goal is to find the best way for the model to respond to humans so that it keeps improving over time. The authors come up with a new algorithm called INPO, which helps the model learn quickly without needing a lot of extra data or computing power. |
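The medium summary describes INPO as replacing per-response win-rate estimation with a loss minimized directly on preference pairs. The sketch below is a minimal illustration of that general idea, not the paper's exact objective: it regresses the log-probability-ratio margin between a chosen and a rejected response (relative to the previous policy iterate) toward a fixed target. The function name, the per-sequence log-probability tensors, and the parameter `eta` are all illustrative assumptions.

```python
# Illustrative sketch only: a squared loss over log-probability ratios on
# preference pairs, in the spirit of the loss-minimization idea the summary
# describes. It is NOT claimed to be the paper's exact INPO objective.
import torch

def preference_ratio_loss(logp_chosen, logp_rejected,
                          prev_logp_chosen, prev_logp_rejected,
                          eta: float = 0.1) -> torch.Tensor:
    """Push the log-ratio margin of chosen over rejected responses toward
    a fixed target, instead of estimating per-response win rates."""
    # Log-probability ratios of the current policy vs. the previous iterate.
    ratio_chosen = logp_chosen - prev_logp_chosen
    ratio_rejected = logp_rejected - prev_logp_rejected
    margin = ratio_chosen - ratio_rejected
    # Hypothetical fixed regression target controlled by `eta`.
    target = 1.0 / (2.0 * eta)
    return ((margin - target) ** 2).mean()

# Toy usage with per-sequence log-probabilities for a batch of 3 pairs.
logp_c = torch.tensor([-12.3, -9.8, -15.1], requires_grad=True)
logp_r = torch.tensor([-11.9, -10.5, -14.0], requires_grad=True)
prev_c = torch.tensor([-12.0, -10.0, -15.0])
prev_r = torch.tensor([-12.1, -10.2, -14.2])
loss = preference_ratio_loss(logp_c, logp_r, prev_c, prev_r)
loss.backward()
print(float(loss))
```

Because the loss is computed from the policy's own log-probabilities on labeled preference pairs, no separate win-rate estimate per response is needed, which is the cost saving the summary highlights.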
Keywords
» Artificial intelligence » LLaMA » Optimization » Reinforcement learning » RLHF