Summary of Selective Preference Optimization via Token-Level Reward Function Estimation, by Kailai Yang et al.
Selective Preference Optimization via Token-Level Reward Function Estimation
by Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou
First submitted to arXiv on: 24 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The paper proposes a selective alignment strategy called Selective Preference Optimization (SePO), which efficiently selects key tokens for fine-grained preference optimization in large language models. SePO leverages Direct Preference Optimization (DPO) to estimate a token-level reward function, which is then used to score all tokens and select only the most important ones to supervise the target policy model (a minimal code sketch of this selection step follows the table). The authors demonstrate the effectiveness of SePO on three public evaluation benchmarks, reporting significant improvements over competitive baselines while optimizing only 30% of the tokens as key tokens. SePO is also applied to weak-to-strong generalization, where it effectively supervises strong policy models with up to 16.8x more parameters, and the paper further explores selecting key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.
Low | GrooveSquid.com (original content) | SePO is a new way to help large language models focus on what matters in their training data. Right now, these models are trained on every word and phrase, which can make it hard for them to concentrate on the parts that really matter. SePO changes this by using a small helper model to pick out only the most important “key tokens” to train on, which makes it easier for the model to learn from its data. The paper shows that SePO leads to better results than other methods, even though only a fraction of the tokens are used for training.
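To make the selection step above concrete, here is a minimal sketch, assuming (in the spirit of DPO's implicit reward) that the token-level reward is estimated as a scaled log-probability ratio between a small DPO-trained oracle model and its reference model. The function names `token_rewards` and `select_key_tokens`, the `beta` scale, and the toy log-probabilities are illustrative assumptions, not code or numbers from the paper.

```python
def token_rewards(oracle_logprobs, ref_logprobs, beta=0.1):
    """Estimate a token-level reward as the beta-scaled log-probability ratio
    between a small DPO-trained oracle model and its reference model
    (hypothetical helper; `beta` is an illustrative hyperparameter)."""
    return [beta * (o - r) for o, r in zip(oracle_logprobs, ref_logprobs)]

def select_key_tokens(rewards, keep_ratio=0.3, prefer_high=True):
    """Return the positions of the `keep_ratio` fraction of tokens with the
    highest estimated rewards (e.g., highest for a chosen response)."""
    k = max(1, int(round(keep_ratio * len(rewards))))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=prefer_high)
    return sorted(order[:k])

# Toy per-token log-probabilities for one chosen response (illustrative numbers).
oracle_lp = [-0.2, -1.5, -0.1, -2.3, -0.4, -0.9, -0.05, -1.1, -0.3, -0.6]
ref_lp    = [-0.3, -1.4, -0.8, -2.2, -0.5, -2.0, -0.30, -1.0, -0.4, -0.7]

rewards = token_rewards(oracle_lp, ref_lp)
key_positions = select_key_tokens(rewards, keep_ratio=0.3, prefer_high=True)

print("token rewards:", [round(r, 3) for r in rewards])
print("selected key-token positions (~30%):", key_positions)
# Only these positions would contribute to the preference loss on the target model.
```

Under this reading of the summary, only the selected positions (roughly the top 30% by estimated reward) would contribute to the preference loss on the target policy model, while all other tokens are left out of the optimization.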
Keywords
» Artificial intelligence » Alignment » Generalization » Optimization » Token