Summary of Selective Preference Optimization via Token-Level Reward Function Estimation, by Kailai Yang et al.
Selective Preference Optimization via Token-Level Reward Function Estimation
by Kailai Yang, Zhiwei Liu, Qianqian Xie, Jimin Huang, Erxue Min, Sophia Ananiadou
First submitted to arXiv on: 24 Aug 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The paper proposes a selective alignment strategy called Selective Preference Optimization (SePO), which efficiently selects key tokens for fine-grained preference optimization in large language models. SePO leverages Direct Preference Optimization (DPO) to estimate a token-level reward function, which is then used to score all tokens and select only the most important ones to supervise the target policy model (a minimal code sketch of this selection step follows the table). The authors demonstrate the effectiveness of SePO on three public evaluation benchmarks, reporting significant improvements over competitive baselines while optimizing only 30% of the tokens as key tokens. SePO is also applied to weak-to-strong generalization, where it effectively supervises strong policy models with up to 16.8x more parameters, and the paper further explores selecting key tokens from out-of-distribution data to enhance strong policy models and alleviate the over-optimization problem.
Low | GrooveSquid.com (original content) | SePO is a new way to help large language models focus on what matters in their training data. Right now, these models are trained on every word and phrase, which can make it hard for them to concentrate on the parts that really matter. SePO changes this by using a small helper model to pick out only the most important “key tokens” to train on, which makes it easier for the model to learn from its data. The paper shows that SePO leads to better results than other methods, even though only a fraction of the tokens are used for training.
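To make the selection step above concrete, here is a minimal sketch, assuming (in the spirit of DPO's implicit reward) that the token-level reward is estimated as a scaled log-probability ratio between a small DPO-trained oracle model and its reference model. The function names `token_rewards` and `select_key_tokens`, the `beta` scale, and the toy log-probabilities are illustrative assumptions, not code or numbers from the paper.

```python
def token_rewards(oracle_logprobs, ref_logprobs, beta=0.1):
    """Estimate a token-level reward as the beta-scaled log-probability ratio
    between a small DPO-trained oracle model and its reference model
    (hypothetical helper; `beta` is an illustrative hyperparameter)."""
    return [beta * (o - r) for o, r in zip(oracle_logprobs, ref_logprobs)]

def select_key_tokens(rewards, keep_ratio=0.3, prefer_high=True):
    """Return the positions of the `keep_ratio` fraction of tokens with the
    highest estimated rewards (e.g., highest for a chosen response)."""
    k = max(1, int(round(keep_ratio * len(rewards))))
    order = sorted(range(len(rewards)), key=lambda i: rewards[i], reverse=prefer_high)
    return sorted(order[:k])

# Toy per-token log-probabilities for one chosen response (illustrative numbers).
oracle_lp = [-0.2, -1.5, -0.1, -2.3, -0.4, -0.9, -0.05, -1.1, -0.3, -0.6]
ref_lp    = [-0.3, -1.4, -0.8, -2.2, -0.5, -2.0, -0.30, -1.0, -0.4, -0.7]

rewards = token_rewards(oracle_lp, ref_lp)
key_positions = select_key_tokens(rewards, keep_ratio=0.3, prefer_high=True)

print("token rewards:", [round(r, 3) for r in rewards])
print("selected key-token positions (~30%):", key_positions)
# Only these positions would contribute to the preference loss on the target model.
```

Under this reading of the summary, only the selected positions (roughly the top 30% by estimated reward) would contribute to the preference loss on the target policy model, while all other tokens are left out of the optimization.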
Keywords
» Artificial intelligence » Alignment » Generalization » Optimization » Token