
Summary of T-REG: Preference Optimization with Token-Level Reward Regularization, by Wenxuan Zhou et al.


T-REG: Preference Optimization with Token-Level Reward Regularization

by Wenxuan Zhou, Shujian Zhang, Lingxiao Zhao, Tao Meng

First submitted to arxiv on: 3 Dec 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The proposed token-level reward regularization (T-REG) approach leverages the self-refinement capabilities of large language models (LLMs) to generate token-level rewards. Using contrastive prompting, the LLM produces its own token-level reward signal, which then acts as a regularizer during preference optimization and guides the model to distribute the sequence-level reward across individual tokens. T-REG consistently outperforms baseline methods on the instruction-following benchmarks Alpaca Eval 2 and Arena-Hard, with improvements of up to 3.8% and 4.4%, respectively. A rough code sketch of this regularization idea appears after the summaries below.
Low Difficulty Summary (original content by GrooveSquid.com)
Large language models need help understanding what we want from them. Currently, they’re given a single reward for the whole response, which isn’t very informative. Some methods try to improve this by giving rewards for individual words, but they need extra trained models or AI annotators. This paper proposes a new way of giving rewards called token-level reward regularization (T-REG). It uses something called contrastive prompting, which lets the model figure out for itself how to assign rewards to individual words. This helps the model learn which parts of its answer mattered, so it follows instructions better.
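
The following is a minimal, illustrative PyTorch-style sketch of the general idea described in the medium summary, not the authors' exact objective: a standard DPO loss built from per-token log-probability ratios, plus an extra term that nudges those implicit token-level rewards toward self-generated token rewards (here simply passed in as tensors). The function name, the hyperparameter values, and the mean-squared-error form of the regularizer are assumptions for illustration.

```python
# Hedged sketch of token-level reward regularization on top of a DPO-style loss.
# NOT the authors' exact formulation: the regularizer form, weights, and the way
# self-generated token rewards are supplied are assumptions for illustration.

import torch
import torch.nn.functional as F


def dpo_with_token_reg(
    policy_chosen_logps,    # (batch, seq) per-token log-probs of chosen response under the policy
    policy_rejected_logps,  # (batch, seq) per-token log-probs of rejected response under the policy
    ref_chosen_logps,       # (batch, seq) per-token log-probs under the frozen reference model
    ref_rejected_logps,     # (batch, seq)
    chosen_token_rewards,   # (batch, seq) self-generated token rewards (e.g. via contrastive prompting)
    rejected_token_rewards, # (batch, seq)
    chosen_mask,            # (batch, seq) 1 for response tokens, 0 for prompt/padding
    rejected_mask,          # (batch, seq)
    beta=0.1,               # DPO temperature (assumed value)
    reg_weight=0.5,         # weight on the token-level regularizer (assumed value)
):
    # Implicit per-token rewards: log-prob ratio between policy and reference.
    chosen_token_logratio = (policy_chosen_logps - ref_chosen_logps) * chosen_mask
    rejected_token_logratio = (policy_rejected_logps - ref_rejected_logps) * rejected_mask

    # Standard sequence-level DPO loss from summed token log-ratios.
    chosen_seq = chosen_token_logratio.sum(dim=-1)
    rejected_seq = rejected_token_logratio.sum(dim=-1)
    dpo_loss = -F.logsigmoid(beta * (chosen_seq - rejected_seq)).mean()

    # Token-level regularization: push the implicit token rewards toward the
    # self-generated token rewards (mean-squared error over response tokens).
    def token_mse(logratio, target, mask):
        diff = (beta * logratio - target) * mask
        return diff.pow(2).sum() / mask.sum().clamp(min=1)

    reg = token_mse(chosen_token_logratio, chosen_token_rewards, chosen_mask) \
        + token_mse(rejected_token_logratio, rejected_token_rewards, rejected_mask)

    return dpo_loss + reg_weight * reg
```

In practice, the self-generated token rewards would come from prompting the model contrastively on the chosen and rejected responses; that step is outside the scope of this sketch, which only shows how such rewards could regularize the sequence-level preference objective.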

Keywords

» Artificial intelligence  » Alignment  » Prompting  » Regularization  » Token