Summary of Reward Modeling with Ordinal Feedback: Wisdom of the Crowd, by Shang Liu et al.
Reward Modeling with Ordinal Feedback: Wisdom of the Crowd
by Shang Liu, Yu Pan, Guanting Chen, Xiaocheng Li
First submitted to arxiv on: 19 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper's original abstract |
Medium | GrooveSquid.com (original content) | Learning reward models (RMs) from human preference data is a key step in aligning large language models (LLMs). The canonical setup, based on the Bradley-Terry (BT) model, uses only binary feedback, discarding samples with uncertain preferences and the fine-grained information they carry. This paper proposes a framework for learning RMs under ordinal feedback, which generalizes binary preferences to any granularity (e.g., "slightly better" or "tie"). The authors identify a marginal unbiasedness condition on the feedback, justify it via the wisdom-of-the-crowd idea, develop a natural probability model satisfying it, and prove a statistical benefit: ordinal feedback reduces the Rademacher complexity of the learning problem relative to binary feedback. The learning objective and theory extend to hinge loss and direct preference optimization (DPO). Numerical experiments show that fine-grained feedback leads to better reward learning in both in-distribution and out-of-distribution settings. (A minimal loss sketch follows this table.) |
Low | GrooveSquid.com (original content) | This paper is about teaching computers what makes one answer better than another. Right now, we usually tell them only which of two answers is better, which leaves out useful detail. The researchers propose a way to give more nuanced feedback, such as "slightly better" or "a tie," which helps the computer learn a more accurate sense of quality. |
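
To make the idea concrete, here is a minimal sketch of how ordinal feedback could enter a Bradley-Terry-style reward-modeling objective as a soft label. This is an illustrative assumption, not the authors' implementation: the label values (e.g., 0.75 for "slightly better") and the `ordinal_bt_loss` helper are hypothetical.

```python
# Minimal sketch (assumed, not from the paper): soft-label Bradley-Terry loss
# where an ordinal label in [0, 1] encodes how strongly response A is
# preferred over response B (1.0 = "much better", 0.75 = "slightly better",
# 0.5 = "tie"). With hard 0/1 labels this reduces to the standard BT objective.
import torch
import torch.nn.functional as F

def ordinal_bt_loss(reward_a: torch.Tensor,
                    reward_b: torch.Tensor,
                    ordinal_label: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the ordinal label and the BT win probability."""
    logits = reward_a - reward_b  # BT logit: P(A beats B) = sigmoid(logits)
    return F.binary_cross_entropy_with_logits(logits, ordinal_label)

# Toy usage with scalar rewards for a batch of three comparison pairs.
reward_a = torch.tensor([1.2, 0.3, -0.5])
reward_b = torch.tensor([0.4, 0.9, -0.1])
labels = torch.tensor([1.0, 0.75, 0.5])  # "much better", "slightly better", "tie"
print(ordinal_bt_loss(reward_a, reward_b, labels).item())
```

The design choice here is simply to replace the hard 0/1 target in the binary cross-entropy with a soft target, so finer-grained human judgments enter the loss without changing the model or the optimizer.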
Keywords
» Artificial intelligence » Hinge loss » Optimization » Probability