Summary of Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison, by Judy Hanwen Shen et al.
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison by Judy Hanwen Shen, Archit Sharma, Jun…
Scores as Actions: a framework of fine-tuning diffusion models by continuous-time reinforcement learning by Hanyang Zhao,…
Semi-Supervised Reward Modeling via Iterative Self-Training by Yifei He, Haoxiang Wang, Ziyan Jiang, Alexandros Papangelis, Han…
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation by Wei Shen, Chuheng Zhang. First submitted…
AGR: Age Group fairness Reward for Bias Mitigation in LLMs by Shuirong Cao, Ruoxi Cheng, Zhiqiang…
On the Limited Generalization Capability of the Implicit Reward Model Induced by Direct Preference Optimization by…
A Gradient Analysis Framework for Rewarding Good and Penalizing Bad Examples in Language Models by Yi-Lin…
UNA: Unifying Alignments of RLHF/PPO, DPO and KTO by a Generalized Implicit Reward Function by Zhichao…
SEAL: Systematic Error Analysis for Value ALignment by Manon Revel, Matteo Cargnelutti, Tyna Eloundou, Greg Leppert. First…
Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning by Sriyash Poddar, Yanming Wan, Hamish…