Summary of MallowsPO: Fine-Tune Your LLM with Preference Dispersions, by Haoxian Chen et al.
MallowsPO: Fine-Tune Your LLM with Preference Dispersions, by Haoxian Chen, Hanyang Zhao, Henry Lam, David Yao,…
SimPO: Simple Preference Optimization with a Reference-Free Reward, by Yu Meng, Mengzhou Xia, Danqi Chen. First submitted…
DLPO: Diffusion Model Loss-Guided Reinforcement Learning for Fine-Tuning Text-to-Speech Diffusion Models, by Jingyi Chen, Ju-Seung Byun,…
Online Self-Preferring Language Models, by Yuanzhao Zhai, Zhuo Zhang, Kele Xu, Hanyang Peng, Yue Yu, Dawei…
LIRE: listwise reward enhancement for preference alignment, by Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo,…
A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback, by Kihyun…
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, by Jian Hu, Xibin Wu, Zilin Zhu, Xianyu,…
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback, by Ruitao Chen, Liwei…
Understanding the performance gap between online and offline alignment algorithms, by Yunhao Tang, Daniel Zhaohan Guo,…
RLHF Workflow: From Reward Modeling to Online RLHF, by Hanze Dong, Wei Xiong, Bo Pang, Haoxiang…