Summary of SimPO: Simple Preference Optimization with a Reference-Free Reward, by Yu Meng et al.
SimPO: Simple Preference Optimization with a Reference-Free Reward, by Yu Meng, Mengzhou Xia, Danqi Chen. First submitted…
LIRE: listwise reward enhancement for preference alignment, by Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo,…
A Unified Linear Programming Framework for Offline Reward Learning from Human Demonstrations and Feedback, by Kihyun…
OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework, by Jian Hu, Xibin Wu, Zilin Zhu, Xianyu,…
The Power of Active Multi-Task Learning in Reinforcement Learning from Human Feedback, by Ruitao Chen, Liwei…
Understanding the performance gap between online and offline alignment algorithms, by Yunhao Tang, Daniel Zhaohan Guo,…
RLHF Workflow: From Reward Modeling to Online RLHF, by Hanze Dong, Wei Xiong, Bo Pang, Haoxiang…
Open Challenges and Opportunities in Federated Foundation Models Towards Biomedical Healthcare, by Xingyu Li, Lu Peng,…
MetaRM: Shifted Distributions Alignment via Meta-Learning, by Shihan Dou, Yan Liu, Enyu Zhou, Tianlong Li, Haoxiang…
DPO Meets PPO: Reinforced Token Optimization for RLHF, by Han Zhong, Zikang Shan, Guhao Feng, Wei…