Summary of Filtered Direct Preference Optimization, by Tetsuro Morimura et al.
Filtered Direct Preference Optimization, by Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, Kaito Ariu. First submitted…
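The page itself does not reproduce the paper's method, so the sketch below only shows the standard DPO objective that filtered DPO (fDPO) builds on, plus a hypothetical reward-based filtering pass over the preference dataset. The `filter_pairs` helper, its `reward_fn` argument, and the `threshold` value are illustrative assumptions, not the paper's actual filtering criterion; this is a minimal sketch, not the authors' implementation.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for a batch of preference pairs.

    Each argument is an array of summed token log-probabilities of the
    chosen / rejected responses under the policy and the frozen reference model.
    """
    # Implicit reward margin: beta * (policy-vs-reference log-ratio of chosen
    # minus that of rejected).
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Mean of -log(sigmoid(logits)), computed stably as log(1 + exp(-logits)).
    return np.mean(np.log1p(np.exp(-logits)))

def filter_pairs(pairs, reward_fn, threshold=0.0):
    """Hypothetical filtering step: keep only pairs whose chosen response
    scores above a reward threshold before running DPO on the remainder.
    The criterion used by fDPO may differ; this is a placeholder."""
    return [p for p in pairs if reward_fn(p["prompt"], p["chosen"]) > threshold]
```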
Related papers:
Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback, by Vincent Conitzer, Rachel…
RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs, by Shreyas Chaudhari,…
Investigating Regularization of Self-Play Language Models, by Reda Alami, Abdalgader Abubaker, Mastane Achab, Mohamed El Amine…
Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences, by Corby Rosset, Ching-An Cheng,…
Fine-Tuning Language Models with Reward Learning on Policy, by Hao Lang, Fei Huang, Yongbin Li. First submitted…
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization, by Shengyi…
Parameter Efficient Reinforcement Learning from Human Feedback, by Hakim Sidahmed, Samrat Phatale, Alex Hutcheson, Zhuonan Lin,…
HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback, by Ang Li,…
ALaRM: Align Language Models via Hierarchical Rewards Modeling, by Yuhang Lai, Siyuan Wang, Shujun Liu, Xuanjing…