Summary of Online Bandit Learning with Offline Preference Data, by Akhil Agnihotri et al.
Online Bandit Learning with Offline Preference Data, by Akhil Agnihotri, Rahul Jain, Deepak Ramachandran, Zheng Wen. First…
It Takes Two: On the Seamlessness between Reward and Policy Model in RLHF, by Taiming Lu,…
OPTune: Efficient Online Preference Tuning, by Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia,…
Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis, by Qining Zhang,…
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, by Rafael Rafailov, Yaswanth Chittepu, Ryan…
Aligning Large Language Models via Fine-grained Supervision, by Dehong Xu, Liang Qiu, Minseok Kim, Faisal Ladhak,…
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback, by Ilgee Hong, Zichong Li, Alexander Bukharin,…
Self-Improving Robust Preference Optimization, by Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar. First…
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF, by Tengyang Xie, Dylan J. Foster, Akshay…
Group Robust Preference Optimization in Reward-free RLHF, by Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj…