Summary of Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, by Rafael Rafailov et al.
Scaling Laws for Reward Model Overoptimization in Direct Alignment Algorithms, by Rafael Rafailov, Yaswanth Chittepu, Ryan…
Adaptive Preference Scaling for Reinforcement Learning with Human Feedback, by Ilgee Hong, Zichong Li, Alexander Bukharin,…
Self-Improving Robust Preference Optimization, by Eugene Choi, Arash Ahmadian, Matthieu Geist, Olivier Pietquin, Mohammad Gheshlaghi Azar…
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF, by Tengyang Xie, Dylan J. Foster, Akshay…
Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads, by…
One-Shot Safety Alignment for Large Language Models via Optimal Dualization, by Xinmeng Huang, Shuo Li, Edgar…
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF, by Shicong Cen, Jincheng Mei,…
Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales, by Ju-Seung Byun,…
On the Algorithmic Bias of Aligning Large Language Models with RLHF: Preference Collapse and Matching…
Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer, by Zhihan Liu,…