Summary of Sharp Analysis for KL-Regularized Contextual Bandits and RLHF, by Heyang Zhao and Chenlu Ye and Quanquan Gu and Tong Zhang
Sharp Analysis for KL-Regularized Contextual Bandits and RLHF by Heyang Zhao, Chenlu Ye, Quanquan Gu, Tong Zhang
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF by Atoosa Chegini, Hamid Kazemi, Iman Mirzadeh, …
Towards Reliable Alignment: Uncertainty-aware RLHF by Debangshu Banerjee, Aditya Gopalan. First submitted to arXiv on: 31 Oct…
RA-PbRL: Provably Efficient Risk-Aware Preference-Based Reinforcement Learning by Yujie Zhao, Jose Efraim Aguilar Escamill, Weyl Lu, …
COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences by Yixin Liu, Argyris Oikonomou, Weiqiang…
VPO: Leveraging the Number of Votes in Preference Optimization by Jae Hyeon Cho, Minkyung Park, Byung-Jun…
Uncertainty-Penalized Direct Preference Optimization by Sam Houliston, Alizée Pace, Alexander Immer, Gunnar Rätsch. First submitted to arXiv…
Diff-Instruct++: Training One-step Text-to-image Generator Model to Align with Human Preferences by Weijian Luo. First submitted to…
Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models by Michael Noukhovitch, Shengyi Huang, …
Optimal Design for Reward Modeling in RLHF by Antoine Scheid, Etienne Boursier, Alain Durmus, Michael I.…