Summary of "Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking", by Cassidy Laidlaw et al.
Correlated Proxies: A New Definition and Improved Mitigation for Reward Hacking by Cassidy Laidlaw, Shivam Singhal,…
Enhancing LLM Safety via Constrained Direct Preference Optimization by Zixuan Liu, Xiaolin Sun, Zizhan Zheng. First submitted…
Reward Model Learning vs. Direct Policy Optimization: A Comparative Analysis of Learning from Human Preferences by Andi…
Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards by Haoxiang…
CogBench: a large language model walks into a psychology lab by Julian Coda-Forno, Marcel Binz, Jane…
When Your AIs Deceive You: Challenges of Partial Observability in Reinforcement Learning from Human Feedback by…
Leveraging Domain Knowledge for Efficient Reward Modelling in RLHF: A Case-Study in E-Commerce Opinion Summarization by…
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs by Arash…
Advancing Translation Preference Modeling with RLHF: A Step Towards Cost-Effective Solution by Nuo Xu, Jun Zhao,…
Active Preference Optimization for Sample Efficient RLHF by Nirjhar Das, Souradip Chakraborty, Aldo Pacchiano, Sayak Ray…