Summary of Reward Difference Optimization For Sample Reweighting in Offline RLHF, by Shiqi Wang et al.
Reward Difference Optimization For Sample Reweighting In Offline RLHF by Shiqi Wang, Zhengze Zhang, Rui Zhao,…
A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case by Sonia…
Model Surgery: Modulating LLM’s Behavior Via Simple Parameter Editing by Huanqian Wang, Yang Yue, Rui Lu,…
Predicting vs. Acting: A Trade-off Between World Modeling & Agent Modeling by Margaret Li, Weijia Shi, Artidoro…
Towards Comprehensive Preference Data Collection for Reward Modeling by Yulan Hu, Qingyang Li, Sheng Ouyang, Ge…
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation by Xuan He, Dongfu…
Aqulia-Med LLM: Pioneering Full-Process Open-Source Medical Language Models by Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua…
Toward Optimal LLM Alignments Using Two-Player Games by Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang,…
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs by Rui Yang, Ruomeng Ding, Yong…
Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment by Chenliang Li, Siliang Zeng,…