Summary of Online Learning from Strategic Human Feedback in LLM Fine-Tuning, by Shugang Hao and Lingjie Duan
Online Learning from Strategic Human Feedback in LLM Fine-Tuning, by Shugang Hao, Lingjie Duan. First submitted to…
Cannot or Should Not? Automatic Analysis of Refusal Composition in IFT/RLHF Datasets and Refusal Behavior…
CareBot: A Pioneering Full-Process Open-Source Medical Language Model, by Lulu Zhao, Weihao Zeng, Xiaofeng Shi, Hua…
Linear Probe Penalties Reduce LLM Sycophancy, by Henry Papadatos, Rachel Freedman. First submitted to arxiv on: 1…
R3HF: Reward Redistribution for Enhancing Reinforcement Learning from Human Feedback, by Jiahui Li, Tai-wei Chang, Fengda…
Combining Theory of Mind and Kindness for Self-Supervised Human-AI Alignment, by Joshua T. S. Hewson. First submitted…
Self-Evolved Reward Learning for LLMs, by Chenghua Huang, Zhizhen Fan, Lu Wang, Fangkai Yang, Pu Zhao,…
Evolving Alignment via Asymmetric Self-Play, by Ziyu Ye, Rishabh Agarwal, Tianqi Liu, Rishabh Joshi, Sarmishta Velury,…
Improving Small-Scale Large Language Models Function Calling for Reasoning Tasks, by Graziano A. Manduzio, Federico A.…
Cross-lingual Transfer of Reward Models in Multilingual Alignment, by Jiwoo Hong, Noah Lee, Rodrigo Martínez-Castaño, César…