Summary of COPR: Continual Learning Human Preference through Optimal Policy Regularization, by Han Zhang et al.
COPR: Continual Learning Human Preference through Optimal Policy Regularization by Han Zhang, Lin Gui, Yuanzhao Zhai,…
Mitigating the Alignment Tax of RLHF by Yong Lin, Hangyu Lin, Wei Xiong, Shizhe Diao, Jianmeng…