Summary of Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference, by Qining Zhang et al.
Zeroth-Order Policy Gradient for Reinforcement Learning from Human Feedback without Reward Inference by Qining Zhang, Lei…