Summary of Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack, by Leo McKee-Reid et al.
Honesty to Subterfuge: In-Context Reinforcement Learning Can Make Honest Models Reward Hack by Leo McKee-Reid, Christoph…