Summary of Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, by Carson Denison et al.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models, by Carson Denison, Monte MacDiarmid, Fazl Barez,…