Summary of Subversion Strategy Eval: Can Language Models Statelessly Strategize to Subvert Control Protocols?, by Alex Mallen et al.
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols? by Alex Mallen, Charlie…