Summary of Best-of-N Jailbreaking, by John Hughes et al.
Best-of-N Jailbreaking
by John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma
First submitted to arxiv on: 4 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The proposed Best-of-N (BoN) Jailbreaking algorithm is a simple black-box method that successfully attacks frontier AI systems across various modalities. By repeatedly sampling variations of prompts with augmentations such as random shuffling or capitalization, BoN achieves high attack success rates on closed-source language models like GPT-4o and Claude 3.5 Sonnet. The algorithm also circumvents state-of-the-art open-source defenses like circuit breakers and extends to other modalities, such as vision and audio language models. Furthermore, the attack’s effectiveness improves with more sampled prompts, following power-law-like behavior. BoN can also be combined with other black-box algorithms for even more effective attacks. |
| Low | GrooveSquid.com (original content) | Best-of-N (BoN) Jailbreaking is a new way to test AI systems. It works by changing small parts of what we ask an AI system and seeing if it does something bad. This method is very good at breaking closed-source language models like GPT-4o and Claude 3.5 Sonnet, with success rates as high as 89% and 78%, respectively. BoN also works well against open-source defenses and can be used to test other AI systems that understand pictures or sound. The more times we try this method, the better it gets at breaking the AI system. |
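To make the sampling loop described above concrete, here is a minimal sketch of a Best-of-N-style attack loop in Python. The augmentations (random capitalization and character shuffling within words) follow the kinds described in the summary; the exact augmentation set, probabilities, and the `query_model` and `is_harmful` callables are assumptions for illustration, not the authors' implementation.

```python
import random

def augment(prompt: str, rng: random.Random) -> str:
    """Apply simple character-level augmentations (illustrative, not the
    paper's exact set): scramble interior characters of longer words and
    randomly flip character case."""
    out = []
    for word in prompt.split():
        chars = list(word)
        # Shuffle the middle of longer words with some probability.
        if len(chars) > 3 and rng.random() < 0.4:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Randomly capitalize individual characters.
        chars = [c.upper() if rng.random() < 0.3 else c.lower() for c in chars]
        out.append("".join(chars))
    return " ".join(out)

def best_of_n_jailbreak(prompt, query_model, is_harmful, n=100, seed=0):
    """Sample up to n augmented prompts; return the first that elicits a
    harmful response, plus how many samples it took, or None if none do.
    `query_model` and `is_harmful` are hypothetical callables supplied by
    the caller (model API wrapper and harmfulness classifier)."""
    rng = random.Random(seed)
    for i in range(n):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response, i + 1
    return None
```

Because each sample is independent, attack success as a function of N grows with the number of draws, which is consistent with the power-law-like scaling the summary mentions.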
Keywords
» Artificial intelligence » Claude » GPT