Summary of Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues, by Qibing Ren et al.
Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues
by Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao
First submitted to arXiv on: 14 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This study uncovers the vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, revealing how malicious users can conceal harmful intents across multiple queries. The researchers introduce ActorAttack, a novel method that models a network of semantically linked actors as attack clues to generate diverse and effective paths towards harmful targets. This approach addresses two key challenges: it conceals harmful intents by creating an innocuous conversation topic about the actor, and it uncovers diverse attack paths towards the same target by leveraging the LLM’s own knowledge to specify correlated actors as attack clues. ActorAttack outperforms existing methods in both single-turn and multi-turn attacks across advanced aligned LLMs, including GPT-o1. The study also introduces SafeMTData, a dataset of multi-turn adversarial prompts and safety alignment data generated by ActorAttack, which can be used to tune models for improved robustness against multi-turn attacks. |
Low | GrooveSquid.com (original content) | This paper is about how big language models can be tricked into doing bad things when people ask them questions in a tricky way. The researchers created a new method called ActorAttack that helps find ways to make the models do what you want, even if it’s not good. They tested this method on some big language models and found that it works really well! The paper also includes a special dataset that other people can use to make the models better at handling tricky questions. |
Keywords
» Artificial intelligence » Alignment » GPT