Loading Now

Summary of Derail Yourself: Multi-turn Llm Jailbreak Attack Through Self-discovered Clues, by Qibing Ren et al.


Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues

by Qibing Ren, Hao Li, Dongrui Liu, Zhanxu Xie, Xiaoya Lu, Yu Qiao, Lei Sha, Junchi Yan, Lizhuang Ma, Jing Shao

First submitted to arxiv on: 14 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

     Abstract of paper      PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
This study uncovers the vulnerabilities of Large Language Models (LLMs) in multi-turn interactions, revealing how malicious users can conceal harmful intents across multiple queries. The researchers introduce ActorAttack, a novel method that models a network of semantically linked actors as attack clues to generate diverse and effective paths towards harmful targets. This approach addresses two key challenges: concealing intents by creating an innocuous conversation topic about the actor, and uncovering various attack paths towards the same target by leveraging LLMs’ knowledge to specify correlated actors as attack clues. ActorAttack outperforms existing methods in single-turn and multi-turn attacks across advanced aligned LLMs, including GPT-o1. The study also introduces a dataset called SafeMTData, which includes multi-turn adversarial prompts and safety alignment data generated by ActorAttack. This dataset can be used to tune models for improved robustness against multi-turn attacks.
Low GrooveSquid.com (original content) Low Difficulty Summary
This paper is about how big language models can be tricked into doing bad things when people ask them questions in a tricky way. The researchers created a new method called ActorAttack that helps find ways to make the models do what you want, even if it’s not good. They tested this method on some big language models and found out that it works really well! The paper also includes a special dataset that other people can use to make the models better at handling tricky questions.

Keywords

» Artificial intelligence  » Alignment  » Gpt