
Summary of LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet, by Nathaniel Li et al.


LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet

by Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue

First submitted to arXiv on: 27 Aug 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR); Computers and Society (cs.CY)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
Recent large language model (LLM) defenses have greatly improved models' ability to refuse harmful queries, even under adversarial attack. However, these defenses are primarily evaluated against single-turn automated attacks, an insufficient threat model for real-world malicious use. Our study demonstrates that multi-turn human jailbreaks reveal significant vulnerabilities in LLM defenses, achieving an attack success rate (ASR) of over 70% on HarmBench against defenses that report single-digit ASRs with automated single-turn attacks. Additionally, we uncover vulnerabilities in machine unlearning defenses, successfully recovering dual-use biosecurity knowledge from unlearned models. We compile these results into the Multi-Turn Human Jailbreaks (MHJ) dataset, comprising 2,912 prompts across 537 multi-turn jailbreaks, and publicly release MHJ alongside a compendium of jailbreak tactics developed through dozens of commercial red teaming engagements, supporting research towards stronger LLM defenses.
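
To make the attack success rate figure concrete, here is a minimal Python sketch of how an ASR over multi-turn jailbreak attempts could be tallied. The record layout and the is_harmful judge below are hypothetical illustrations, not the paper's or HarmBench's actual evaluation pipeline.

```python
# Minimal, hypothetical sketch of tallying an attack success rate (ASR)
# over multi-turn jailbreak attempts. The record layout and the
# is_harmful() judge are illustrative stand-ins, not the paper's or
# HarmBench's actual evaluation pipeline.

from dataclasses import dataclass, field


@dataclass
class JailbreakAttempt:
    behavior_id: str                            # harmful behavior being targeted
    turns: list = field(default_factory=list)   # (user_prompt, model_response) pairs


def is_harmful(response: str) -> bool:
    """Hypothetical judge; in practice a trained classifier or human review."""
    return "HARMFUL" in response                # placeholder heuristic


def attack_success_rate(attempts: list[JailbreakAttempt]) -> float:
    """A behavior counts as broken if any turn of any attempt against it
    elicits a harmful response; ASR is the fraction of behaviors broken."""
    behaviors = {a.behavior_id for a in attempts}
    broken = {
        a.behavior_id
        for a in attempts
        if any(is_harmful(resp) for _, resp in a.turns)
    }
    return len(broken) / len(behaviors) if behaviors else 0.0


if __name__ == "__main__":
    demo = [
        JailbreakAttempt("b1", [("benign opener", "refusal"),
                                ("follow-up probe", "HARMFUL details")]),
        JailbreakAttempt("b2", [("direct ask", "refusal")]),
    ]
    print(f"ASR: {attack_success_rate(demo):.0%}")  # -> ASR: 50%
```

The key point of the multi-turn setting is that a behavior counts as elicited if any turn in the conversation produces harmful content, which is why persistent human red teamers can reach much higher ASRs than single-turn automated attacks.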
Low Difficulty Summary (original content by GrooveSquid.com)
This study shows that big language models can be tricked into answering harmful questions, even when defenses try to stop them. The current ways to defend these models are not good enough because they are only tested against simple, one-step attacks. Our research finds that when people keep trying over multiple conversation turns, the models can still be tricked more than 70% of the time. We also found flaws in a way to “unlearn” harmful information from these models, which means that information can still be recovered. We’re sharing our results and the tactics we used so others can build better defenses.

Keywords

» Artificial intelligence  » Large language model