Summary of Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks, by Yixin Cheng et al.
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
by Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
First submitted to arXiv on: 14 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv |
Medium | GrooveSquid.com (original content) | This paper proposes Contextual Interaction Attack, a new attack method against Large Language Models (LLMs). Drawing inspiration from Chomsky’s transformational-generative grammar theory and from human practices, the authors develop an indirect approach to eliciting harmful information: a sequence of benign preliminary questions in a multi-turn interaction builds a context aligned with the attack query, exploiting the autoregressive nature of LLMs. Experiments on seven different LLMs demonstrate the efficacy of this black-box attack, which also transfers across models. The work contributes to understanding LLM security and has implications for developing robust defenses. |
Low | GrooveSquid.com (original content) | This paper shows how attackers can get harmful information from Large Language Models (LLMs) by first asking a series of harmless questions that set the stage for the question they really want answered. Because the harmful request is never made directly, the attack is harder to detect. The authors tested the method on seven different LLMs and found that it works well. This research can help us understand how to keep LLMs safe from attacks like these. |
Keywords
* Artificial intelligence
* Autoregressive