Summary of Agentpoison: Red-teaming Llm Agents Via Poisoning Memory or Knowledge Bases, by Zhaorun Chen et al.
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
by Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li
First submitted to arxiv on: 17 Jul 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The proposed AgentPoison red teaming approach targets generic and retrieval-augmented generation-based large language model (LLM) agents by poisoning their long-term memory or knowledge base, enabling backdoor attacks to manipulate the models’ behavior. By optimizing triggers through constrained optimization, the attack ensures high probability of retrieving malicious demonstrations when a user instruction contains the trigger. This novel approach requires no additional model training or fine-tuning and exhibits superior transferability, in-context coherence, and stealthiness. AgentPoison is demonstrated to be effective against three real-world LLM agents, achieving an average attack success rate higher than 80% with minimal impact on benign performance. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary AgentPoison is a new way to trick large language models by poisoning their memories or knowledge bases. This makes the models do things they weren’t supposed to do. The approach works by creating special triggers that make the models behave in a certain way when used. It’s clever because it doesn’t require any extra training and can be very sneaky. Researchers tested this attack on three real-life language models and found that it was very effective, making them do things they weren’t supposed to do over 80% of the time. |
Keywords
» Artificial intelligence » Fine tuning » Knowledge base » Large language model » Optimization » Probability » Retrieval augmented generation » Transferability