
Summary of AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases, by Zhaorun Chen et al.


AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

by Zhaorun Chen, Zhen Xiang, Chaowei Xiao, Dawn Song, Bo Li

First submitted to arXiv on: 17 Jul 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Cryptography and Security (cs.CR); Information Retrieval (cs.IR)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The proposed AgentPoison red-teaming approach targets generic and retrieval-augmented generation (RAG)-based large language model (LLM) agents by poisoning their long-term memory or knowledge base, enabling backdoor attacks that manipulate the agents’ behavior. The attack optimizes triggers through constrained optimization so that, whenever a user instruction contains the trigger, malicious demonstrations are retrieved with high probability (a toy code sketch of this retrieval-poisoning idea follows the summaries below). The approach requires no additional model training or fine-tuning and exhibits superior transferability, in-context coherence, and stealthiness. AgentPoison is demonstrated against three real-world LLM agents, achieving an average attack success rate above 80% with minimal impact on benign performance.

Low Difficulty Summary (written by GrooveSquid.com, original content)
AgentPoison is a new way to trick large language model agents by poisoning their memories or knowledge bases. This makes the agents do things they weren’t supposed to do. The attack works by creating special triggers: when a trigger appears in a user’s request, the agent retrieves the attacker’s poisoned examples and follows them. It’s clever because it doesn’t require any extra training and can be very sneaky. Researchers tested this attack on three real-world LLM agents and found it very effective, making them misbehave more than 80% of the time.

Keywords

» Artificial intelligence  » Fine tuning  » Knowledge base  » Large language model  » Optimization  » Probability  » Retrieval augmented generation  » Transferability