
Summary of InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models, by Hao Li et al.


InjecGuard: Benchmarking and Mitigating Over-defense in Prompt Injection Guardrail Models

by Hao Li, Xiaogeng Liu

First submitted to arXiv on: 30 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper introduces NotInject, a new evaluation dataset designed to measure over-defense in prompt guard models. Over-defense occurs when these models incorrectly flag benign inputs as malicious because of trigger-word bias; a small evaluation sketch after these summaries illustrates the idea. The authors show that state-of-the-art models suffer from this issue, with accuracy on benign inputs dropping to near random-guessing levels (60%). To mitigate this, they propose InjecGuard, a novel prompt guard model trained with the Mitigating Over-defense for Free (MOF) strategy. InjecGuard achieves state-of-the-art performance on diverse benchmarks, including NotInject, outperforming existing models by 30.8%. The authors release their code and datasets at this GitHub URL.
Low Difficulty Summary (written by GrooveSquid.com, original content)
Prompt injection attacks are a major threat to large language models (LLMs). These attacks can hijack a model's goals or leak sensitive data. To defend against them, prompt guard models were developed. However, these models often over-defend, mistakenly flagging normal inputs as malicious because of trigger-word bias. The authors introduce NotInject, a new dataset that measures this over-defense. They also propose InjecGuard, a better prompt guard model that reduces bias on trigger words. InjecGuard is more accurate than existing models and can help keep LLMs safe.

Keywords

» Artificial intelligence  » Prompt