Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
by Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
First submitted to arXiv on: 3 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper tackles the pressing issue of backdoor attacks on generative large language models (LLMs), which output high-dimensional token logits rather than discrete labels. Building on observations about the frequency space of sample-wise gradients, the researchers propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages gradients in the frequency space to identify backdoor samples without retraining the LLM. Experimental results show that GraCeFul significantly outperforms baselines, achieving 100% recall and F1 scores in identifying backdoor samples while reducing the average success rate of backdoor attacks to 0%. The approach generalizes to multiple free-style question answering datasets and across models, including Llama-2 and Vicuna, and GraCeFul exhibits remarkable computational efficiency. |
| Low | GrooveSquid.com (original content) | This research paper is about keeping large language models safe from attackers who try to make them give wrong answers. These models are very good at generating text that looks real, but bad actors can plant hidden triggers in the training data that make them say false things. The scientists behind this project noticed something distinctive about how these models learn and used it to create a new way to find the poisoned training samples. They called it GraCeFul. In tests, GraCeFul was very good at finding the poisoned samples and better than other methods at keeping the model from giving wrong answers. |
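The core idea described above (move per-sample gradients into the frequency space, cluster the spectra, and flag the tight minority cluster as suspected backdoor samples) can be sketched on synthetic data. This is a minimal illustration, not the paper's implementation: the "gradients" are simulated vectors, and the tiny 2-means clusterer is a stand-in for whatever clustering GraCeFul actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-sample gradient vectors (hypothetical data;
# the paper computes real gradients from a generative LLM).
dim, n_clean, n_bd = 64, 80, 20
clean = rng.normal(size=(n_clean, dim))
# Backdoor samples share a common low-frequency pattern plus small noise.
trigger = 5.0 * np.sin(2 * np.pi * 3 * np.arange(dim) / dim)
backdoor = trigger + 0.3 * rng.normal(size=(n_bd, dim))
grads = np.vstack([clean, backdoor])  # backdoor samples are indices 80..99

def filter_backdoor(grads):
    """Flag the indices of suspected backdoor samples."""
    # Move gradients to the frequency space via magnitude spectra.
    spectra = np.abs(np.fft.rfft(grads, axis=1))
    # Tiny 2-means: initialize centroids at the two most distant samples.
    d = np.linalg.norm(spectra[:, None] - spectra[None, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centroids = spectra[[i, j]]
    for _ in range(10):
        labels = np.argmin(
            np.linalg.norm(spectra[:, None] - centroids[None, :], axis=2),
            axis=1)
        centroids = np.array([spectra[labels == k].mean(axis=0) for k in (0, 1)])
    # Backdoor gradients cluster tightly together; flag the smaller cluster.
    minority = int(np.bincount(labels).argmin())
    return np.where(labels == minority)[0]

flagged = filter_backdoor(grads)
```

On this toy data the shared trigger produces a large spike in one frequency bin, so the spectra separate cleanly and the minority cluster recovers exactly the planted samples; the filtered set can then be dropped before fine-tuning, with no retraining of the model itself.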
Keywords
» Artificial intelligence » Clustering » Llama » Logits » Question answering » Recall » Token