Gracefully Filtering Backdoor Samples for Generative Large Language Models without Retraining
by Zongru Wu, Pengzhou Cheng, Lingyong Fang, Zhuosheng Zhang, Gongshen Liu
First submitted to arXiv on: 3 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper tackles the pressing issue of backdoor attacks on generative large language models (LLMs), which output high-dimensional token logits rather than discrete labels. Building on observations about the frequency space of sample-wise gradients, the researchers propose Gradient Clustering in the Frequency Space for Backdoor Sample Filtering (GraCeFul), which leverages gradients in the frequency space to identify backdoor samples without retraining the LLM. Experimental results show that GraCeFul significantly outperforms baselines, achieving 100% recall and F1 scores in identifying backdoor samples while reducing the average success rate of backdoor attacks to 0%. The approach generalizes to multiple free-style question answering datasets and across models, including Llama-2 and Vicuna, and GraCeFul exhibits remarkable computational efficiency. |
| Low | GrooveSquid.com (original content) | This research paper is about keeping large language models safe from attackers who try to make them give wrong answers. These models are very good at generating text that looks real, but bad actors can plant hidden triggers in the training data that make them say false things. The scientists behind this project noticed something distinctive about how these models learn and used it to create a new way to find the poisoned training samples. They called it GraCeFul. In tests, GraCeFul was very good at finding the poisoned samples and better than other methods at keeping the model from giving wrong answers. |
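The core idea described above (move per-sample gradients into the frequency space, cluster the spectra, and flag the tight minority cluster as suspected backdoor samples) can be sketched on synthetic data. This is a minimal illustration, not the paper's implementation: the "gradients" are simulated vectors, and the tiny 2-means clusterer is a stand-in for whatever clustering GraCeFul actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-sample gradient vectors (hypothetical data;
# the paper computes real gradients from a generative LLM).
dim, n_clean, n_bd = 64, 80, 20
clean = rng.normal(size=(n_clean, dim))
# Backdoor samples share a common low-frequency pattern plus small noise.
trigger = 5.0 * np.sin(2 * np.pi * 3 * np.arange(dim) / dim)
backdoor = trigger + 0.3 * rng.normal(size=(n_bd, dim))
grads = np.vstack([clean, backdoor])  # backdoor samples are indices 80..99

def filter_backdoor(grads):
    """Flag the indices of suspected backdoor samples."""
    # Move gradients to the frequency space via magnitude spectra.
    spectra = np.abs(np.fft.rfft(grads, axis=1))
    # Tiny 2-means: initialize centroids at the two most distant samples.
    d = np.linalg.norm(spectra[:, None] - spectra[None, :], axis=2)
    i, j = np.unravel_index(np.argmax(d), d.shape)
    centroids = spectra[[i, j]]
    for _ in range(10):
        labels = np.argmin(
            np.linalg.norm(spectra[:, None] - centroids[None, :], axis=2),
            axis=1)
        centroids = np.array([spectra[labels == k].mean(axis=0) for k in (0, 1)])
    # Backdoor gradients cluster tightly together; flag the smaller cluster.
    minority = int(np.bincount(labels).argmin())
    return np.where(labels == minority)[0]

flagged = filter_backdoor(grads)
```

On this toy data the shared trigger produces a large spike in one frequency bin, so the spectra separate cleanly and the minority cluster recovers exactly the planted samples; the filtered set can then be dropped before fine-tuning, with no retraining of the model itself.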
Keywords
» Artificial intelligence » Clustering » Llama » Logits » Question answering » Recall » Token