
Summary of The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs, by Bocheng Chen et al.


The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

by Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan

First submitted to arXiv on: 1 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates vulnerabilities in the alignment training of Large Language Models (LLMs). The researchers show that malicious users can exploit the human-feedback stage to mount user-guided poisoning attacks that degrade the model’s behavior on a specific trigger keyword. The attack crafts prompts that reliably elicit toxic responses and then uses the resulting feedback to corrupt the reward signal. Two prompt-crafting mechanisms are proposed, selection-based and generation-based, both aimed at controlling the model’s output (a conceptual sketch of the selection-based step follows the summaries below). By injecting just 1% of these malicious prompts into the feedback data, the paper demonstrates a significant increase in toxicity scores whenever the trigger word appears.
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research shows how big language models can be tricked into saying bad things. The models learn by looking at lots of text and getting feedback from users. But hackers can secretly teach the model to say mean or offensive things by giving it special prompts. These prompts make the model think it’s doing a good job, when really it’s writing bad stuff. The researchers found two ways to do this: one is to pick prompts that make the model say something bad, and the other is to create new prompts that can control what the model says. They showed that even a small number of these bad prompts can make the model write mean things.

Keywords

  • Artificial intelligence
  • Alignment