
Summary of The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs, by Bocheng Chen et al.


The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs

by Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan

First submitted to arXiv on: 1 Sep 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates vulnerabilities in the alignment training of Large Language Models (LLMs). The researchers show that malicious users can exploit the human-feedback stage to mount user-guided poisoning attacks that degrade the model’s behavior on a specific trigger keyword. The attack crafts prompts that reliably elicit toxic responses and then uses the resulting feedback to corrupt the reward signal. Two prompt-crafting mechanisms are proposed, selection-based and generation-based, both aimed at controlling the model’s output (a conceptual sketch of the selection-based step follows the summaries below). By injecting just 1% of these malicious prompts into the feedback data, the paper demonstrates a significant increase in toxicity scores whenever the trigger word appears.
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research shows how big language models can be tricked into saying bad things. The models learn by looking at lots of text and getting feedback from users. But hackers can secretly teach the model to say mean or offensive things by giving it special prompts. These prompts make the model think it’s doing a good job, when really it’s writing bad stuff. The researchers found two ways to do this: one is to pick prompts that make the model say something bad, and the other is to create new prompts that can control what the model says. They showed that even a small number of these bad prompts can make the model write mean things.

Keywords

  • Artificial intelligence
  • Alignment