Loading Now

Summary of Is Poisoning a Real Threat to Llm Alignment? Maybe More So Than You Think, by Pankayaraj Pathmanathan et al.


Is poisoning a real threat to LLM alignment? Maybe more so than you think

by Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang

First submitted to arxiv on: 17 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Cryptography and Security (cs.CR)

     Abstract of paper      PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty Written by Summary
High Paper authors High Difficulty Summary
Read the original abstract here
Medium GrooveSquid.com (original content) Medium Difficulty Summary
The abstract discusses recent advancements in Reinforcement Learning with Human Feedback (RLHF) for Large Language Models (LLMs). Specifically, it focuses on Direct Policy Optimization (DPO), a supervised learning framework that treats RLHF. The authors analyze the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning. They find that DPO is more vulnerable to poisoning than PPO-based methods, requiring only 0.5% of data to be poisoned to elicit harmful behavior. The paper investigates the potential reasons behind this vulnerability and its translation into backdoor vs non-backdoor attacks.
Low GrooveSquid.com (original content) Low Difficulty Summary
The paper looks at how well Reinforcement Learning with Human Feedback (RLHF) works for Large Language Models (LLMs). It focuses on a new way of doing RLHF called Direct Policy Optimization (DPO), which is different from what’s been used before. The authors want to see if DPO has weaknesses that bad actors could exploit. They try to poison the model with fake data and find that it’s easy to do, requiring only a small amount of data. This is a problem because it means someone could make the model do something bad by giving it just a little bit of bad information.

Keywords

» Artificial intelligence  » Optimization  » Reinforcement learning  » Rlhf  » Supervised  » Translation