Summary of Is Poisoning a Real Threat to Llm Alignment? Maybe More So Than You Think, by Pankayaraj Pathmanathan et al.

Is poisoning a real threat to LLM alignment? Maybe more so than you think

by Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang

First submitted to arxiv on: 17 Jun 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary The abstract discusses recent advancements in Reinforcement Learning with Human Feedback (RLHF) for Large Language Models (LLMs). Specifically, it focuses on Direct Policy Optimization (DPO), a supervised learning framework that treats RLHF. The authors analyze the vulnerabilities of DPO to poisoning attacks under different scenarios and compare the effectiveness of preference poisoning. They find that DPO is more vulnerable to poisoning than PPO-based methods, requiring only 0.5% of data to be poisoned to elicit harmful behavior. The paper investigates the potential reasons behind this vulnerability and its translation into backdoor vs non-backdoor attacks.
Low	GrooveSquid.com (original content)	Low Difficulty Summary The paper looks at how well Reinforcement Learning with Human Feedback (RLHF) works for Large Language Models (LLMs). It focuses on a new way of doing RLHF called Direct Policy Optimization (DPO), which is different from what’s been used before. The authors want to see if DPO has weaknesses that bad actors could exploit. They try to poison the model with fake data and find that it’s easy to do, requiring only a small amount of data. This is a problem because it means someone could make the model do something bad by giving it just a little bit of bad information.

Keywords

» Artificial intelligence » Optimization » Reinforcement learning » Rlhf » Supervised » Translation

Is poisoning a real threat to LLM alignment? Maybe more so than you think

by Pankayaraj Pathmanathan, Souradip Chakraborty, Xiangyu Liu, Yongyuan Liang, Furong Huang

Categories

GrooveSquid.com Paper Summaries

Keywords

Summary of Self-moe: Towards Compositional Large Language Models with Self-specialized Experts, by Junmo Kang et al.

Summary of Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis Using Slice Discovery Methods, by Vincent Olesen et al.

Related Posts