Summary of Does Refusal Training in Llms Generalize to the Past Tense?, by Maksym Andriushchenko and Nicolas Flammarion
Does Refusal Training in LLMs Generalize to the Past Tense?
by Maksym Andriushchenko, Nicolas Flammarion
First submitted to arxiv on: 16 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The paper investigates the effectiveness of refusal training in Large Language Models (LLMs) to prevent the generation of harmful outputs. It reveals a surprising “generalization gap” in current approaches, where simply reformulating a harmful request in the past tense can “jailbreak” many state-of-the-art LLMs. The authors evaluate this method on various models using GPT-3.5 Turbo as a reformulation model and find that it successfully jailbreaks models such as GPT-4o with an impressive 88% success rate. Interestingly, future tense reformulations are less effective, suggesting that refusal guardrails consider past historical questions more benign than hypothetical future questions. The study also explores fine-tuning GPT-3.5 Turbo and finds that defending against past reformulations is feasible when including past tense examples in the fine-tuning data. Overall, the paper highlights the brittleness of widely used alignment techniques such as SFT, RLHF, and adversarial training. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This study looks at how well language models can be stopped from generating harmful content. Researchers found that just changing a harmful question into a past tense question can trick many advanced language models into giving an answer. This means that current methods to stop language models from producing bad responses aren’t as effective as they seem. The study also shows that trying to change the question into a future tense one doesn’t work as well, and that adding examples of past tense questions during training can help make language models better at ignoring harmful prompts. |
Keywords
» Artificial intelligence » Alignment » Fine tuning » Generalization » Gpt » Rlhf