
Summary of FLRT: Fluent Student-Teacher Redteaming, by T. Ben Thompson and Michael Sklar (Confirm Labs)


FLRT: Fluent Student-Teacher Redteaming

by T. Ben Thompson, Michael Sklar

First submitted to arXiv on: 24 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.
Medium Difficulty Summary (original content by GrooveSquid.com)
The proposed method improves adversarial prompting techniques against safety-tuned language models by developing attacks that are both powerful and fluent. The technique centers on a distillation-based approach that encourages the victim model to emulate a toxified fine-tuned version of itself. To keep attacks human-fluent, multi-model perplexity and repetition penalties are added to the objective. The resulting process reliably jailbreaks difficult target models with prompts that resemble human-written text. On AdvBench, attack success rates exceed 93% for Llama-2-7B, Llama-3-8B, and Vicuna-7B, while model-measured perplexity stays below 33. The method also finds a single universally-optimized fluent prompt that induces compliance on previously unseen tasks across Llama-2-7B, Phi-3-mini, and Vicuna-7B.
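The weighted combination of a distillation loss with fluency penalties described above might be sketched as follows. This is an illustrative toy, not the authors' implementation: the function name, the weights `w_ppl` and `w_rep`, and the simple adjacent-repeat measure are all assumptions made for clarity.

```python
import math

def attack_objective(distill_loss, token_logprobs, tokens,
                     w_ppl=1.0, w_rep=1.0):
    """Illustrative combined objective (hypothetical, not the paper's code).

    distill_loss:   a precomputed loss encouraging the victim model to match
                    a toxified fine-tuned teacher.
    token_logprobs: per-token log-probabilities of the attack prompt under a
                    reference language model (lower means less fluent).
    tokens:         the attack prompt's token ids, used for the repetition term.
    """
    # Perplexity penalty: exp of the average negative log-probability.
    nll = -sum(token_logprobs) / len(token_logprobs)
    perplexity = math.exp(nll)

    # Repetition penalty: fraction of adjacent token pairs that repeat.
    repeats = sum(1 for a, b in zip(tokens, tokens[1:]) if a == b)
    rep_penalty = repeats / max(len(tokens) - 1, 1)

    # Minimizing this trades off attack strength against fluency.
    return distill_loss + w_ppl * perplexity + w_rep * rep_penalty
```

In a real attack loop, an optimizer would mutate the prompt tokens to lower this objective, so that the prompt both elicits the toxified behavior and stays low-perplexity under the reference models.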
Low Difficulty Summary (original content by GrooveSquid.com)
The paper develops new ways to "hack" language models into producing toxic or undesirable text. The authors improve on existing methods by creating more human-like attacks that can fool even heavily safety-trained models. The approach adjusts the prompt so that the model mimics how it would behave if it had been fine-tuned for toxicity, yielding a high success rate at getting the model to comply with malicious requests.

Keywords

» Artificial intelligence  » Distillation  » Fine tuning  » Llama  » Perplexity  » Prompt  » Prompting