Summary of Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?, by Sravanti Addepalli et al.
Does Safety Training of LLMs Generalize to Semantically Related Natural Prompts?
by Sravanti Addepalli, Yerram Varun, Arun Suggala, Karthikeyan Shanmugam, Prateek Jain
First submitted to arXiv on: 4 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | Large Language Models (LLMs) can be made to generate objectionable content by carefully crafted adversarial attacks or jailbreaks, despite safety fine-tuning. This study asks a different question: can popular aligned LLMs such as GPT-4 be compromised by natural prompts that are merely semantically related to toxic seed prompts? Surprisingly, the authors find that naive prompts with no explicit jailbreaking objective can compromise these models. To evaluate how well safety alignment generalizes to such natural prompts, they propose Response Guided Question Augmentation (ReG-QA): an unaligned LLM first generates several toxic answers to a seed prompt, and a second LLM then generates natural questions that would plausibly produce those answers (a minimal code sketch of this pipeline appears below the table). Notably, GPT-4o, despite being safety fine-tuned, readily produces such natural jailbreak questions when given unsafe content. The resulting prompts achieve attack success rates comparable to or better than leading adversarial attack methods on the JailbreakBench leaderboard, while being more stable against defenses such as Smooth-LLM and Synonym Substitution. |
Low | GrooveSquid.com (original content) | This study is about making sure that language models don’t generate harmful or offensive content, even after they have been trained to behave safely. The researchers tested how easily these models can be tricked into producing bad content using normal-sounding questions instead of special “hacking” prompts. They found that even some of the best-behaved models can still be tricked with just a few simple questions. The study shows that we need new ways to make sure language models are safe and don’t produce bad content. |
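For readers who prefer code, here is a minimal, hypothetical sketch of the two-step ReG-QA pipeline described in the medium summary. The function and parameter names (`unaligned_llm`, `question_llm`, `n_answers`, etc.) and the prompt wording are illustrative assumptions, not the authors' implementation; the paper's actual models, prompts, and filtering steps differ.

```python
# Hypothetical sketch of the ReG-QA idea summarized above.
# The callables, prompt wording, and counts are illustrative stand-ins,
# not the authors' actual models or prompts.
from typing import Callable, List


def reg_qa_candidates(
    seed_prompt: str,
    unaligned_llm: Callable[[str], List[str]],  # returns several completions for a prompt
    question_llm: Callable[[str], List[str]],   # returns several completions for a prompt
    n_answers: int = 5,
    n_questions_per_answer: int = 5,
) -> List[str]:
    """Generate candidate natural jailbreak questions for a toxic seed prompt.

    Step 1: sample several (unsafe) answers to the seed prompt from an unaligned LLM.
    Step 2: for each answer, ask a second LLM to propose natural-sounding questions
            for which that answer would be a plausible response.
    """
    candidates: List[str] = []

    # Step 1: answers from the unaligned model
    answers = unaligned_llm(seed_prompt)[:n_answers]

    # Step 2: reverse-generate questions likely to elicit each answer
    for answer in answers:
        reverse_prompt = (
            "Write several natural questions a person might ask for which the "
            f"following text would be a direct answer:\n\n{answer}"
        )
        candidates.extend(question_llm(reverse_prompt)[:n_questions_per_answer])

    return candidates


if __name__ == "__main__":
    # Dummy callables so the sketch runs end-to-end without any real model.
    dummy_answers = lambda prompt: [f"answer {i} to: {prompt}" for i in range(3)]
    dummy_questions = lambda prompt: [f"question {i} for: {prompt[:40]}..." for i in range(2)]
    print(reg_qa_candidates("toxic seed prompt", dummy_answers, dummy_questions))
```

In practice, the resulting candidate questions would then be sent to the safety-tuned target model (for example GPT-4o in the summary above) and judged for unsafe responses to measure attack success.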
Keywords
» Artificial intelligence » Alignment » Fine tuning » Generalization » Gpt