Summary of Steering Language Model Refusal with Sparse Autoencoders, by Kyle O'Brien et al.
Steering Language Model Refusal with Sparse Autoencoders
by Kyle O'Brien, David Majercak, Xavier Fernandes, Richard Edgar, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, Forough Poursabzi-Sangdeh
First submitted to arXiv on: 18 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on the paper's arXiv page |
Medium | GrooveSquid.com (original content) | This paper addresses responsible language model deployment: steering models to recognize and refuse prompts deemed unsafe while still complying with safe ones. Doing so typically requires updating model weights, which can be costly and inflexible. The study instead explores guiding model activations at inference time, without any weight updates, by using sparse autoencoders to identify and steer refusal-related features in Phi-3 Mini. The findings indicate that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms, including multi-turn attacks, but the approach also degrades overall benchmark performance. These results suggest that identifying steerable refusal mechanisms via sparse autoencoders is a promising route to improving language model safety, though more research is needed to mitigate feature steering's adverse effects on performance. A rough sketch of the steering mechanism appears below this table. |
Low | GrooveSquid.com (original content) | Imagine you're working with a super smart computer program that can answer questions and write stories. But what if someone asks it to do something mean or hurtful? That's where responsible practices come in. The goal is to make sure the program knows when to say "no" to those kinds of requests. One way to achieve this is by updating the program's inner workings, but that can be time-consuming and limiting. This research looks into a different approach: steering the program's behavior at the moment it answers questions. The authors used special techniques to help the Phi-3 Mini program recognize when it should refuse certain prompts. Their results show that this method can make the program more resilient against attempts to trick it, but they also found that it might slightly decrease its overall performance. |
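The feature-steering idea described in the medium summary can be illustrated with a short PyTorch sketch. This is not the paper's implementation: it assumes a pretrained sparse autoencoder exposing `encode`/`decode` methods, and the layer index, feature index, and steering scale below are hypothetical placeholders chosen only to show the mechanism of editing activations at inference time.

```python
# Minimal sketch of SAE-based feature steering at inference time (illustrative only).
# Assumes `sae.encode`/`sae.decode` map residual-stream activations to sparse
# feature activations and back; these method names are assumptions, not the paper's API.
import torch

def make_steering_hook(sae, feature_idx, scale):
    """Forward hook that boosts one SAE feature in a layer's output activations."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # [batch, seq, d_model]
        feats = sae.encode(hidden)           # sparse feature activations
        recon = sae.decode(feats)            # SAE reconstruction of the activations
        error = hidden - recon               # keep whatever the SAE fails to reconstruct
        feats[..., feature_idx] += scale     # amplify the candidate refusal feature
        steered = sae.decode(feats) + error  # rebuild the steered activation
        if isinstance(output, tuple):
            return (steered,) + output[1:]   # returning a value replaces the layer output
        return steered
    return hook

# Usage (placeholder names): attach the hook to one decoder layer, then generate.
# handle = model.model.layers[LAYER_IDX].register_forward_hook(
#     make_steering_hook(sae, feature_idx=REFUSAL_FEATURE, scale=8.0))
# ... model.generate(...) ...
# handle.remove()  # detach to restore unsteered behavior
```

Because the hook only edits activations during the forward pass, the model's weights are untouched, which is the flexibility the paper contrasts with fine-tuning-based safety updates.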
Keywords
- Artificial intelligence
- Inference
- Language model
- Machine learning