Summary of Rule Based Rewards For Language Model Safety, by Tong Mu et al.
Rule Based Rewards for Language Model Safety
by Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng
First submitted to arXiv on: 2 Nov 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | Read the original abstract here | 
| Medium | GrooveSquid.com (original content) | Reinforcement learning-based fine-tuning of large language models (LLMs) on human preferences has been shown to enhance their capabilities and safety behavior. However, without precise instructions, the collected data may cause the model to become overly cautious or to respond in undesirable ways, such as being judgmental. To address this issue, we propose a novel preference modeling approach that uses AI feedback and requires only a small amount of human data. Our method, Rule Based Rewards (RBR), uses a collection of rules for desired or undesired behaviors along with an LLM grader. Unlike prior methods using AI feedback, our approach employs fine-grained, composable, LLM-graded few-shot prompts as rewards directly in RL training, resulting in greater control, accuracy, and ease of updating. We demonstrate that RBRs are an effective training method, achieving an F1 score of 97.1 compared to a human-feedback baseline of 91.7, and yielding higher safety-behavior accuracy by better balancing usefulness and safety. | 
| Low | GrooveSquid.com (original content) | This paper is about how to make language models safer by giving them instructions on what behaviors are good or bad. Right now, making language models safe can be tricky because it’s hard to get the right data to train them. We came up with a new way to do this using AI feedback and only a little bit of human help. Our method uses rules to say what kind of behavior is good or bad, and it works really well. It even beats a method that uses human feedback by a lot! This means we can make language models safer and more useful at the same time. | 
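To make the idea of composable, rule-based rewards more concrete, here is a minimal Python sketch. It is not the paper's implementation: the `Rule` class, the two keyword-based graders, and the weights are all hypothetical stand-ins. In particular, the paper uses an LLM grader driven by few-shot prompts, whereas the graders here are simple string checks so that the example runs on its own. The sketch also includes the standard F1 formula, since the summaries report results as F1 scores.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rule:
    """One desired or undesired behavior (hypothetical structure)."""
    name: str
    weight: float  # assumed convention: positive = desired, negative = undesired
    grader: Callable[[str, str], float]  # returns a score in [0, 1]


def contains_apology(prompt: str, response: str) -> float:
    # Stand-in for an LLM grader: a plain keyword check.
    return 1.0 if "sorry" in response.lower() else 0.0


def is_judgmental(prompt: str, response: str) -> float:
    # Stand-in for an LLM grader: a plain keyword check.
    return 1.0 if "you should be ashamed" in response.lower() else 0.0


def rule_based_reward(prompt: str, response: str, rules: List[Rule]) -> float:
    """Compose the individual rule scores into one scalar reward (weighted sum)."""
    return sum(rule.weight * rule.grader(prompt, response) for rule in rules)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1, written as 2*TP / (2*TP + FP + FN)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Illustrative weights only; they are not taken from the paper.
RULES = [
    Rule("apologizes", 1.0, contains_apology),
    Rule("judgmental", -2.0, is_judgmental),
]

if __name__ == "__main__":
    prompt = "An unsafe request."
    print(rule_based_reward(prompt, "Sorry, I can't help with that.", RULES))  # 1.0
    print(rule_based_reward(prompt, "Sorry, but you should be ashamed for asking.", RULES))  # -1.0
```

Because each rule is a separate object, a behavior can be added, removed, or reweighted without touching the others, which is one plausible reading of the "composable" and "ease of updating" claims in the medium summary.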
Keywords
- Artificial intelligence
- F1 score
- Few shot
- Fine tuning
- Reinforcement learning




