Summary of Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, by Wei Shen et al.
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
by Wei Shen, Chuheng Zhang
First submitted to arXiv on: 11 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract, available on its arXiv entry |
| Medium | GrooveSquid.com (original content) | This research paper investigates reinforcement learning from human feedback (RLHF) for large language models (LLMs), focusing on RL-based methods such as PPO that train LLMs to generate helpful and harmless responses. These methods rely on an intermediate reward model learned from preference data, and because the accuracy of this reward model varies across responses, the resulting reward signal can be unreliable. To address this issue, the authors propose Policy Filtration for Proximal Policy Optimization (PF-PPO), which filters out samples with potentially unreliable rewards to improve the signal-to-noise ratio during policy learning. To choose a suitable filtration strategy, they use the coefficient of determination (R^2) between rewards and actual scores on the filtered samples as a guiding metric, which reveals several promising strategies (a minimal, illustrative sketch of this idea follows the table). Experiments on code generation tasks demonstrate the effectiveness of PF-PPO, which achieves new state-of-the-art performance among 7-billion-parameter models. |
| Low | GrooveSquid.com (original content) | Reinforcement learning from human feedback helps large language models follow instructions and give helpful responses. The problem is that current methods rely on an intermediate reward model learned from preference data, and that reward model can be inaccurate. This paper proposes filtering out samples whose rewards look unreliable, which improves the signal-to-noise ratio during training. The authors also show how to pick a good filter by checking how well the rewards match actual scores. Experiments show that the new approach works well on code generation tasks. |
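For readers who want a concrete picture of the mechanics described in the medium summary, here is a minimal sketch. It is our own illustration, not code from the paper: the function names (`filter_by_reward`, `r_squared`), the "top" / "top_bottom" strategies, the keep ratio, and the use of per-sample "actual scores" (e.g., unit-test pass rates for generated code) are assumptions made for clarity; the paper’s actual filtration schemes and PPO training loop may differ.

```python
import numpy as np

def filter_by_reward(samples, keep_ratio=0.5, mode="top"):
    """Keep the responses to one prompt whose reward-model scores we trust most.

    samples: list of (response, reward) pairs generated for the same prompt.
    mode="top" keeps the highest-reward responses; mode="top_bottom" keeps
    both extremes. Both are illustrative filtration strategies, not the
    paper's exact schemes.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    if mode == "top":
        return ranked[:k]
    if mode == "top_bottom":
        half = max(1, k // 2)
        return ranked[:half] + ranked[-half:]
    return ranked  # no filtering

def r_squared(rewards, actual_scores):
    """Coefficient of determination between rewards and actual scores.

    For a simple linear fit, R^2 equals the squared Pearson correlation, so
    a higher value suggests the reward model is more reliable on the kept
    samples (actual_scores could be, e.g., unit-test pass rates).
    """
    r = np.corrcoef(rewards, actual_scores)[0, 1]
    return float(r ** 2)

# Toy usage: compare two candidate filtration strategies by their R^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(size=64)                              # reward-model scores
    scores = 0.7 * rewards + rng.normal(scale=0.5, size=64)    # noisy "ground truth"
    samples = list(zip([f"resp_{i}" for i in range(64)], rewards.tolist()))
    for mode in ("top", "top_bottom"):
        kept = filter_by_reward(samples, keep_ratio=0.5, mode=mode)
        idx = [int(name.split("_")[1]) for name, _ in kept]
        print(mode, round(r_squared(rewards[idx], scores[idx]), 3))
```

The toy comparison at the end mirrors the selection idea in the summaries: compute R^2 on the samples each candidate filter keeps, and prefer the filter whose kept samples show stronger agreement between rewards and actual scores.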
Keywords
» Artificial intelligence » Optimization » Reinforcement learning from human feedback » RLHF