Summary of Policy Filtration in RLHF to Fine-Tune LLM for Code Generation, by Wei Shen et al.
Policy Filtration in RLHF to Fine-Tune LLM for Code Generation
by Wei Shen, Chuheng Zhang
First submitted to arXiv on: 11 Sep 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper but are written at different levels of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract, available on its arXiv entry |
| Medium | GrooveSquid.com (original content) | This research paper investigates reinforcement learning from human feedback (RLHF) for large language models (LLMs), focusing on RL-based methods such as PPO that train LLMs to generate helpful and harmless responses. These methods rely on an intermediate reward model learned from preference data, and because the accuracy of this reward model varies across responses, the resulting reward signal can be unreliable. To address this issue, the authors propose Policy Filtration for Proximal Policy Optimization (PF-PPO), which filters out samples with potentially unreliable rewards to improve the signal-to-noise ratio during policy learning. To choose a suitable filtration strategy, they use the coefficient of determination (R^2) between rewards and actual scores on the filtered samples as a guiding metric, which reveals several promising strategies (a minimal, illustrative sketch of this idea follows the table). Experiments on code generation tasks demonstrate the effectiveness of PF-PPO, which achieves new state-of-the-art performance among 7-billion-parameter models. |
| Low | GrooveSquid.com (original content) | Reinforcement learning from human feedback helps large language models follow instructions and give helpful responses. The problem is that current methods rely on an intermediate reward model learned from preference data, and that reward model can be inaccurate. This paper proposes filtering out samples whose rewards look unreliable, which improves the signal-to-noise ratio during training. The authors also show how to pick a good filter by checking how well the rewards match actual scores. Experiments show that the new approach works well on code generation tasks. |
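For readers who want a concrete picture of the mechanics described in the medium summary, here is a minimal sketch. It is our own illustration, not code from the paper: the function names (`filter_by_reward`, `r_squared`), the "top" / "top_bottom" strategies, the keep ratio, and the use of per-sample "actual scores" (e.g., unit-test pass rates for generated code) are assumptions made for clarity; the paper’s actual filtration schemes and PPO training loop may differ.

```python
import numpy as np

def filter_by_reward(samples, keep_ratio=0.5, mode="top"):
    """Keep the responses to one prompt whose reward-model scores we trust most.

    samples: list of (response, reward) pairs generated for the same prompt.
    mode="top" keeps the highest-reward responses; mode="top_bottom" keeps
    both extremes. Both are illustrative filtration strategies, not the
    paper's exact schemes.
    """
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    if mode == "top":
        return ranked[:k]
    if mode == "top_bottom":
        half = max(1, k // 2)
        return ranked[:half] + ranked[-half:]
    return ranked  # no filtering

def r_squared(rewards, actual_scores):
    """Coefficient of determination between rewards and actual scores.

    For a simple linear fit, R^2 equals the squared Pearson correlation, so
    a higher value suggests the reward model is more reliable on the kept
    samples (actual_scores could be, e.g., unit-test pass rates).
    """
    r = np.corrcoef(rewards, actual_scores)[0, 1]
    return float(r ** 2)

# Toy usage: compare two candidate filtration strategies by their R^2.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.normal(size=64)                              # reward-model scores
    scores = 0.7 * rewards + rng.normal(scale=0.5, size=64)    # noisy "ground truth"
    samples = list(zip([f"resp_{i}" for i in range(64)], rewards.tolist()))
    for mode in ("top", "top_bottom"):
        kept = filter_by_reward(samples, keep_ratio=0.5, mode=mode)
        idx = [int(name.split("_")[1]) for name, _ in kept]
        print(mode, round(r_squared(rewards[idx], scores[idx]), 3))
```

The toy comparison at the end mirrors the selection idea in the summaries: compute R^2 on the samples each candidate filter keeps, and prefer the filter whose kept samples show stronger agreement between rewards and actual scores.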
Keywords
» Artificial intelligence » Optimization » Reinforcement learning from human feedback » RLHF