Summary of Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning, by Yifang Chen et al.
Cost-Effective Proxy Reward Model Construction with On-Policy and Active Learning
by Yifang Chen, Shuohang Wang, Ziyi Yang, Hiteshi Sharma, Nikos Karampatziakis, Donghan Yu, Kevin Jamieson, Simon Shaolei Du, Yelong Shen
First submitted to arXiv on: 2 Jul 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Reinforcement Learning from Human Feedback (RLHF), a key stage in large language model pipelines, is constrained by the size of available human preference data. Traditional methods rely on offline preference dataset construction, while recent online approaches use labeled seed data and unlabeled prompts to construct new preferences from self-generated responses and high-quality feedback. However, most current algorithms query preference labels during policy model updates, incurring significant expert query costs. This paper introduces strategies for constructing cost-effective proxy reward oracles that label preferences using limited data and expert queries. The approach combines two innovations: on-policy querying to avoid out-of-distribution (OOD) issues, and active learning to select the most informative data. With minimal labeled data, the method trains an evaluation model that can label roughly nine times more preference pairs for further RLHF training. For instance, a Direct Preference Optimization (DPO) model gains a 1% average improvement on AlpacaEval2, MMLU-5shot, and MMLU-0shot at a cost of only 1.7K expert queries. A code sketch of this construction loop follows the table. |
Low | GrooveSquid.com (original content) | RLHF is an approach that helps large language models learn from human feedback. The problem is that it requires a lot of data and expert input to train the model. This paper proposes new ways to reduce the amount of labeled data needed while still getting good results. It uses two main techniques: on-policy querying, which avoids mistakes by labeling responses the model itself generates, and active learning, which selects the most informative data for the model to learn from. |
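To make the recipe in the medium-difficulty summary concrete, here is a minimal Python sketch of the two ideas working together: on-policy querying (candidate response pairs are sampled from the policy itself) and active-learning selection (only the pairs the current proxy is least certain about are sent to the expert, up to a fixed query budget). This is not the authors' implementation; the helpers `policy_generate`, `expert_preference`, and `ProxyRewardModel` are hypothetical placeholders standing in for the policy model, the costly expert labeler, and a small learned evaluation model.

```python
import random

# --- Hypothetical placeholders (not from the paper) -------------------------
def policy_generate(prompt):
    """Stand-in for sampling two responses from the current policy model.
    On-policy querying: candidates come from the policy itself, so the proxy
    reward model is trained on the same distribution it will later score."""
    return f"{prompt} :: response A", f"{prompt} :: response B"

def expert_preference(prompt, resp_a, resp_b):
    """Stand-in for a costly expert (or strong-LLM) query returning the
    index of the preferred response (0 or 1)."""
    return random.randint(0, 1)

class ProxyRewardModel:
    """Toy evaluation model; a real one would be a small LM head trained
    with a preference (e.g. Bradley-Terry) loss on the labeled pairs."""
    def score(self, prompt, response):
        return random.random()          # placeholder scalar reward
    def fit(self, labeled_pairs):
        pass                            # placeholder training step

# --- Cost-limited proxy construction loop ------------------------------------
def build_proxy_reward(prompts, query_budget=1_700, batch_size=64):
    proxy = ProxyRewardModel()
    labeled = []
    pool = [(p, *policy_generate(p)) for p in prompts]   # on-policy pairs

    while len(labeled) < query_budget and pool:
        # Active learning: rank pairs by how unsure the current proxy is
        # (smallest score gap first) and label only the most informative ones.
        pool.sort(key=lambda t: abs(proxy.score(t[0], t[1]) - proxy.score(t[0], t[2])))
        take = min(batch_size, query_budget - len(labeled))
        batch, pool = pool[:take], pool[take:]

        for prompt, a, b in batch:
            labeled.append((prompt, a, b, expert_preference(prompt, a, b)))
        proxy.fit(labeled)              # refit proxy after each labeled batch

    return proxy, labeled
```

Once trained within the query budget, the proxy can cheaply label many more on-policy preference pairs (the summary above cites roughly nine times the expert-labeled amount) for downstream DPO or RLHF training.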
Keywords
* Artificial intelligence
* Active learning
* Large language model
* Optimization
* Reinforcement learning
* RLHF