Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

by Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou

First submitted to arXiv on: 24 Oct 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
Read the original abstract here

Medium Difficulty Summary (GrooveSquid.com, original content)
This report presents a set of methods to improve reward modeling for large language models (LLMs), with a focus on data-centric techniques. The authors propose effective strategies for selecting and filtering open-source preference datasets, resulting in the Skywork-Reward dataset, which contains only 80K preference pairs and is significantly smaller than existing datasets. Using this curated dataset, they developed two model series, Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B, with the former currently holding the top position on the RewardBench leaderboard. The authors’ techniques and datasets have directly improved the performance of many top-ranked models on RewardBench, highlighting their practical impact in real-world preference learning applications.

Low Difficulty Summary (GrooveSquid.com, original content)
This report is about improving how we teach large language models to learn from rewards. The researchers came up with new ways to pick and choose the best data for training these models. They created a special dataset called Skywork-Reward that has only 80,000 preference pairs, which is much smaller than what they started with. Using this dataset, they built two models: Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B. The Gemma model became the best performer on a leaderboard called RewardBench. The results show that their approach can help real-world applications learn from human preferences.

Keywords

» Artificial intelligence  » Llama