


CREAM: Consistency Regularized Self-Rewarding Language Models

by Zhaoyang Wang, Weilei He, Zhiyuan Liang, Xuchao Zhang, Chetan Bansal, Ying Wei, Weitong Zhang, Huaxiu Yao

First submitted to arXiv on: 16 Oct 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL)

     Abstract of paper      PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
This is the paper's original abstract; read it via the "Abstract of paper" link above.

Medium Difficulty Summary (original content by GrooveSquid.com)
A recent study on large language models (LLMs) explores the concept of LLM-as-a-Judge, in which the same model both generates responses and scores them. This approach enables iterative improvement of alignment performance without requiring human-annotated preference data. However, the process relies heavily on the accuracy of the rewarding and ranking mechanism, and that accuracy is critical for producing reliable rewards and high-quality preference data. The study also shows that the gains from self-rewarding can diminish after several iterations because bias accumulates in the reward signal, leading to unreliable preference data. To address this issue, the researchers formulate a generalized iterative preference fine-tuning framework and introduce regularization to mitigate overconfident preference labeling. This leads to the Consistency Regularized sElf-rewarding lAnguage Model (CREAM), which leverages the consistency of rewards across iterations to regularize self-rewarding training. Empirical results demonstrate that CREAM improves both reward consistency and alignment performance. The code for this study is publicly available on GitHub.

Low Difficulty Summary (original content by GrooveSquid.com)
Recent large language models have successfully used themselves as judges to improve alignment performance without needing human annotations. These models act as both policy models (generating responses) and reward models (scoring and ranking those responses). However, there is no guarantee that the rewarding and ranking are accurate, and that accuracy is crucial for reliable rewards and high-quality preference data. The study shows that the gains from self-rewarding can stall after a few iterations because of accumulated bias, which leads to bad preference data. To fix this, the researchers came up with a new framework and added rules so the model does not become too confident in its own judgments. They propose a model called CREAM (Consistency Regularized sElf-rewarding lAnguage Model) that helps the model learn from better preference data. The results show that CREAM does a better job of producing good rewards and improving alignment performance.

Keywords

» Artificial intelligence  » Alignment  » Fine tuning  » Language model  » Regularization