Summary of Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs, by Rui Yang et al.
Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs
by Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, Tong Zhang
First submitted to arXiv on: 14 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This research paper introduces a novel approach to improving the generalization ability of reward models for Large Language Models (LLMs) trained on human preference data. Current reward models are effective at aligning LLMs with human intent but generalize poorly to unseen prompts and responses, which leads to a phenomenon known as reward over-optimization. To address this issue, the study proposes a regularization technique that retains the base model's language model head and incorporates text-generation losses to preserve the hidden states' text-generation capabilities while a reward head is learned on the same hidden states (a hedged code sketch follows the table). The experimental results show that this approach significantly improves the accuracy of learned reward models across various out-of-distribution (OOD) tasks and alleviates the over-optimization issue in reinforcement learning from human feedback (RLHF). This paradigm offers a more reliable and robust preference learning framework for large language models. |
Low | GrooveSquid.com (original content) | Large Language Models are trained with the help of data about what humans prefer. This makes them good at understanding what humans like, but not as good at handling new things they haven't seen before. It can also cause a model to get too good at chasing its training score instead of being genuinely helpful. The researchers found a way to fix this by adding some extra steps to the training process that help the model stay flexible and adaptable. This means it will be better at understanding new things rather than just memorizing old patterns, which matters because it lets people rely on these models for more complex tasks, such as creative writing. |
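
The snippet below is a minimal, hedged sketch of the kind of regularized reward model the medium-difficulty summary describes: a causal LM backbone whose language-model head is retained, a small scalar reward head on the same hidden states, and a training loss that combines a pairwise preference term with a text-generation (next-token) term. This is not the authors' implementation; the base model name, the `lm_coef` weight, and the helper names are illustrative assumptions.

```python
# Hedged sketch (not the paper's code): a reward model that keeps the base
# model's LM head and adds a text-generation loss as a regularizer, alongside
# the standard pairwise (Bradley-Terry) preference loss.
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModelForCausalLM


class RegularizedRewardModel(nn.Module):
    def __init__(self, base_name: str = "gpt2"):  # base model is an assumption
        super().__init__()
        # Causal LM backbone: its hidden states feed both the retained LM head
        # and a newly added scalar reward head.
        self.backbone = AutoModelForCausalLM.from_pretrained(base_name)
        hidden_size = self.backbone.config.hidden_size
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = out.hidden_states[-1]               # (B, T, H)
        # Read the reward off the final non-padding token's hidden state.
        last_idx = attention_mask.sum(dim=1) - 1          # (B,)
        pooled = last_hidden[torch.arange(last_hidden.size(0)), last_idx]
        reward = self.reward_head(pooled).squeeze(-1)     # (B,)
        return reward, out.logits                         # logits from LM head


def training_loss(model, chosen, rejected, lm_coef: float = 0.1):
    """Pairwise preference loss + LM (text-generation) regularizer.

    `chosen` / `rejected` are dicts with `input_ids` and `attention_mask`;
    `lm_coef` is an illustrative weight, not a value from the paper.
    """
    r_chosen, logits = model(**chosen)
    r_rejected, _ = model(**rejected)

    # Bradley-Terry preference loss: the chosen response should score higher.
    pref_loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Next-token prediction on the chosen sequence keeps the shared hidden
    # states capable of text generation (the regularization idea).
    labels = chosen["input_ids"][:, 1:]
    lm_logits = logits[:, :-1]
    mask = chosen["attention_mask"][:, 1:].bool()
    lm_loss = F.cross_entropy(lm_logits[mask], labels[mask])

    return pref_loss + lm_coef * lm_loss
```

In this sketch, the LM loss on the chosen response is the regularizer that preserves the hidden states' text-generation ability, which is the mechanism the summary credits for better out-of-distribution accuracy and reduced reward over-optimization.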
Keywords
» Artificial intelligence » Generalization » Language model » Optimization » Regularization » Reinforcement learning from human feedback » Rlhf » Text generation