Summary of Self-boosting Large Language Models with Synthetic Preference Data, by Qingxiu Dong et al.
Self-Boosting Large Language Models with Synthetic Preference Data
by Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei
First submitted to arXiv on: 9 Oct 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper introduces SynPO, a self-boosting paradigm that uses synthetic preference data to align Large Language Models (LLMs) with human preferences. An iterative mechanism pairs a self-prompt generator, which creates diverse prompts, with a response improver, which progressively refines model responses; the model thereby learns generative rewards for its own outputs, eliminating the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show win-rate improvements of over 22.1% on AlpacaEval 2.0 and ArenaHard, and SynPO also lifts general performance, with a 3.2 to 5.0 average-score increase on the Open LLM leaderboard.
Low | GrooveSquid.com (original content) | This research paper introduces a new way to make computers better at understanding what humans want them to do. Today these systems are trained with large amounts of data and human guidance, a process that is slow and expensive. The authors propose an approach called SynPO that uses synthetic (machine-generated) preference data so computers can learn on their own. Over repeated rounds, this lets them make better decisions, and the results show significant improvements in their ability to follow instructions and perform a variety of tasks.
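The iterative mechanism described in the medium-difficulty summary can be illustrated with a toy sketch. Everything below is a hypothetical, simplified simulation, not the paper's actual implementation: the model is a single numeric "skill" value, and the prompt generator, response improver, and preference-optimization step are stand-in functions invented for this example.

```python
import random

def self_prompt_generator(rng, topics):
    # Hypothetical stand-in for the self-prompt generator: emits a diverse prompt.
    return f"Explain {rng.choice(topics)} in one paragraph."

def generate_response(model, prompt, rng):
    # Toy model output: quality depends on the current model skill plus noise.
    return {"prompt": prompt, "quality": model["skill"] + rng.random()}

def response_improver(response):
    # Toy response improver: refines the output into a strictly better version.
    better = dict(response)
    better["quality"] += 0.5  # refinement adds a fixed quality margin
    return better

def preference_optimize(model, preference_pairs, lr=0.1):
    # Toy preference-optimization step: nudge skill by the margin between the
    # chosen (improved) and rejected (original) response in each pair.
    for chosen, rejected in preference_pairs:
        model["skill"] += lr * (chosen["quality"] - rejected["quality"])
    return model

def synpo_loop(iterations=4, prompts_per_iter=8, seed=0):
    rng = random.Random(seed)
    topics = ["gradient descent", "attention", "tokenization", "preference tuning"]
    model = {"skill": 0.0}
    for _ in range(iterations):
        pairs = []
        for _ in range(prompts_per_iter):
            prompt = self_prompt_generator(rng, topics)
            rejected = generate_response(model, prompt, rng)
            chosen = response_improver(rejected)   # synthetic preference pair
            pairs.append((chosen, rejected))
        model = preference_optimize(model, pairs)  # train on synthetic pairs only
    return model

model = synpo_loop()  # skill rises each iteration with no human labels
```

The point of the sketch is the data flow, not the arithmetic: each round, the system manufactures its own (prompt, rejected, chosen) triples and trains on them, so no human-annotated preferences enter the loop.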
Keywords
- Artificial intelligence
- Boosting
- Prompt