Summary of Token-level Proximal Policy Optimization for Query Generation, by Yichen Ouyang et al.
Token-level Proximal Policy Optimization for Query Generation
by Yichen Ouyang, Lu Wang, Fangkai Yang, Pu Zhao, Chenghua Huang, Jianfeng Liu, Bochen Pang, Yaming Yang, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang
First submitted to arXiv on: 1 Nov 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to query generation that leverages Large Language Models (LLMs) to infer user intent from web search interaction history. The proposed Token-level Proximal Policy Optimization (TPPO) method fine-tunes LLMs to generate higher-quality queries. TPPO combines a token-level reward model with a proximal policy optimization module to address the sparse-reward challenge in Reinforcement Learning from AI Feedback (RLAIF) frameworks (a rough sketch of this per-token idea appears after the table). Experiments on both open-source and industrial datasets show that TPPO significantly improves query generation performance and outperforms existing methods. |
Low | GrooveSquid.com (original content) | Query generation is important for search engines and recommendation systems. Researchers have used Large Language Models to improve this task, but these models still struggle to generate good queries. The solution proposed in this paper, called Token-level Proximal Policy Optimization (TPPO), helps the models learn by giving them rewards when they do a good job. This new approach works better than previous methods. |
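To make the "token-level reward" idea from the medium summary more concrete, here is a minimal sketch. This is not the paper's code: it assumes per-token rewards are already available from some token-level reward model, and all function names, shapes, and hyperparameters are illustrative only. It contrasts with standard RLAIF setups where a single sparse reward arrives only at the end of the generated query.

```python
# Minimal, illustrative sketch of a token-level PPO-style objective.
# NOT the paper's implementation; shapes and hyperparameters are assumptions.
import numpy as np

def token_level_ppo_loss(new_logprobs, old_logprobs, token_rewards,
                         gamma=0.99, clip_eps=0.2):
    """Clipped PPO surrogate computed per generated token.

    new_logprobs, old_logprobs: log-probs of the sampled tokens under the
        current and behavior policies, shape (T,).
    token_rewards: per-token rewards from a (hypothetical) token-level
        reward model, shape (T,).
    """
    T = len(token_rewards)
    # Discounted return-to-go per token; a learned value baseline could be
    # subtracted here, omitted to keep the sketch short.
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        running = token_rewards[t] + gamma * running
        advantages[t] = running
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    ratio = np.exp(new_logprobs - old_logprobs)           # importance ratio per token
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)  # PPO clipping
    return -np.mean(np.minimum(ratio * advantages, clipped * advantages))

# Toy usage: a 5-token query where dense, per-token feedback is available.
rng = np.random.default_rng(0)
old_lp = rng.normal(-2.0, 0.3, size=5)
new_lp = old_lp + rng.normal(0.0, 0.05, size=5)
rewards = np.array([0.1, 0.3, -0.2, 0.4, 0.5])
print(token_level_ppo_loss(new_lp, old_lp, rewards))
```

Because every token carries its own reward, the advantage signal is non-zero throughout the sequence rather than only at the final token, which is the intuition behind how a token-level reward model can mitigate the sparse-reward problem described in the summary.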
Keywords
* Artificial intelligence
* Optimization
* Reinforcement learning
* Token