
Summary of Token-level Proximal Policy Optimization For Query Generation, by Yichen Ouyang et al.


Token-level Proximal Policy Optimization for Query Generation

by Yichen Ouyang, Lu Wang, Fangkai Yang, Pu Zhao, Chenghua Huang, Jianfeng Liu, Bochen Pang, Yaming Yang, Yuefeng Zhan, Hao Sun, Qingwei Lin, Saravan Rajmohan, Weiwei Deng, Dongmei Zhang, Feng Sun, Qi Zhang

First submitted to arXiv on: 1 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty | Written by | Summary

High | Paper authors | High Difficulty Summary: Read the original abstract here

Medium | GrooveSquid.com (original content) | Medium Difficulty Summary: A novel approach to query generation is proposed, leveraging Large Language Models (LLMs) to infer user intent based on web search interaction history. The Token-level Proximal Policy Optimization (TPPO) method fine-tunes LLMs for better performance in generating high-quality queries. TPPO combines a token-level reward model and a proximal policy optimization module to address the sparse reward challenge in Reinforcement Learning from AI Feedback (RLAIF) frameworks. Experiments on both open-source and industrial datasets show that TPPO significantly improves query generation performance, outperforming existing methods.

Low | GrooveSquid.com (original content) | Low Difficulty Summary: Query generation is important for search engines and recommendation systems. Researchers used Large Language Models to improve this task. However, they still struggled with generating good queries. The solution proposed in this paper is called Token-level Proximal Policy Optimization (TPPO). It helps the models learn by giving them rewards when they do a good job. This new approach worked better than previous methods.
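
To make the "sparse reward" point in the medium summary concrete, here is a minimal Python sketch that contrasts a single end-of-sequence reward with per-token rewards and feeds both into a PPO-style clipped objective. It is an illustration only, not the authors' implementation: the function names, the four-token example, the reward values, and the use of undiscounted returns-to-go as a stand-in for advantages (with no value baseline) are all assumptions made for this example.

```python
import math

def discounted_returns(rewards, gamma=1.0):
    """Return-to-go for each token: reward at that step plus discounted future rewards."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns

def clipped_token_objective(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate, computed per token and averaged over the sequence."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Hypothetical 4-token generated query (all values are made up for illustration).
sparse_rewards = [0.0, 0.0, 0.0, 1.0]   # one score for the whole query, given at the end
token_rewards = [0.5, -0.25, 0.25, 0.5]  # a token-level reward model scores every token

# Crude advantages: returns-to-go without a value baseline (a simplification).
sparse_adv = discounted_returns(sparse_rewards)
token_adv = discounted_returns(token_rewards)

# With the sparse reward every token receives identical credit, so the
# learning signal cannot tell helpful tokens from unhelpful ones.
print("sparse advantages:", sparse_adv)
print("token advantages: ", token_adv)

# Evaluate the objective with an unchanged policy (ratio = 1 for every token).
logprobs = [-1.0, -2.0, -0.5, -1.5]
print("objective (sparse):", clipped_token_objective(logprobs, logprobs, sparse_adv))
print("objective (token): ", clipped_token_objective(logprobs, logprobs, token_adv))
```

In this sketch the sparse setting assigns the same return to every token, while the token-level rewards produce distinct per-token signals; that difference is the intuition behind pairing a token-level reward model with the policy optimization step.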

Keywords

  • Artificial intelligence
  • Optimization
  • Reinforcement learning
  • Token