
Summary of Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards, by Haoxiang Wang et al.


Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards

by Haoxiang Wang, Yong Lin, Wei Xiong, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang

First submitted to arXiv on: 28 Feb 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)

Abstract of paper · PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
A novel framework, Directional Preference Alignment (DPA), is introduced to fine-tune large language models (LLMs) so they can adapt to diverse user needs. Unlike traditional Reinforcement Learning from Human Feedback (RLHF) methods that rely on a single scalar reward, DPA employs multi-objective reward modeling and represents user preferences as directions (unit vectors) in the reward space. This enables user-dependent preference control: users can intuitively specify their desired trade-off between objectives such as helpfulness and verbosity. The method combines training a multi-objective reward model with fine-tuning an LLM via Rejection Sampling Finetuning (RSF), achieving a better performance trade-off across reward objectives. Experiments on Mistral-7B show that DPA provides arithmetic control over the helpfulness-verbosity trade-off while remaining competitive with strong baselines.
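To make the directional control idea concrete, below is a minimal illustrative sketch (not the authors' implementation) of how a two-dimensional reward vector of (helpfulness, verbosity) can be scalarized by a user-chosen preference direction and used to pick the best of several sampled responses, as in a rejection-sampling step. The function names (preference_direction, directional_reward, select_best_response), the candidate texts, and the reward numbers are hypothetical placeholders.

import math


def preference_direction(angle_deg: float) -> tuple[float, float]:
    """Map a user-chosen angle to a unit vector in (helpfulness, verbosity)
    reward space. 0 degrees weights helpfulness only; positive angles reward
    verbosity, negative angles penalize it."""
    rad = math.radians(angle_deg)
    return (math.cos(rad), math.sin(rad))


def directional_reward(reward_vec: tuple[float, float],
                       direction: tuple[float, float]) -> float:
    """Scalarize a multi-objective reward by projecting it onto the user's
    preference direction (a dot product)."""
    return reward_vec[0] * direction[0] + reward_vec[1] * direction[1]


def select_best_response(candidates: list[tuple[str, tuple[float, float]]],
                         angle_deg: float) -> str:
    """Rejection-sampling-style selection: keep the candidate whose
    (helpfulness, verbosity) rewards score highest along the chosen direction."""
    direction = preference_direction(angle_deg)
    best_text, _ = max(candidates, key=lambda c: directional_reward(c[1], direction))
    return best_text


if __name__ == "__main__":
    # Hypothetical candidates with (helpfulness, verbosity) scores that a
    # multi-objective reward model might assign (numbers are made up).
    candidates = [
        ("Short, direct answer.", (0.80, 0.20)),
        ("Long, detailed answer with many examples.", (0.85, 0.90)),
    ]
    # A direction that penalizes verbosity picks the concise answer ...
    print(select_best_response(candidates, angle_deg=-30))  # -> short answer
    # ... while a direction that rewards verbosity picks the detailed one.
    print(select_best_response(candidates, angle_deg=60))   # -> long answer

In the paper, a direction-scalarized reward of this kind drives the Rejection Sampling Finetuning step described above, which is what gives the fine-tuned model its arithmetic control over the trade-off.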
Low Difficulty Summary (written by GrooveSquid.com, original content)
Large language models can be difficult to control, making it hard for users to get what they want. A new way to fine-tune these models, called Directional Preference Alignment (DPA), is introduced. Unlike other methods that use a single reward, DPA uses multiple rewards to capture different user preferences. This lets users specify exactly how they want the model to behave, such as being more helpful but less verbose. The method combines training a special kind of reward model with fine-tuning an existing language model. Results show that this approach works well and gives users control over the trade-off between different objectives.

Keywords

  • Artificial intelligence
  • Alignment
  • Fine-tuning
  • Language model
  • Reinforcement learning from human feedback
  • RLHF