
Summary of Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models, by Wenxuan Zhang et al.


Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models

by Wenxuan Zhang, Philip H.S. Torr, Mohamed Elhoseiny, Adel Bibi

First submitted to arXiv on: 27 Aug 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content written by GrooveSquid.com)
The paper proposes a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) for fine-tuning large language models (LLMs) while ensuring their safety. Existing methods rely on reinforcement learning from human feedback (RLHF), where the safety and helpfulness objectives can conflict. BFPO re-parameterizes the joint objective into a single supervised learning problem, using a labeling function to capture global preferences. The framework is evaluated on a comprehensive benchmark of discriminative and generative tasks for helpfulness and harmlessness. Results show that BFPO outperforms existing approaches in both safety and helpfulness while requiring fewer computational resources.
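To make the idea of a single supervised objective more concrete, here is a minimal, hypothetical PyTorch-style sketch; it is not the authors' code, and the function name, the form of the safety offset, and the hyperparameters beta and safety_weight are illustrative assumptions. It folds a binary harmlessness label into a DPO-style preference loss so that helpfulness and safety are handled by one supervised objective rather than a separate RLHF stage.

```python
# Hypothetical sketch (not the paper's implementation): a DPO-style supervised
# loss whose preference margin is shifted by a binary safety label, collapsing
# the two factors (helpfulness, safety) into one supervised objective.
import torch
import torch.nn.functional as F

def bfpo_style_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected,
                    safe_chosen, safe_rejected,
                    beta=0.1, safety_weight=1.0):
    """All arguments are tensors of shape (batch,).

    logp_*     : policy log-probabilities of the chosen / rejected responses
    ref_logp_* : frozen reference-model log-probabilities of the same responses
    safe_*     : 1.0 if the response is labelled harmless, else 0.0
    """
    # Standard DPO-style implicit reward margin between chosen and rejected.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))

    # Illustrative "global preference" offset: when the two safety labels
    # disagree, the margin is shifted so the harmless response is preferred
    # even if it is rated as less helpful.
    safety_offset = safety_weight * (safe_chosen - safe_rejected)

    # Logistic (Bradley-Terry) loss on the combined margin: one supervised
    # objective over both factors, trained directly on labelled data.
    return -F.logsigmoid(margin + safety_offset).mean()
```

In this sketch, safety_weight controls how strongly a harmless response is favored over a harmful one when their safety labels differ; how the actual paper defines its labeling function is described in the original abstract.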
Low Difficulty Summary (original content written by GrooveSquid.com)
The paper is about making language models safer without needing humans to help train them. Currently, people use reinforcement learning from human feedback to fine-tune these models, but it is hard to balance keeping the model safe with keeping it helpful. The new method, called Bi-Factorial Preference Optimization (BFPO), combines safety and helpfulness into a single training goal. It then trains the model on labeled data and matches the safety of other methods while training much faster.

Keywords

» Artificial intelligence  » Optimization  » Reinforcement learning from human feedback  » Supervised