Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

by Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic

First submitted to arXiv on: 30 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper targets the controllability of pre-trained language models, which show strong zero-shot and in-context learning capabilities but are difficult to steer. Reinforcement Learning from Human Feedback (RLHF) is the usual way to fine-tune such models to follow instructions, yet it has been shown to potentially harm their reasoning capabilities and to introduce artifacts such as hallucinations. To address this, the authors introduce Direct Preference Heads (DPH), a fine-tuning framework that lets a language model learn human preference signals through a separate preference head without affecting the output distribution of the language modeling head. A theoretical analysis of the objective function reveals strong ties to Conservative Direct Preference Optimization (cDPO). Evaluated on GLUE, RACE, and the GPT4All evaluation suite, models trained with DPH achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone. (A rough code sketch of the DPH idea appears after these summaries.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper helps make language models better at following instructions. Right now, these models are really good at learning new things on their own, but it is hard to control what they do. The usual fix is a special kind of training called Reinforcement Learning from Human Feedback (RLHF), but RLHF can sometimes make the models worse at reasoning and cause them to make up information. To address this, the paper proposes a new way to train the models called Direct Preference Heads (DPH). This method adds a small scoring part that learns what humans prefer without changing what the model says. The authors tested their approach on several big benchmarks and showed that it works better than other training methods alone.
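To make the DPH description above more concrete, here is a minimal, hypothetical PyTorch sketch of the general idea: a small scalar head scores pooled hidden states for human preference and is trained with a label-smoothed pairwise loss (in the spirit of cDPO), while the language-modeling head receives no preference gradients; at inference time, sampled candidates are reranked by the head's score, which is one plausible reading of "inference time alignment." The names (PreferenceHead, dph_loss, rerank), the last-token pooling, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. Assumes hidden states are
# already produced by a language model whose LM head is left untouched.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceHead(nn.Module):
    """Maps a sequence of hidden states to a single scalar preference score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pool with the final token's hidden state (one simple choice).
        pooled = hidden_states[:, -1, :]         # (batch, hidden)
        return self.score(pooled).squeeze(-1)    # (batch,)


def dph_loss(chosen: torch.Tensor, rejected: torch.Tensor,
             label_smoothing: float = 0.1) -> torch.Tensor:
    """Label-smoothed pairwise preference loss, in the spirit of cDPO."""
    margin = chosen - rejected
    return -((1.0 - label_smoothing) * F.logsigmoid(margin)
             + label_smoothing * F.logsigmoid(-margin)).mean()


@torch.no_grad()
def rerank(candidate_hidden: torch.Tensor, head: PreferenceHead) -> int:
    """Pick the sampled candidate the preference head scores highest."""
    return int(head(candidate_hidden).argmax().item())


if __name__ == "__main__":
    hidden_size, seq_len, batch = 64, 16, 4
    head = PreferenceHead(hidden_size)

    # Toy training step on (chosen, rejected) pairs of hidden states.
    chosen_h = torch.randn(batch, seq_len, hidden_size)
    rejected_h = torch.randn(batch, seq_len, hidden_size)
    loss = dph_loss(head(chosen_h), head(rejected_h))
    loss.backward()  # only the head's parameters get preference gradients here

    # Toy inference-time reranking over three sampled candidates.
    candidates_h = torch.randn(3, seq_len, hidden_size)
    print("loss:", float(loss), "best candidate:", rerank(candidates_h, head))
```

In this sketch only the head is updated by the preference loss, so the next-token distribution of the language-modeling head stays untouched, which is the property the summaries above emphasize.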

Keywords

» Artificial intelligence  » Fine-tuning  » Objective function  » Optimization  » Reinforcement learning from human feedback  » RLHF  » Supervised  » Zero-shot