Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

by Avelina Asada Hadji-Kyriacou, Ognjen Arandjelovic

First submitted to arXiv on: 30 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper targets the controllability of pre-trained language models, which show strong zero-shot and in-context learning capabilities but are difficult to steer. Reinforcement Learning from Human Feedback (RLHF) is the usual way to fine-tune such models to follow instructions, yet it has been shown to potentially harm their reasoning capabilities and to introduce artifacts such as hallucinations. To address this, the authors introduce Direct Preference Heads (DPH), a fine-tuning framework that lets a language model learn human preference signals through a separate preference head without affecting the output distribution of the language modeling head. A theoretical analysis of the objective function reveals strong ties to Conservative Direct Preference Optimization (cDPO). Evaluated on GLUE, RACE, and the GPT4All evaluation suite, models trained with DPH achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone. (A rough code sketch of the DPH idea appears after these summaries.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper helps make language models better at following instructions. Right now, these models are really good at learning new things on their own, but it is hard to control what they do. The usual fix is a special kind of training called Reinforcement Learning from Human Feedback (RLHF), but RLHF can sometimes make the models worse at reasoning and cause them to make up information. To address this, the paper proposes a new way to train the models called Direct Preference Heads (DPH). This method adds a small scoring part that learns what humans prefer without changing what the model says. The authors tested their approach on several big benchmarks and showed that it works better than other training methods alone.
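To make the DPH description above more concrete, here is a minimal, hypothetical PyTorch sketch of the general idea: a small scalar head scores pooled hidden states for human preference and is trained with a label-smoothed pairwise loss (in the spirit of cDPO), while the language-modeling head receives no preference gradients; at inference time, sampled candidates are reranked by the head's score, which is one plausible reading of "inference time alignment." The names (PreferenceHead, dph_loss, rerank), the last-token pooling, and the exact loss form are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only -- not the paper's code. Assumes hidden states are
# already produced by a language model whose LM head is left untouched.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceHead(nn.Module):
    """Maps a sequence of hidden states to a single scalar preference score."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Pool with the final token's hidden state (one simple choice).
        pooled = hidden_states[:, -1, :]         # (batch, hidden)
        return self.score(pooled).squeeze(-1)    # (batch,)


def dph_loss(chosen: torch.Tensor, rejected: torch.Tensor,
             label_smoothing: float = 0.1) -> torch.Tensor:
    """Label-smoothed pairwise preference loss, in the spirit of cDPO."""
    margin = chosen - rejected
    return -((1.0 - label_smoothing) * F.logsigmoid(margin)
             + label_smoothing * F.logsigmoid(-margin)).mean()


@torch.no_grad()
def rerank(candidate_hidden: torch.Tensor, head: PreferenceHead) -> int:
    """Pick the sampled candidate the preference head scores highest."""
    return int(head(candidate_hidden).argmax().item())


if __name__ == "__main__":
    hidden_size, seq_len, batch = 64, 16, 4
    head = PreferenceHead(hidden_size)

    # Toy training step on (chosen, rejected) pairs of hidden states.
    chosen_h = torch.randn(batch, seq_len, hidden_size)
    rejected_h = torch.randn(batch, seq_len, hidden_size)
    loss = dph_loss(head(chosen_h), head(rejected_h))
    loss.backward()  # only the head's parameters get preference gradients here

    # Toy inference-time reranking over three sampled candidates.
    candidates_h = torch.randn(3, seq_len, hidden_size)
    print("loss:", float(loss), "best candidate:", rerank(candidates_h, head))
```

In this sketch only the head is updated by the preference loss, so the next-token distribution of the language-modeling head stays untouched, which is the property the summaries above emphasize.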

Keywords

» Artificial intelligence  » Fine-tuning  » Objective function  » Optimization  » Reinforcement learning from human feedback  » RLHF  » Supervised  » Zero-shot