
Summary of Aligning CodeLLMs with Direct Preference Optimization, by Yibo Miao et al.


Aligning CodeLLMs with Direct Preference Optimization

by Yibo Miao, Bofei Gao, Shanghaoran Quan, Junyang Lin, Daoguang Zan, Jiaheng Liu, Jian Yang, Tianyu Liu, Zhijie Deng

First submitted to arXiv on: 24 Oct 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper focuses on improving large language models (LLMs) built to assist with programming tasks, known as CodeLLMs. These models can exhibit decision-making and logical reasoning capabilities. While most existing work concentrates on pre-training and supervised fine-tuning, this work studies the alignment stage of post-training. The authors argue that the commonly used PPO algorithm may be suboptimal there because its reward rules are coarse-grained, and they instead adopt the DPO algorithm, which relies on preference data pairs to provide a fine-grained reward signal. They also present a pipeline for collecting preference pairs for DPO. Experimental results show significant performance improvements for existing CodeLLMs on benchmarks such as MBPP and HumanEval.
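
To make the contrast between PPO's coarse-grained reward rules and DPO's pairwise preference signal concrete, here is a minimal PyTorch sketch of the standard DPO objective. This is an illustration, not the paper's implementation: the function name, the beta value, and the assumption that per-sequence log-probabilities (summed over tokens) are already available for both the trained policy and a frozen reference model are all ours.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a tensor of per-sequence log-probabilities
    (summed over tokens) for the preferred ("chosen") or dispreferred
    ("rejected") completion under the trained policy or the frozen
    reference model.
    """
    # Implicit rewards: scaled log-ratios of policy to reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred completions.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example usage with dummy log-probabilities for a batch of 4 pairs.
policy_c, policy_r = torch.randn(4), torch.randn(4)
ref_c, ref_r = torch.randn(4), torch.randn(4)
print(dpo_loss(policy_c, policy_r, ref_c, ref_r))
```

In a code-generation setting, one plausible way to obtain such pairs (an assumption on our part, not a detail given in this summary) is to sample several candidate solutions per prompt and pair a completion that passes the task's unit tests with one that fails them.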

Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about making computer programming easier with special language models called CodeLLMs, which can help with things like writing code and solving problems. Today these models are mostly improved through pre-training and supervised fine-tuning, but this research explores an extra training step, called alignment, that teaches a model which of its answers people prefer. The authors found that the usual method for this step might not work as well as it could, so they use a different approach called DPO. They also built a system to collect the preference data this approach needs. The results show that their method improves how well CodeLLMs do their job.

Keywords

» Artificial intelligence  » Alignment  » Fine tuning  » Supervised