
Summary of Language Model Preference Evaluation with Multiple Weak Evaluators, by Zhengyu Hu et al.


Language Model Preference Evaluation with Multiple Weak Evaluators

by Zhengyu Hu, Jieyu Zhang, Zhihan Xiong, Alexander Ratner, Hui Xiong, Ranjay Krishna

First submitted to arXiv on: 14 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.
Medium Difficulty Summary (GrooveSquid.com original content)
Despite the success of Large Language Models (LLMs), evaluating the preference quality of their outputs remains a critical challenge. Existing work relies on a single LLM as the judge, but this approach is flawed because individual judges carry conflicting preferences. To address this, the paper introduces GED (Preference Graph Ensemble and Denoise), a method that uses multiple model-based evaluators to construct preference graphs and then ensembles them for better evaluation results. The framework has two stages: aggregating the evaluators' judgments into a unified preference graph, and denoising that graph to eliminate inconsistencies. The authors provide theoretical guarantees that this procedure recovers the ground-truth preference structure, and extensive experiments on ten benchmarks show GED's superiority in model ranking, response selection, and model alignment, where combinations of small evaluators outperform stronger single LLM judges.
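The abstract describes GED's two stages only at a high level. As a rough illustration, here is a minimal Python sketch of such a pipeline; the function names, the data layout, and the greedy cycle-breaking heuristic standing in for denoising are assumptions made for this sketch, not the paper's actual algorithm.

```python
from collections import Counter

def aggregate(evaluations):
    """Stage 1: merge each weak evaluator's pairwise judgments into one
    weighted preference graph. Edge (a, b) with weight w means w
    evaluators preferred response a over response b."""
    graph = Counter()
    for prefs in evaluations:              # one (winner, loser) list per evaluator
        graph.update(prefs)
    return graph

def denoise(graph):
    """Stage 2 (stand-in heuristic): keep the majority direction for each
    pair, then add edges heaviest-first, skipping any edge that would
    close a cycle, so the surviving preference graph is acyclic."""
    kept = {}
    for (a, b), w in graph.items():
        if graph[(b, a)] < w:              # majority vote; ties drop both directions
            kept[(a, b)] = w - graph[(b, a)]

    def reaches(adj, src, dst):            # simple DFS reachability check
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(adj.get(node, ()))
        return False

    adj = {}
    for (a, b), w in sorted(kept.items(), key=lambda kv: -kv[1]):
        if not reaches(adj, b, a):         # a -> b is safe only if b can't already reach a
            adj.setdefault(a, set()).add(b)
    return adj
```

Because the denoised graph is acyclic, a topological sort of it yields a single consensus ranking of the responses.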
Low Difficulty Summary (GrooveSquid.com original content)
Researchers have a hard time figuring out which Large Language Model (LLM) answers are best. The usual approach is to ask a single LLM to judge, but that works poorly because the judging LLM has its own biases. To fix this, the paper's authors created a method called GED that lets several weaker LLMs vote on which answers are better. It works in two steps: first it combines what all the evaluators say into one graph, then it cleans up any contradictions in that graph. Tests on many different datasets show the method works really well at ranking models, picking the best responses, and aligning models.
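To make the two steps concrete, here is a toy run of the sketch above (hypothetical input invented for illustration, not data from the paper), where one evaluator's vote contradicts the others and creates a cycle:

```python
# Three weak evaluators rank responses A, B, C. The third disagrees and
# introduces a cycle (it prefers C over A).
evals = [
    [("A", "B"), ("B", "C")],
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "A")],
]
print(denoise(aggregate(evals)))
# {'B': {'C'}, 'A': {'B'}}  -- the cycle edge C -> A was dropped,
# leaving the consistent ordering A > B > C.
```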

Keywords

» Artificial intelligence  » Alignment