Summary of Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges, by Aman Singh Thakur et al.
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes
First submitted to arXiv on: 18 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract on arXiv |
| Medium | GrooveSquid.com (original content) | The LLM-as-a-judge paradigm for evaluating large language models (LLMs) is gaining traction, but its strengths and weaknesses remain unclear. This paper investigates how well various LLMs perform as judges in a clean scenario with high inter-human agreement. Thirteen judge models were evaluated on their alignment with human graders when judging answers from nine exam-taker models. The results show that only the largest and best models achieve reasonable alignment, and even they fall short of the agreement humans reach with each other. Smaller models can still provide a reasonable signal for ranking exam-takers, and lexical metrics may also contain valuable information. However, the judge models show vulnerabilities, including sensitivity to prompt complexity and length and a tendency toward leniency. The findings suggest that caution is warranted when using LLM judges in more complex settings. (A rough sketch of how judge-human alignment can be measured follows the table.) |
| Low | GrooveSquid.com (original content) | Large language models (LLMs) are being used to evaluate other LLMs, but how well do they work? This paper looks at 13 different judge models that grade answers from 9 other LLMs. The results show that only the biggest and best judge models come close to human graders, and even those can be off. Smaller models can still give useful information about which LLMs answer better or worse. We also found problems with these judge models, like how their scores change depending on how the instructions are worded and how long they are, and a tendency to be too generous. This suggests we should be careful about using them in more complicated situations. |
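
To make the idea of "alignment with humans" concrete, here is a minimal sketch of how a judge model's verdicts could be compared against human verdicts on the same exam-taker answers, using percent agreement and Cohen's kappa. The score lists are made-up placeholders and the exact metrics are an assumption for illustration, not the paper's own evaluation pipeline.

```python
# Minimal sketch (not the paper's code): comparing an LLM judge's verdicts
# with human verdicts on the same exam-taker answers.
# The score lists are placeholders; 1 = "correct", 0 = "incorrect".
from sklearn.metrics import cohen_kappa_score

human_scores = [1, 0, 1, 1, 0, 1, 0, 1]
judge_scores = [1, 0, 1, 0, 0, 1, 1, 1]

# Percent agreement: fraction of answers where judge and humans agree.
percent_agreement = sum(
    h == j for h, j in zip(human_scores, judge_scores)
) / len(human_scores)

# Cohen's kappa: agreement corrected for chance, which matters when a
# lenient judge labels most answers "correct".
kappa = cohen_kappa_score(human_scores, judge_scores)

print(f"Percent agreement: {percent_agreement:.2f}")
print(f"Cohen's kappa:     {kappa:.2f}")
```

A chance-corrected metric such as Cohen's kappa is useful here because a judge that is systematically lenient can score high on raw agreement while still disagreeing with humans on the hard cases.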
Keywords
» Artificial intelligence » Alignment » Prompt