Ranking Large Language Models without Ground Truth

by Amit Dhurandhar, Rahul Nair, Moninder Singh, Elizabeth Daly, Karthikeyan Natesan Ramamurthy

First submitted to arXiv on: 21 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper addresses the problem of evaluating and ranking large language models (LLMs) without relying on human responses or unreliable peer evaluation. The authors propose a novel approach based on triplets of models: each model in a triplet evaluates the other two, and the worst of the three is identified with high probability. They analyze the idea, provide sufficient conditions for it to succeed, and then apply it repeatedly to derive two methods for ranking LLMs (a minimal sketch of the procedure appears below). In experiments on different generative tasks, these methods reliably recover rankings close to the true ones without any reference data, making them a viable low-resource mechanism for practical use.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about how to compare and rank big language models without needing human help or relying on other models that might not be accurate. The idea is to look at groups of three models, where each model says which of the other two is worse. This identifies the weakest model with high accuracy. By repeating this process, the authors come up with ways to rank models without needing any reference data. They test these methods on tasks like summarization and multiple-choice question answering, and the methods work well.

Keywords

* Artificial intelligence  * Probability  * Summarization