
Summary of Quantifying Variance in Evaluation Benchmarks, by Lovish Madaan et al.


Quantifying Variance in Evaluation Benchmarks

by Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, Dieuwke Hupkes

First submitted to arXiv on: 14 Jun 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper investigates the role of evaluation benchmarks in measuring the capabilities of large language models (LLMs) and driving progress in their development. Originally designed to assess fully pretrained models, these benchmarks are now also used to compare different training choices. Despite this widespread usage, however, the variance in benchmark results is rarely quantified, which makes it unclear whether observed performance differences between models are meaningful. To address this, the authors define and measure a range of metrics that quantify variance in evaluation benchmarks, including seed variance (variation across models trained with different random seeds) and monotonicity during training (how steadily scores improve over the course of training). They provide empirical estimates of these metrics for a large number of models, both openly available ones and models pretrained from scratch. They also evaluate the utility and tradeoffs of continuous versus discrete performance measures and explore options for better understanding and reducing this variance.
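
To make these variance metrics more concrete, the sketch below is our own illustration (not the paper’s code) of two quantities in the spirit of those described above: seed variance, taken here as the standard deviation of the final benchmark score across training seeds, and monotonicity, taken here as the Spearman rank correlation between checkpoint index and score, averaged over seeds. The score grid and the use of NumPy/SciPy are assumptions made purely for illustration.

```python
# Minimal sketch (not the paper's code): toy estimates of seed variance and
# monotonicity from a grid of benchmark scores.
import numpy as np
from scipy.stats import spearmanr

# scores[i, j] = benchmark accuracy of the model trained with seed i,
# evaluated at training checkpoint j (made-up numbers for illustration).
scores = np.array([
    [0.31, 0.38, 0.41, 0.44, 0.47],
    [0.29, 0.36, 0.43, 0.42, 0.48],
    [0.33, 0.35, 0.40, 0.45, 0.46],
])

# Seed variance: spread of the final-checkpoint score across training seeds.
seed_std = scores[:, -1].std(ddof=1)

# Monotonicity: Spearman correlation between checkpoint index and score,
# averaged over seeds (1.0 would mean scores only ever improve in training).
checkpoints = np.arange(scores.shape[1])
monotonicity = np.mean([spearmanr(checkpoints, s).correlation for s in scores])

print(f"seed std of final score: {seed_std:.3f}")            # ~0.010
print(f"mean monotonicity (Spearman rho): {monotonicity:.3f}")
```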

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper looks at how we measure the abilities of big language models (LLMs). We use special tests, called benchmarks, to see how well these models can do things like answer questions or complete tasks. But sometimes these tests don’t work quite as expected: very similar models, or even the same model trained with a different random seed, can get noticeably different scores. The authors of this paper want to know why that happens and what we can do about it. They came up with ways to measure how much these test scores can vary, and they found that small changes in how we ask and score the models can reduce that variation. This means that when we compare language models, we need to think carefully about how we’re testing them.
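
One concrete example of such a change, mentioned in the medium-difficulty summary above, is switching between discrete and continuous performance measures. The sketch below is our own illustration (not the paper’s code): it contrasts exact-match accuracy on multiple-choice questions with a continuous score based on the probability a model assigns to the correct option. All numbers are made up.

```python
# Illustration (not from the paper): discrete vs. continuous scoring of the
# same hypothetical multiple-choice predictions.
import numpy as np

# option_probs[i, k]: probability the model assigns to option k of question i;
# correct[i] is the index of the right answer (all values are made up).
option_probs = np.array([
    [0.40, 0.35, 0.15, 0.10],
    [0.30, 0.32, 0.20, 0.18],
    [0.70, 0.10, 0.10, 0.10],
    [0.25, 0.25, 0.26, 0.24],
])
correct = np.array([0, 0, 0, 2])

# Discrete measure: exact-match accuracy (is the correct option the argmax?).
discrete = (option_probs.argmax(axis=1) == correct).mean()

# Continuous measure: mean probability assigned to the correct option.
continuous = option_probs[np.arange(len(correct)), correct].mean()

print(f"discrete accuracy (exact match):           {discrete:.2f}")   # 0.75
print(f"continuous score (prob of correct option): {continuous:.2f}")  # ~0.42
```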

Keywords

* Artificial intelligence