
Summary of metabench: A Sparse Benchmark of Reasoning and Knowledge in Large Language Models, by Alex Kipnis et al.


metabench – A Sparse Benchmark of Reasoning and Knowledge in Large Language Models

by Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, Eric Schulz

First submitted to arXiv on: 4 Jul 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Machine Learning (cs.LG); Machine Learning (stat.ML)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

The high difficulty version is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)

Large Language Models (LLMs) vary in their abilities across tasks. To quantify these differences, initiatives like the Open LLM Leaderboard evaluate models on large benchmarks. However, the high correlations between benchmark scores suggest that these benchmarks measure a small set of common underlying abilities and that many of their items carry redundant information. The paper uses data from over 5,000 LLMs to identify the most informative items in six benchmarks: ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande (28,632 items in total). Distilling these items yields a sparse benchmark called metabench, which is less than 3% of the combined size of the six original benchmarks. Beyond point scores, metabench provides estimators of the underlying abilities. The study shows that these estimators reconstruct the original scores with low error: a root mean square error (RMSE) of 1.24% for individual benchmark scores and 0.58% for the total score. The estimators also share a single underlying common factor that correlates strongly with the total score (r = 0.94).

Low Difficulty Summary (written by GrooveSquid.com, original content)

Large Language Models are like very smart computer programs that can do lots of things. Researchers want to figure out how good these models are at different tasks. To do this, they give the models big lists of questions and problems to solve. But when the results are compared, it turns out that many of the questions measure the same things. So the authors of this paper found a new way to look at the data and build a much smaller list that still shows what each model is good at. This short list makes it easier to tell which models are better than others.
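
To make the reconstruction figures in the medium difficulty summary more concrete, here is a minimal Python sketch of the general idea: score models on a small subset of items, predict the full-benchmark score from that subset, and measure the error with RMSE. This is a toy illustration on synthetic data, not the paper's procedure. The single-ability response model, the random 3% item subset, the linear fit, and all function names are assumptions made for the example.

```python
import numpy as np


def rmse(predicted, actual):
    """Root mean square error between two score vectors."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


def simulate_responses(n_models, n_items, rng):
    """Synthetic 0/1 answers driven by one latent ability per model (an assumption)."""
    ability = rng.normal(size=(n_models, 1))
    difficulty = rng.normal(size=(1, n_items))
    p_correct = 1.0 / (1.0 + np.exp(-(ability - difficulty)))
    return (rng.random((n_models, n_items)) < p_correct).astype(float)


def reconstruct_from_subset(responses, subset_idx, train_idx, test_idx):
    """Fit a line from subset score to full score on training models,
    then predict the full score of held-out models."""
    full = responses.mean(axis=1) * 100.0                  # full score in percent
    sparse = responses[:, subset_idx].mean(axis=1) * 100.0  # subset score in percent
    slope, intercept = np.polyfit(sparse[train_idx], full[train_idx], deg=1)
    predicted = slope * sparse[test_idx] + intercept
    return predicted, full[test_idx]


rng = np.random.default_rng(0)
responses = simulate_responses(n_models=500, n_items=1000, rng=rng)

# Hypothetical sparse benchmark: a random 3% of the items.
# A careful selection of informative items would be expected to do better.
subset_idx = rng.choice(1000, size=30, replace=False)
train_idx = np.arange(0, 400)
test_idx = np.arange(400, 500)

predicted, actual = reconstruct_from_subset(responses, subset_idx, train_idx, test_idx)
error = rmse(predicted, actual)
print(f"Held-out RMSE: {error:.2f} percentage points")
```

Because every synthetic model here is driven by a single ability, even a small random subset carries much of the information in the full score, which mirrors the redundancy argument in the summary. The error printed by this toy setup says nothing about the error rates reported in the paper.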

Keywords

  • Artificial intelligence