Summary of metabench – A Sparse Benchmark of Reasoning and Knowledge in Large Language Models, by Alex Kipnis et al.
metabench – A Sparse Benchmark of Reasoning and Knowledge in Large Language Models
by Alex Kipnis, Konstantinos Voudouris, Luca M. Schulze Buschoff, Eric Schulz
First submitted to arXiv on: 4 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Machine Learning (cs.LG); Machine Learning (stat.ML)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Large Language Models (LLMs) vary in their abilities across tasks. To quantify these differences, initiatives like the Open LLM Leaderboard evaluate models on large benchmarks. However, the high correlations between benchmark scores suggest that the benchmarks measure a small set of common underlying abilities and that many of their items carry redundant information. The authors use data from over 5,000 LLMs to identify the most informative items from six benchmarks: ARC, GSM8K, HellaSwag, MMLU, TruthfulQA, and WinoGrande (28,632 items in total). From these items they distill a sparse benchmark, metabench, that is less than 3% of the size of the six original benchmarks combined. Beyond point scores, metabench yields estimators of the underlying abilities. The study shows that these estimators reconstruct the original scores with low root mean square error (RMSE): 1.24% for individual benchmark scores and 0.58% for the total score. The estimators also share a single underlying common factor, which correlates strongly with the total score (r = 0.94). |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Large Language Models are computer programs that can do many different things with text. Researchers want to know how good each model is at different tasks. To find out, they give the models big lists of questions or problems to solve. But when they compare the results, models that do well on one list tend to do well on the others, which means the lists are mostly measuring the same few skills. So the authors found a new way to look at this data and built a much smaller list of questions that still tells us what each model is good at. This shorter list makes it faster to see which models are better than others. |
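To make the numbers in the medium summary more concrete, here is a minimal Python sketch of two calculations behind them: the largest item count that stays under 3% of the 28,632 original items, and the root mean square error (RMSE) between original and reconstructed scores. The function names and the example scores are hypothetical and chosen for illustration only; this is not the authors' code, and the summary does not state the exact size of metabench.

```python
import math

TOTAL_ITEMS = 28_632  # combined item count of the six benchmarks (from the summary)
MAX_FRACTION = 0.03   # "less than 3%" of the original size (from the summary)


def max_sparse_items(total_items, fraction):
    """Largest whole item count strictly below fraction * total_items."""
    bound = total_items * fraction
    count = math.floor(bound)
    # If the bound is itself a whole number, step down to stay strictly below it.
    return count - 1 if count == bound else count


def rmse(original, reconstructed):
    """Root mean square error between two equal-length score sequences."""
    if len(original) != len(reconstructed):
        raise ValueError("score lists must have the same length")
    if not original:
        raise ValueError("score lists must not be empty")
    squared = [(o - r) ** 2 for o, r in zip(original, reconstructed)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical scores (percent correct) for four made-up models.
original_scores = [62.0, 71.5, 55.0, 80.0]
reconstructed_scores = [61.0, 72.5, 56.0, 79.0]

print(max_sparse_items(TOTAL_ITEMS, MAX_FRACTION))  # 858
print(rmse(original_scores, reconstructed_scores))  # 1.0
```

In this made-up example every reconstructed score is off by exactly one percentage point, so the RMSE is 1.0. The reported 1.24% and 0.58% figures are errors of the same kind, computed by the authors on real model scores.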