Summary of ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities, by Adhiraj Ghosh et al.


ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities

by Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge

First submitted to arXiv on: 9 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper proposes ONEBench, a new testing paradigm for evaluating the open-ended capabilities of foundation models. Traditional fixed test sets fall short in assessing the diverse capabilities of these models. To address this limitation, ONEBench consolidates individual evaluation datasets into a unified, ever-expanding sample pool, from which users can generate custom, open-ended benchmarks corresponding to specific capabilities of interest (an illustrative sketch follows these summaries). By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper creates a new way to test how well foundation models can do things that aren't just about memorizing facts. Right now we rely on fixed tests, but they are limited and don't show everything a model can do. The authors propose ONEBench, which gathers many different evaluation datasets into one big pool. This lets users build custom benchmarks that target the specific skills they care about. By combining samples from all these tests, ONEBench shows how well a model does across many different tasks, while also making it harder for the model to simply memorize answers.

Keywords

» Artificial intelligence  » Overfitting