Summary of ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities, by Adhiraj Ghosh et al.
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
by Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
First submitted to arXiv on: 9 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | This paper proposes ONEBench, a new testing paradigm for evaluating the open-ended capabilities of foundation models. Traditional fixed test sets fall short in assessing the diverse capabilities of these models. To address this limitation, ONEBench consolidates individual evaluation datasets into a unified, ever-expanding sample pool, from which users can generate custom, open-ended evaluation benchmarks targeting specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias (a minimal code sketch of this pooling idea follows the table). |
Low | GrooveSquid.com (original content) | This paper creates a new way to test how well foundation models can do things that aren’t just about memorizing facts. Right now we use fixed tests, but these tests are limited and don’t show us everything a model can do. The authors propose ONEBench, which takes lots of different evaluation datasets and puts them all together into one big pool. This lets users create custom tests that target the specific skills they care about. By combining all these tests, ONEBench shows how well a model does on many different tasks, while also making sure it isn’t just memorizing answers. |
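
To make the pooling idea in the medium summary concrete, here is a minimal sketch, not the authors' implementation: it assumes hypothetical per-sample capability tags and sparse per-model correctness records, consolidates samples from several test sets into one pool, filters a custom benchmark by capability, and ranks models by mean accuracy over the selected samples. The class names, tags, and the mean-accuracy aggregation are illustrative assumptions; the paper describes its own way of aggregating sparse, heterogeneous measurements.

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Sample:
    """One evaluation sample drawn from some source test set."""
    sample_id: str
    source_dataset: str
    capabilities: set                      # hypothetical capability tags, e.g. {"counting"}
    results: dict = field(default_factory=dict)  # sparse per-model correctness records

class SamplePool:
    """An ever-growing pool of samples consolidated from many test sets."""

    def __init__(self):
        self.samples = []

    def add_dataset(self, samples):
        """Merge another test set's samples into the pool."""
        self.samples.extend(samples)

    def build_benchmark(self, capability):
        """Select a custom benchmark: all samples tagged with the requested capability."""
        return [s for s in self.samples if capability in s.capabilities]

def rank_models(benchmark):
    """Aggregate sparse per-sample results into a per-model score.

    Here: mean accuracy over the samples each model was actually scored on
    (a simplification, stand-in for the paper's aggregation method).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for sample in benchmark:
        for model, correct in sample.results.items():
            totals[model] += float(correct)
            counts[model] += 1
    return sorted(
        ((model, totals[model] / counts[model]) for model in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Example: pool two hypothetical test sets, then query a "counting" benchmark.
pool = SamplePool()
pool.add_dataset([
    Sample("vqa-001", "vqa", {"counting"}, {"model_a": True, "model_b": False}),
    Sample("vqa-002", "vqa", {"ocr"}, {"model_a": True}),
])
pool.add_dataset([
    Sample("math-101", "math", {"counting", "arithmetic"}, {"model_b": True}),
])
print(rank_models(pool.build_benchmark("counting")))
```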
Keywords
- Artificial intelligence
- Overfitting