Summary of ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities, by Adhiraj Ghosh et al.
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities
by Adhiraj Ghosh, Sebastian Dziadzio, Ameya Prabhu, Vishaal Udandarao, Samuel Albanie, Matthias Bethge
First submitted to arXiv on: 9 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract on arXiv. |
Medium | GrooveSquid.com (original content) | This paper proposes ONEBench, a new testing paradigm for evaluating the open-ended capabilities of foundation models. Traditional fixed test sets fall short in assessing the diverse capabilities of these models. To address this limitation, ONEBench consolidates individual evaluation datasets into a unified, ever-expanding sample pool, from which users can generate custom, open-ended evaluation benchmarks targeting specific capabilities of interest. By aggregating samples across test sets, ONEBench enables the assessment of diverse capabilities beyond those covered by the original test sets, while mitigating overfitting and dataset bias (a minimal code sketch of this pooling idea follows the table). |
Low | GrooveSquid.com (original content) | This paper creates a new way to test how well foundation models can do things that aren’t just about memorizing facts. Right now we use fixed tests, but these tests are limited and don’t show us everything a model can do. The authors propose ONEBench, which takes lots of different evaluation datasets and puts them all together into one big pool. This lets users create custom tests that target the specific skills they care about. By combining all these tests, ONEBench shows how well a model does on many different tasks, while also making sure it isn’t just memorizing answers. |
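
To make the pooling idea in the medium summary concrete, here is a minimal sketch, not the authors' implementation: it assumes hypothetical per-sample capability tags and sparse per-model correctness records, consolidates samples from several test sets into one pool, filters a custom benchmark by capability, and ranks models by mean accuracy over the selected samples. The class names, tags, and the mean-accuracy aggregation are illustrative assumptions; the paper describes its own way of aggregating sparse, heterogeneous measurements.

```python
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class Sample:
    """One evaluation sample drawn from some source test set."""
    sample_id: str
    source_dataset: str
    capabilities: set                      # hypothetical capability tags, e.g. {"counting"}
    results: dict = field(default_factory=dict)  # sparse per-model correctness records

class SamplePool:
    """An ever-growing pool of samples consolidated from many test sets."""

    def __init__(self):
        self.samples = []

    def add_dataset(self, samples):
        """Merge another test set's samples into the pool."""
        self.samples.extend(samples)

    def build_benchmark(self, capability):
        """Select a custom benchmark: all samples tagged with the requested capability."""
        return [s for s in self.samples if capability in s.capabilities]

def rank_models(benchmark):
    """Aggregate sparse per-sample results into a per-model score.

    Here: mean accuracy over the samples each model was actually scored on
    (a simplification, stand-in for the paper's aggregation method).
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for sample in benchmark:
        for model, correct in sample.results.items():
            totals[model] += float(correct)
            counts[model] += 1
    return sorted(
        ((model, totals[model] / counts[model]) for model in totals),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Example: pool two hypothetical test sets, then query a "counting" benchmark.
pool = SamplePool()
pool.add_dataset([
    Sample("vqa-001", "vqa", {"counting"}, {"model_a": True, "model_b": False}),
    Sample("vqa-002", "vqa", {"ocr"}, {"model_a": True}),
])
pool.add_dataset([
    Sample("math-101", "math", {"counting", "arithmetic"}, {"model_b": True}),
])
print(rank_models(pool.build_benchmark("counting")))
```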
Keywords
- Artificial intelligence
- Overfitting