Summary of Improving Model Evaluation using SMART Filtering of Benchmark Datasets, by Vipul Gupta et al.


Improving Model Evaluation using SMART Filtering of Benchmark Datasets

by Vipul Gupta, Candace Ross, David Pantoja, Rebecca J. Passonneau, Megan Ung, Adina Williams

First submitted to arXiv on: 26 Oct 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, written by the paper authors)
Read the original abstract on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper addresses the pressing issue of benchmark saturation in NLP evaluation by proposing Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering. SMART selects a higher-quality subset of an existing benchmark dataset by removing less informative and less challenging examples according to three filtering criteria: (i) easy examples, (ii) potentially data-contaminated examples, and (iii) examples that are highly similar to one another based on distance in an embedding space. On multiple-choice QA datasets, SMART reduces dataset size by 48% on average while increasing Pearson correlation with model rankings from ChatBot Arena. The method enables more efficient evaluation and can make new benchmarks more challenging or revitalize older datasets, all without affecting relative model rankings.
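
To make the third filtering criterion concrete, below is a minimal sketch of removing near-duplicate examples by distance in an embedding space. This is not the authors' implementation: the sentence-transformers model, the greedy keep-first strategy, and the 0.9 cosine-similarity threshold are all illustrative assumptions, and the easy-example and contamination filters are not shown.

```python
# Minimal sketch of an embedding-based near-duplicate filter.
# Assumptions (not from the paper): all-MiniLM-L6-v2 embeddings,
# greedy keep-first selection, cosine-similarity threshold of 0.9.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


def filter_near_duplicates(questions, threshold=0.9):
    """Keep a question only if it is not too similar to any already-kept question."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(questions)  # shape (n_questions, dim)

    kept = []  # indices of questions retained so far
    for i in range(len(questions)):
        if kept:
            # Cosine similarity between this question and every kept question.
            sims = cosine_similarity(embeddings[i : i + 1], embeddings[kept])[0]
            if sims.max() >= threshold:
                continue  # too close to an example we already kept; drop it
        kept.append(i)
    return [questions[i] for i in kept]


# Toy usage: the two rephrasings of the same question should collapse to one.
questions = [
    "What is the capital of France?",
    "Which city is the capital of France?",
    "Who wrote 'Pride and Prejudice'?",
]
print(filter_near_duplicates(questions))
```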

Low Difficulty Summary (original content by GrooveSquid.com)
This paper helps solve a big problem in evaluating language models. The issue is that benchmark datasets contain many examples that are too easy, repetitive, or otherwise not very useful. To fix this, the authors created a new way to pick the best examples from existing datasets. They use three rules: get rid of easy questions, remove questions the models may have already seen during training, and remove questions that are too similar to each other. By doing this, they can make new benchmarks more challenging or breathe new life into old datasets without changing how the models rank against each other.

Keywords

» Artificial intelligence  » Embedding space  » NLP