Summary of When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, by Norah Alzahrani et al.
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan
First submitted to arXiv on: 1 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper challenges the reliance on large language model (LLM) leaderboards built from benchmark rankings, showing that minor perturbations to a benchmark can significantly reshuffle those rankings. Through systematic experiments on multiple-choice question benchmarks such as MMLU, the authors demonstrate ranking shifts of up to eight positions simply from changing the answer selection method or the order of answer choices. To address this, the study proposes best-practice recommendations, including a hybrid scoring method for answer selection (sketched in code after this table). By exposing the limitations of existing evaluation schemes, the paper paves the way for more robust benchmark evaluations and encourages practitioners to think critically about model selection. |
| Low | GrooveSquid.com (original content) | This research shows that popular language model leaderboards are not as reliable as we thought. The rankings can change just from small tweaks to how the questions and answer choices are presented. That means we should be careful when picking a language model based on its leaderboard position, because a model that ranks highly under one setup might rank much lower under a slightly different one. The authors ran experiments to find out why this happens, and they offer tips for making fairer comparisons between models. |
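To make the two ideas in the medium summary concrete, here is a minimal Python sketch, not the authors' implementation. The `shuffle_choices` helper illustrates the kind of choice-order perturbation the experiments apply, and `hybrid_score` illustrates one plausible way to blend symbol-based scoring (log-probability of the choice letter) with length-normalized answer-text scoring. The function names, the 50/50 `alpha` blend, and all log-prob values are hypothetical assumptions, not the formulation published in the paper.

```python
import random
from dataclasses import dataclass

# --- Perturbation: shuffle choice order while tracking the gold answer ---

def shuffle_choices(choices: list[str], gold_index: int, seed: int) -> tuple[list[str], int]:
    """Return a permuted copy of the choices and the gold answer's new index."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    permuted = [choices[i] for i in order]
    return permuted, order.index(gold_index)

# --- Answer selection: blend symbol-based and answer-text-based scoring ---

@dataclass
class OptionScores:
    symbol_logprob: float   # log P(choice letter | prompt), e.g. log P("B")
    text_logprob: float     # log P(full answer text | prompt)
    text_num_tokens: int    # token length of the answer text

def hybrid_score(opt: OptionScores, alpha: float = 0.5) -> float:
    """Average the symbol score with a length-normalized answer-text score."""
    normalized_text = opt.text_logprob / max(opt.text_num_tokens, 1)
    return alpha * opt.symbol_logprob + (1 - alpha) * normalized_text

def select_answer(options: dict[str, OptionScores]) -> str:
    """Pick the choice label with the highest hybrid score."""
    return max(options, key=lambda label: hybrid_score(options[label]))

if __name__ == "__main__":
    # Perturbation demo: the same question, with choices in a new order.
    choices = ["Paris", "London", "Berlin", "Madrid"]
    permuted, new_gold = shuffle_choices(choices, gold_index=0, seed=42)
    print(permuted, "-> gold answer is now option", "ABCD"[new_gold])

    # Scoring demo with made-up log-probs; in practice these come from the LLM.
    scores = {
        "A": OptionScores(-2.1, -14.0, 6),
        "B": OptionScores(-0.9, -9.5, 5),
        "C": OptionScores(-1.7, -12.2, 7),
        "D": OptionScores(-2.8, -16.3, 6),
    }
    print("selected:", select_answer(scores))
```

Under this kind of setup, a model whose answer flips when the same choices are re-lettered is exactly the sensitivity the paper measures, and combining two scoring signals is one way to make the selection less dependent on any single prompt format.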
Keywords
- Artificial intelligence
- Language model
- Large language model