Summary of When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards, by Norah Alzahrani et al.
When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
by Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, M Saiful Bari, Haidar Khan
First submitted to arXiv on: 1 Feb 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper challenges the reliance on large language model (LLM) leaderboards built from benchmark rankings, showing that minor perturbations to a benchmark can significantly reshuffle those rankings. Through systematic experiments on multiple-choice question benchmarks such as MMLU, the authors demonstrate ranking shifts of up to eight positions simply from changing the answer selection method or the order of answer choices. To address this, the study proposes best-practice recommendations, including a hybrid scoring method for answer selection (sketched in code after this table). By exposing the limitations of existing evaluation schemes, the paper paves the way for more robust benchmark evaluations and encourages practitioners to think critically about model selection. |
| Low | GrooveSquid.com (original content) | This research shows that popular language model leaderboards are not as reliable as we thought. The rankings can change just from small tweaks to how the questions and answer choices are presented. That means we should be careful when picking a language model based on its leaderboard position, because a model that ranks highly under one setup might rank much lower under a slightly different one. The authors ran experiments to find out why this happens, and they offer tips for making fairer comparisons between models. |
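To make the two ideas in the medium summary concrete, here is a minimal Python sketch, not the authors' implementation. The `shuffle_choices` helper illustrates the kind of choice-order perturbation the experiments apply, and `hybrid_score` illustrates one plausible way to blend symbol-based scoring (log-probability of the choice letter) with length-normalized answer-text scoring. The function names, the 50/50 `alpha` blend, and all log-prob values are hypothetical assumptions, not the formulation published in the paper.

```python
import random
from dataclasses import dataclass

# --- Perturbation: shuffle choice order while tracking the gold answer ---

def shuffle_choices(choices: list[str], gold_index: int, seed: int) -> tuple[list[str], int]:
    """Return a permuted copy of the choices and the gold answer's new index."""
    rng = random.Random(seed)
    order = list(range(len(choices)))
    rng.shuffle(order)
    permuted = [choices[i] for i in order]
    return permuted, order.index(gold_index)

# --- Answer selection: blend symbol-based and answer-text-based scoring ---

@dataclass
class OptionScores:
    symbol_logprob: float   # log P(choice letter | prompt), e.g. log P("B")
    text_logprob: float     # log P(full answer text | prompt)
    text_num_tokens: int    # token length of the answer text

def hybrid_score(opt: OptionScores, alpha: float = 0.5) -> float:
    """Average the symbol score with a length-normalized answer-text score."""
    normalized_text = opt.text_logprob / max(opt.text_num_tokens, 1)
    return alpha * opt.symbol_logprob + (1 - alpha) * normalized_text

def select_answer(options: dict[str, OptionScores]) -> str:
    """Pick the choice label with the highest hybrid score."""
    return max(options, key=lambda label: hybrid_score(options[label]))

if __name__ == "__main__":
    # Perturbation demo: the same question, with choices in a new order.
    choices = ["Paris", "London", "Berlin", "Madrid"]
    permuted, new_gold = shuffle_choices(choices, gold_index=0, seed=42)
    print(permuted, "-> gold answer is now option", "ABCD"[new_gold])

    # Scoring demo with made-up log-probs; in practice these come from the LLM.
    scores = {
        "A": OptionScores(-2.1, -14.0, 6),
        "B": OptionScores(-0.9, -9.5, 5),
        "C": OptionScores(-1.7, -12.2, 7),
        "D": OptionScores(-2.8, -16.3, 6),
    }
    print("selected:", select_answer(scores))
```

Under this kind of setup, a model whose answer flips when the same choices are re-lettered is exactly the sensitivity the paper measures, and combining two scoring signals is one way to make the selection less dependent on any single prompt format.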
Keywords
- Artificial intelligence
- Language model
- Large language model