Are We Done with MMLU?

by Aryo Pradipta Gema, Joshua Ong Jun Leang, Giwon Hong, Alessio Devoto, Alberto Carlo Maria Mancino, Rohit Saxena, Xuanli He, Yu Zhao, Xiaotang Du, Mohammad Reza Ghasemi Madani, Claire Barale, Robert McHardy, Joshua Harris, Jean Kaddour, Emile van Krieken, Pasquale Minervini

First submitted to arXiv on: 6 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (original GrooveSquid.com content)
The paper identifies and analyzes errors in the Massive Multitask Language Understanding (MMLU) benchmark, which is widely adopted in the language modeling community. The analysis reveals numerous ground-truth errors that obscure the true capabilities of large language models (LLMs): in the Virology subset, 57% of the analyzed questions contain errors. To address this issue, the authors introduce a comprehensive framework for identifying dataset errors using a novel error annotation protocol, and apply it to create MMLU-Redux, a dataset of 5,700 manually re-annotated questions spanning all 57 MMLU subjects. From this re-annotation, they estimate that 6.49% of MMLU questions contain errors. Using MMLU-Redux, the authors demonstrate significant discrepancies between model performance on the corrected questions and the originally reported metrics, and advocate revising MMLU’s error-ridden questions to enhance its future utility and reliability as a benchmark.
Low Difficulty Summary (original GrooveSquid.com content)
The paper looks at problems in a popular test used to grade language models. It finds that many of the “answers” in this test are actually wrong, which means we can’t trust the scores models get on it. The researchers fix some of these errors and then test language models again to see how they do. They find big differences between what the models were originally said to be able to do and what they can really do. This means the test needs to be fixed so that it’s more reliable.

Keywords

  • Artificial intelligence
  • Language understanding