Summary of GAOKAO-Eval: Does High Scores Truly Reflect Strong Capabilities in LLMs?, by Zhikai Lei et al.


GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?

by Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo

First submitted to arXiv on: 13 Dec 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty summary is the paper's original abstract.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large Language Models (LLMs) are often evaluated with human-crafted benchmarks, on the assumption that higher scores imply stronger, more human-like performance. However, there are concerns that LLMs might “game” these benchmarks through data leakage, achieving high scores while still struggling with simple tasks. To address this, the authors create GAOKAO-Eval, a comprehensive benchmark based on China’s National College Entrance Examination (Gaokao), and run closed-book evaluations of representative models released before the exam. Surprisingly, even after ruling out data leakage and ensuring comprehensive coverage, high scores still fail to reflect human-aligned capabilities. To understand this mismatch, the authors introduce the Rasch model from cognitive psychology (sketched briefly after these summaries) to analyze LLM scoring patterns, and they identify two key discrepancies: anomalously consistent performance across question difficulties, and high variance in performance on questions of similar difficulty. They also find that teachers grade LLM-generated answers inconsistently and that the models show recurring mistake patterns. The results show that GAOKAO-Eval can reveal limitations in LLM capabilities that current benchmarks miss, and they highlight the need for more LLM-aligned difficulty analysis.

Low Difficulty Summary (written by GrooveSquid.com, original content)
LLMs are often tested with exams written by humans, on the assumption that higher scores mean they are getting better at human-like tasks. But some people worry that the models may have already seen these test questions during training, so they could get high scores without actually being good at the tasks. To fix this problem, the researchers created a new test called GAOKAO-Eval, based on a real college entrance exam in China, and they tested several AI models that were released before the exam took place. Even with these precautions, high scores still did not show how well the models could handle human-like tasks. The authors used a tool from psychology, the Rasch model, to understand why. They found two main problems: the models do about equally well on easy and hard questions, yet their results swing widely on questions of similar difficulty, and teachers do not always agree on how to grade the models’ answers. This new test can help us understand what AI models can really do and what they still need to work on.

Keywords

» Artificial intelligence