Summary of GAOKAO-Eval: Does High Scores Truly Reflect Strong Capabilities in LLMs?, by Zhikai Lei et al.
GAOKAO-Eval: Does high scores truly reflect strong capabilities in LLMs?
by Zhikai Lei, Tianyi Liang, Hanglei Hu, Jin Zhang, Yunhua Zhou, Yunfan Shao, Linyang Li, Chenchui Li, Changbo Wang, Hang Yan, Qipeng Guo
First submitted to arXiv on: 13 Dec 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The high-difficulty summary is the paper's original abstract, available on arXiv. |
| Medium | GrooveSquid.com (original content) | Large Language Models (LLMs) are often evaluated using human-crafted benchmarks, on the assumption that higher scores imply stronger human-like performance. However, concerns arise that LLMs might “game” these benchmarks due to data leakage, achieving high scores while struggling with simple tasks. To address this issue, the authors create GAOKAO-Eval, a comprehensive benchmark based on China’s National College Entrance Examination (Gaokao), and conduct closed-book evaluations of representative models released prior to the Gaokao. Surprisingly, even after data leakage and comprehensiveness are addressed, high scores fail to truly reflect human-aligned capabilities. To better understand this mismatch, the authors introduce the Rasch model from cognitive psychology to analyze LLM scoring patterns and identify two key discrepancies: anomalously consistent performance across question difficulties, and high variance in performance on questions of similar difficulty. They also find that teachers grade LLM-generated answers inconsistently and that the models make recurring patterns of mistakes. The results show that GAOKAO-Eval can reveal limitations in LLM capabilities that current benchmarks miss, and highlight the need for more LLM-aligned difficulty analysis. (A minimal illustrative sketch of the Rasch model appears after this table.) |
| Low | GrooveSquid.com (original content) | LLMs are often tested with special exams made by humans, on the assumption that higher scores mean they are getting better at being like humans. But some people worry that the models could get high scores without actually being good at the tasks, for example because they have already seen the test questions. To fix this problem, the researchers created a new test called GAOKAO-Eval that is based on a real college entrance exam in China. They tested several AI models with this new test and found that, even with these improvements, high scores did not really show how well the models could do human-like tasks. The authors used special math from psychology, called the Rasch model, to understand why this might be happening. They found two main problems: the models do about equally well on easy and hard questions but unevenly on questions of the same difficulty, and teachers do not always agree on how well the models answered certain questions. This new test can help us understand what AI models can really do and what they still need to work on. |
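For readers unfamiliar with the Rasch model mentioned in the summaries, below is a minimal, illustrative Python sketch of the Rasch (one-parameter logistic) item response model. It is not the authors' code: the function name, the ability value, and the question difficulties are hypothetical and chosen only to show the shape of the analysis.

```python
# Minimal sketch of the Rasch (1PL) item response model, the kind of analysis
# the paper applies to LLM scores. All numbers are hypothetical; this is not
# the authors' implementation.
import math

def rasch_p_correct(ability: float, difficulty: float) -> float:
    """Rasch model: P(correct) = 1 / (1 + exp(-(ability - difficulty)))."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Hypothetical latent ability for one model and difficulties for a few questions.
model_ability = 1.0
question_difficulties = [-2.0, -0.5, 1.0, 2.5]

# Under the Rasch model, expected accuracy falls smoothly as difficulty rises.
for b in question_difficulties:
    p = rasch_p_correct(model_ability, b)
    print(f"difficulty {b:+.1f}: expected P(correct) = {p:.2f}")
```

Under this model, accuracy for a fixed ability should decline monotonically with question difficulty; the discrepancies the paper reports (flat performance across difficulties and high variance among questions of similar difficulty) are deviations from that expected curve.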