Summary of A Careful Examination Of Large Language Model Performance on Grade School Arithmetic, by Hugh Zhang et al.


A Careful Examination of Large Language Model Performance on Grade School Arithmetic

by Hugh Zhang, Jeff Da, Dean Lee, Vaughn Robinson, Catherine Wu, Will Song, Tiffany Zhao, Pranav Raja, Charlotte Zhuang, Dylan Slack, Qin Lyu, Sean Hendryx, Russell Kaplan, Michele Lunati, Summer Yue

First submitted to arXiv on: 1 May 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This study investigates the performance of large language models (LLMs) on mathematical reasoning benchmarks. Recent successes by LLMs have raised concerns that some of this performance reflects dataset contamination rather than true reasoning ability. To investigate this claim, the authors commission a new benchmark, Grade School Math 1000 (GSM1k), which mirrors the style and complexity of the established GSM8k benchmark. They compare GSM1k with GSM8k across several metrics, including human solve rates, number of solution steps, and answer magnitude. The results show accuracy drops of up to 8% for some LLMs, with evidence of systematic overfitting across almost all model sizes. Further analysis reveals a positive relationship between a model’s probability of generating an example from GSM8k and its performance gap between GSM8k and GSM1k, suggesting that some models have partially memorized GSM8k. However, many models, especially those on the frontier, show minimal signs of overfitting and demonstrate generalization to novel math problems.
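The memorization probe described above rests on a simple idea: a model's probability of generating a benchmark item is the product of its per-token probabilities, so memorized items get unusually high sequence log-probability. As a minimal sketch of that computation (using a toy, hypothetical stand-in for the model's next-token distribution rather than a real LLM, and invented token pairs purely for illustration):

```python
import math

# Toy stand-in for a language model's next-token distribution.
# A real study would query an LLM for per-token probabilities instead;
# the pairs below are hypothetical "memorized" bigrams for illustration.
MEMORIZED = {("Natalia", "sold"): 0.9, ("sold", "clips"): 0.8}

def toy_token_prob(prefix, token):
    """p(token | prefix): high for 'memorized' continuations, low otherwise."""
    if not prefix:
        return 0.5
    return MEMORIZED.get((prefix[-1], token), 0.05)

def sequence_log_prob(tokens):
    """Log-probability of generating `tokens`, summed token by token:
    log p(t1) + log p(t2 | t1) + ... (the chain rule for sequences)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(toy_token_prob(tokens[:i], tok))
    return total
```

Under this toy distribution, a "seen" benchmark sentence scores a higher log-probability than a novel one of the same length, which is the signal the paper correlates with the GSM8k-to-GSM1k performance gap.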
Low Difficulty Summary (written by GrooveSquid.com, original content)
This study looks at how well language models can solve math problems. Some people think that these models are just cheating by copying answers from a test they’ve seen before, rather than actually understanding the math. To figure out if this is true, researchers created a new math test called GSM1k that’s similar to another popular test called GSM8k. They compared how well language models did on both tests and found that some of them were much better at one test than the other. This suggests that these models might be memorizing specific answers rather than truly understanding the math. However, many models still performed well even when presented with new math problems they hadn’t seen before.

Keywords

» Artificial intelligence  » Generalization  » Overfitting  » Probability