Summary of MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula, by Shubhra Mishra et al.
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
by Shubhra Mishra, Gabriel Poesia, Belinda Mo, Noah D. Goodman
First submitted to arXiv on: 1 Jul 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary The paper's original abstract, available on arXiv |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Mathematical problem-solving is an essential capability for Large Language Models (LLMs), serving as a proxy for various reasoning abilities. Existing benchmarks assess a wide range of skills, but aggregate accuracy metrics obscure specific strengths and weaknesses. Moreover, they are challenging to extend with new problems, risking data contamination over time. To address these challenges, the authors propose MathCAMPS: a method to generate high-quality mathematical problems at scale, grounded in 44 fine-grained “standards” from the Mathematics Common Core (CC) Standards for grades K-8. The team encodes each standard in a formal grammar, allowing them to sample diverse symbolic problems and their answers. They then use LLMs to realize the symbolic problems as word problems, and a cycle-consistency method validates that each word problem stays faithful to its symbolic source. Additionally, follow-up questions are derived from the symbolic structures and converted into follow-up word problems, a novel task of mathematical dialogue that probes for robustness in understanding. Experiments on 23 LLMs reveal surprising failures even in the strongest models when asked simple follow-up questions. Furthermore, training checkpoints of Pythia 12B are evaluated on MathCAMPS, enabling analysis of when particular mathematical skills develop during training. (A minimal code sketch of this synthesis pipeline appears below the table.) |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper is about teaching Large Language Models (LLMs) to solve math problems better. Right now, we test LLMs with lots of different math questions, but this makes it hard to see what they’re really good or bad at. Also, it’s difficult to add new math problems without messing up the results. The authors came up with a way to create many high-quality math problems that can be used to test LLMs. They used a special set of guidelines for math education and turned them into symbolic math problems and answers. Then, they used the LLMs to change these symbolic problems into word problems. This helps us understand if the LLMs are really good at understanding math or not. Surprisingly, even the best LLMs didn’t do well when asked simple follow-up questions. The authors also looked at how Pythia 12B learned math skills over time. |
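To make the pipeline from the medium-difficulty summary more concrete, here is a minimal, hypothetical Python sketch; it is not the authors' code. A toy “grammar” for a single skill (adding two whole numbers) samples a symbolic problem and its answer, a fixed template stands in for the LLM that realizes it as a word problem, and a crude number-extraction step stands in for the LLM back-translation used in the cycle-consistency check. All function and class names are illustrative.

```python
import random
from dataclasses import dataclass

# Hypothetical sketch of a MathCAMPS-style pipeline (illustrative only).
# The "grammar" here covers one toy skill: adding two whole numbers.

@dataclass(frozen=True)
class SymbolicProblem:
    a: int
    b: int

    def answer(self) -> int:
        return self.a + self.b

def sample_symbolic_problem(rng: random.Random) -> SymbolicProblem:
    """Sample a symbolic problem (and implicitly its answer) from the toy grammar."""
    return SymbolicProblem(a=rng.randint(1, 99), b=rng.randint(1, 99))

def realize_as_word_problem(p: SymbolicProblem) -> str:
    """In the paper an LLM writes the word problem; a template stands in here."""
    return (f"Maya has {p.a} stickers. Her friend gives her {p.b} more. "
            f"How many stickers does Maya have now?")

def parse_back_to_symbolic(word_problem: str) -> SymbolicProblem:
    """In the paper an LLM translates the word problem back to symbolic form;
    here we simply extract the two numbers to illustrate the idea."""
    numbers = [int(tok) for tok in word_problem.split() if tok.isdigit()]
    return SymbolicProblem(a=numbers[0], b=numbers[1])

def cycle_consistent(p: SymbolicProblem) -> bool:
    """Keep a word problem only if back-translation recovers the original."""
    return parse_back_to_symbolic(realize_as_word_problem(p)) == p

rng = random.Random(0)
problem = sample_symbolic_problem(rng)
print(realize_as_word_problem(problem))
print("Answer:", problem.answer(), "| cycle-consistent:", cycle_consistent(problem))
```

In the system described by the paper, the grammars cover 44 Common Core standards rather than a single toy skill, both the realization and back-translation steps are performed by LLMs, and a generated word problem is retained only when the round trip recovers the original symbolic problem.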