Summary of MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs, by Zhongshen Zeng et al.
MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs
by Zhongshen Zeng, Yinhong Liu, Yingjia Wan, Jingyao Li, Pengguang Chen, Jianbo Dai, Yuxuan Yao, Rongwu Xu, Zehan Qi, Wanru Zhao, Linling Shen, Jianqiao Lu, Haochen Tan, Yukang Chen, Hao Zhang, Zhan Shi, Bailin Wang, Zhijiang Guo, Jiaya Jia
First submitted to arXiv on: 20 Jun 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Large language models (LLMs) have excelled at problem-solving and decision-making thanks to step-by-step chain-of-thought reasoning. However, evaluating these abilities has become difficult as existing benchmarks begin to saturate. To address this, the authors present MR-Ben, a process-based benchmark that demands meta-reasoning skills: models must locate and analyze potential errors in automatically generated reasoning steps (an illustrative sketch of this kind of evaluation follows the table). This paradigm suits system-2 slow thinking and mirrors human cognitive processes. The benchmark comprises 5,975 expert-curated questions spanning subjects such as physics, chemistry, logic, and coding. Using metrics designed for this process-based setting, the authors identify limitations and weaknesses of current open-source and closed-source LLMs, revealing shortcomings in training strategies and inference methodologies. For example, OpenAI’s o1 series performs strongly by scrutinizing the solution space, while many other state-of-the-art models fall behind on MR-Ben. |
Low | GrooveSquid.com (original content) | This paper is about making sure that big computers called Large Language Models can really think through problems. Right now, it’s hard to tell if they’re doing a good job because we don’t have the right way to test them. The scientists came up with a new way to test these models: showing them worked-out answers to puzzles and asking them to spot where those answers go wrong. They created a huge set of questions that cover lots of different subjects like science, math, and computer programming. By looking at how well the computers do on this test, we can see what they’re good at and what they need to work on. This is important because these computers are getting more powerful every day, and we want to make sure they’re being used in the right ways. |
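The medium-difficulty summary describes MR-Ben's core paradigm: a model is shown a question together with a machine-generated step-by-step solution and must judge whether the reasoning holds up and, if not, pinpoint where it first goes wrong. As a rough illustration of how such an item might be represented and scored, here is a minimal Python sketch; the dataclasses, field names, and the two metrics below are assumptions made for exposition, not MR-Ben's actual data schema or evaluation code.

```python
# Illustrative sketch only: field names and metrics are hypothetical,
# not the benchmark's real implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetaReasoningItem:
    question: str                     # the original problem statement
    solution_steps: list[str]         # an automatically generated chain of thought
    solution_is_correct: bool         # expert label for the whole solution
    first_error_step: Optional[int]   # index of the first flawed step, if any


@dataclass
class ModelJudgement:
    predicted_correct: bool           # model's verdict on the solution
    predicted_error_step: Optional[int]  # model's guess at the first flawed step


def score(items: list[MetaReasoningItem], preds: list[ModelJudgement]) -> dict:
    """Two hypothetical process-level metrics: did the model judge the solution's
    correctness, and did it point at the right step when the solution is flawed?"""
    correct_judgements = sum(
        p.predicted_correct == it.solution_is_correct for it, p in zip(items, preds)
    )
    flawed = [(it, p) for it, p in zip(items, preds) if not it.solution_is_correct]
    located = sum(p.predicted_error_step == it.first_error_step for it, p in flawed)
    return {
        "correctness_accuracy": correct_judgements / len(items),
        "error_localization_accuracy": located / len(flawed) if flawed else None,
    }


# Example with a single made-up item:
item = MetaReasoningItem(
    question="What is 17 * 24?",
    solution_steps=["17 * 24 = 17 * 20 + 17 * 4", "= 340 + 68", "= 398"],
    solution_is_correct=False,
    first_error_step=2,   # 340 + 68 = 408, not 398, so the step at index 2 is wrong
)
pred = ModelJudgement(predicted_correct=False, predicted_error_step=2)
print(score([item], [pred]))
# {'correctness_accuracy': 1.0, 'error_localization_accuracy': 1.0}
```

In a sketch like this, evaluating the verdict and the error location separately captures the process-based spirit of the benchmark: a model can call a solution wrong for the wrong reason, and only step-level checks expose that.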
Keywords
* Artificial intelligence
* Inference