
Summary of Position: Theory of Mind Benchmarks are Broken for Large Language Models, by Matthew Riemer et al.


Position: Theory of Mind Benchmarks are Broken for Large Language Models

by Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, Murray Campbell

First submitted to arXiv on: 27 Dec 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None

Abstract of paper · PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper argues that most theory-of-mind benchmarks for large language models (LLMs) are flawed because they cannot directly test how the models adapt to new partners. The issue stems from these benchmarks being inspired by methods used to test theory of mind in humans, which attributes human-like qualities to AI agents: humans are expected to reason consistently across related questions about a situation, but current LLMs do not. Most benchmarks measure “literal” theory of mind, the ability to predict others’ behavior, rather than “functional” theory of mind, the ability to adapt to a partner’s behavior in context. Top open-source LLMs can show strong literal theory of mind, depending on how they are prompted, yet struggle with functional theory of mind even when partner policies are exceedingly simple. The paper concludes that achieving functional theory of mind remains a significant challenge.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper shows that most tests for large language models are misleading because they don’t check how the models work with new partners. The problem comes from testing AI with methods designed for humans, which treats AI as if it thinks the way people do. A person is expected to reason about a situation consistently across questions, but today’s AI models often don’t, so passing these tests doesn’t mean much. The paper says we should focus on how well AI adapts when it works with others, not just on whether it can predict what they will do.
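To make the summarized distinction concrete, here is a minimal, hypothetical sketch (not the authors’ benchmark) of how one might probe “literal” versus “functional” theory of mind against a very simple partner policy. The partner, prompts, scoring, and the placeholder query_model function are all illustrative assumptions, not anything specified in the paper.

```python
# Hypothetical sketch: literal vs. functional theory of mind against a
# simple fixed partner policy (a partner that always plays "rock").
# `query_model` is a placeholder for an LLM call; wire it to a real client.

def query_model(prompt: str) -> str:
    """Placeholder LLM call; should return the model's raw text reply."""
    raise NotImplementedError("Connect this to an actual LLM API.")

def partner_policy(_history: list[str]) -> str:
    # Exceedingly simple partner: always plays "rock".
    return "rock"

def literal_tom_probe() -> bool:
    # Literal theory of mind: can the model *predict* the partner's behavior?
    prompt = (
        "Your partner has played rock in each of the last 5 rounds of "
        "rock-paper-scissors. What will they most likely play next? "
        "Answer with one word."
    )
    return "rock" in query_model(prompt).lower()

def functional_tom_probe(rounds: int = 5) -> float:
    # Functional theory of mind: does the model *adapt* its own play
    # in-context to exploit the partner's behavior (paper beats rock)?
    history: list[str] = []
    wins = 0
    for _ in range(rounds):
        prompt = (
            "You are playing rock-paper-scissors. "
            f"Your partner's moves so far: {history}. "
            "Choose your next move to win this round. Answer with one word."
        )
        move = query_model(prompt).lower().strip()
        partner = partner_policy(history)
        history.append(partner)
        wins += int(move == "paper" and partner == "rock")
    return wins / rounds
```

In a toy setup like this, the summary’s point is that a model can pass the literal probe (correctly predicting “rock”) while still failing to reliably exploit that prediction in the functional probe, which is the gap the paper argues current benchmarks do not measure.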

Keywords

» Artificial intelligence