
Summary of Position: Theory of Mind Benchmarks are Broken for Large Language Models, by Matthew Riemer et al.


Position: Theory of Mind Benchmarks are Broken for Large Language Models

by Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, Murray Campbell

First submitted to arXiv on: 27 Dec 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None

Abstract of paper · PDF of paper


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper argues that most theory-of-mind benchmarks for large language models (LLMs) are flawed because they cannot directly test how the models adapt to new partners. The issue stems from these benchmarks being inspired by methods used to test theory of mind in humans, which attributes human-like qualities to AI agents: humans are expected to reason consistently across related questions about a situation, but current LLMs do not. Most benchmarks measure “literal” theory of mind, the ability to predict others’ behavior, rather than “functional” theory of mind, the ability to adapt to a partner’s behavior in context. Top open-source LLMs can show strong literal theory of mind, depending on how they are prompted, yet struggle with functional theory of mind even when partner policies are exceedingly simple. The paper concludes that achieving functional theory of mind remains a significant challenge.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper shows that most tests for large language models are misleading because they don’t check how the models work with new partners. The problem comes from testing AI with methods designed for humans, which treats AI as if it thinks the way people do. A person is expected to reason about a situation consistently across questions, but today’s AI models often don’t, so passing these tests doesn’t mean much. The paper says we should focus on how well AI adapts when it works with others, not just on whether it can predict what they will do.
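To make the summarized distinction concrete, here is a minimal, hypothetical sketch (not the authors’ benchmark) of how one might probe “literal” versus “functional” theory of mind against a very simple partner policy. The partner, prompts, scoring, and the placeholder query_model function are all illustrative assumptions, not anything specified in the paper.

```python
# Hypothetical sketch: literal vs. functional theory of mind against a
# simple fixed partner policy (a partner that always plays "rock").
# `query_model` is a placeholder for an LLM call; wire it to a real client.

def query_model(prompt: str) -> str:
    """Placeholder LLM call; should return the model's raw text reply."""
    raise NotImplementedError("Connect this to an actual LLM API.")

def partner_policy(_history: list[str]) -> str:
    # Exceedingly simple partner: always plays "rock".
    return "rock"

def literal_tom_probe() -> bool:
    # Literal theory of mind: can the model *predict* the partner's behavior?
    prompt = (
        "Your partner has played rock in each of the last 5 rounds of "
        "rock-paper-scissors. What will they most likely play next? "
        "Answer with one word."
    )
    return "rock" in query_model(prompt).lower()

def functional_tom_probe(rounds: int = 5) -> float:
    # Functional theory of mind: does the model *adapt* its own play
    # in-context to exploit the partner's behavior (paper beats rock)?
    history: list[str] = []
    wins = 0
    for _ in range(rounds):
        prompt = (
            "You are playing rock-paper-scissors. "
            f"Your partner's moves so far: {history}. "
            "Choose your next move to win this round. Answer with one word."
        )
        move = query_model(prompt).lower().strip()
        partner = partner_policy(history)
        history.append(partner)
        wins += int(move == "paper" and partner == "rock")
    return wins / rounds
```

In a toy setup like this, the summary’s point is that a model can pass the literal probe (correctly predicting “rock”) while still failing to reliably exploit that prediction in the functional probe, which is the gap the paper argues current benchmarks do not measure.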

Keywords

» Artificial intelligence