Summary of Position: Theory of Mind Benchmarks are Broken for Large Language Models, by Matthew Riemer et al.
Position: Theory of Mind Benchmarks are Broken for Large Language Models
by Matthew Riemer, Zahra Ashktorab, Djallel Bouneffouf, Payel Das, Miao Liu, Justin D. Weisz, Murray Campbell
First submitted to arXiv on: 27 Dec 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | The paper argues that most theory of mind benchmarks for large language models (LLMs) are broken because they cannot directly test how the models adapt to new partners. This problem stems from the benchmarks being inspired by tests designed for humans, which attributes human-like qualities to AI agents. Humans are expected to reason consistently across multiple questions about a situation, but current LLMs often do not. At best, question-based benchmarks measure "literal" theory of mind, the ability to predict others' behavior, rather than "functional" theory of mind, the ability to adapt to a partner's behavior in context. Top open-source LLMs can display strong literal theory of mind, depending on how they are prompted, yet struggle with functional theory of mind even when partner policies are exceptionally simple. The paper concludes that achieving functional theory of mind remains a significant challenge. |
| Low | GrooveSquid.com (original content) | This paper shows that most tests of "theory of mind" for large language models are flawed because they never check how the models work with new partners. The problem comes from reusing tests designed for humans: those tests assume the test-taker thinks consistently from one question to the next, but language models often do not. The paper says we should measure what happens when an AI actually works with a partner, not just whether it can predict what that partner will do. |
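To make the literal vs. functional distinction concrete, here is a minimal, hypothetical sketch. It is not the paper's actual benchmark: the toy coordination game, the repeat-last-move partner, and the `query_model` stub are all assumptions introduced only for illustration. A literal probe asks the model to predict the partner's next move; a functional probe scores the reward the model earns by adapting its own move to that partner.

```python
# Hypothetical illustration (not the paper's benchmark): contrast a "literal"
# theory-of-mind probe (predict the partner) with a "functional" one (adapt to
# the partner) in a toy coordination game where matching moves earns a point.

def query_model(prompt: str) -> str:
    """Stand-in for an LLM call; always answers "A" so the sketch runs offline."""
    return "A"

def partner_policy(history: list) -> str:
    """An exceptionally simple partner: repeats its previous move, starting with "B"."""
    return history[-1] if history else "B"

def literal_tom_accuracy(rounds: int = 5) -> float:
    """Literal ToM: how often does the model correctly *predict* the partner's move?"""
    history, correct = [], 0
    for _ in range(rounds):
        actual = partner_policy(history)
        guess = query_model(f"Partner moves so far: {history}. Predict the next move (A or B).")
        correct += int(guess.strip().upper() == actual)
        history.append(actual)
    return correct / rounds

def functional_tom_reward(rounds: int = 5) -> float:
    """Functional ToM: how much reward does the model earn by *adapting* its own move?"""
    history, reward = [], 0
    for _ in range(rounds):
        partner_move = partner_policy(history)
        own_move = query_model(f"Partner moves so far: {history}. Pick your move (A or B) to match the partner.")
        reward += int(own_move.strip().upper() == partner_move)  # payoff only when moves match
        history.append(partner_move)
    return reward / rounds

if __name__ == "__main__":
    print("literal ToM accuracy:", literal_tom_accuracy())
    print("functional ToM reward:", functional_tom_reward())
```

With a real model behind `query_model`, a system could in principle score well on the prediction probe while scoring poorly on the adaptation probe, which is the gap the paper highlights.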