Summary of ToMBench: Benchmarking Theory of Mind in Large Language Models, by Zhuang Chen et al.


ToMBench: Benchmarking Theory of Mind in Large Language Models

by Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang

First submitted to arxiv on: 23 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Recent research has sparked debate over whether large language models (LLMs) possess Theory of Mind (ToM), the cognitive ability to perceive mental states. However, existing ToM evaluations face challenges such as constrained scope, subjective judgment, and unintended contamination, leading to inadequate assessments. The authors introduce ToMBench, a systematic evaluation framework that addresses these limitations. It features 8 tasks, 31 abilities in social cognition, multiple-choice questions for automated and unbiased evaluation, and a bilingual inventory built from scratch to prevent data leakage. The authors evaluate the ToM performance of 10 popular LLMs using ToMBench, finding that even advanced models like GPT-4 lag behind human performance by over 10 percentage points. This suggests that LLMs have not yet achieved human-level ToM capabilities. ToMBench aims to enable efficient and effective evaluation of LLMs' ToM capabilities, facilitating the development of LLMs with inherent social intelligence.

Low Difficulty Summary (original content by GrooveSquid.com)
Imagine trying to figure out what someone else is thinking or feeling. That's Theory of Mind (ToM). Scientists want to know if super smart computer programs called Large Language Models (LLMs) can do this too. But there are problems with how we test ToM now. The solution is a new way to evaluate LLMs' ToM, called ToMBench. It has 8 tasks and helps us see what the LLMs are good at. Scientists tested 10 different LLMs using ToMBench and found that even the smartest ones aren't as good as humans yet. They want to use ToMBench to make better computer programs that can understand people's thoughts and feelings.

Keywords

» Artificial intelligence  » GPT