
Summary of MMToM-QA: Multimodal Theory of Mind Question Answering, by Chuanyang Jin et al.


MMToM-QA: Multimodal Theory of Mind Question Answering

by Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B. Tenenbaum, Tianmin Shu

First submitted to arXiv on: 16 Jan 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
In this paper, the researchers work toward machines with human-level social intelligence by creating a new benchmark for Theory of Mind (ToM) understanding. Existing ToM benchmarks rely on unimodal data, such as video or text alone, whereas humans can reason about others’ mental states from any available modality. The authors introduce the MMToM-QA benchmark to evaluate machine ToM on multimodal data as well as on different kinds of unimodal data. They also propose a novel method, BIP-ALM (Bayesian Inverse Planning Accelerated by Language Models), which extracts unified representations from multimodal inputs and uses language models for scalable Bayesian inverse planning. Comparing human performance with state-of-the-art models, including GPT-4, the authors show that large language models and multimodal models still lack robust ToM capacity, while BIP-ALM shows promising results.
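To make the idea of Bayesian inverse planning concrete, here is a minimal, hypothetical sketch of the general technique described above: maintain a posterior over candidate goals and update it with the likelihood of each observed action, where the likelihood comes from a language-model-style scorer. This is an illustration of the generic approach, not the authors’ BIP-ALM implementation; the `lm_score` function below is an assumed placeholder for a real LM-based scorer.

```python
import math

def lm_score(state: str, goal: str, action: str) -> float:
    """Hypothetical stand-in for a language model returning
    log P(action | state, goal). A real system would prompt an LM
    with the state and goal descriptions and score the action."""
    # Toy heuristic: actions that mention the goal object are more likely.
    return 0.0 if goal.split()[-1] in action else -2.0

def bayesian_inverse_planning(goals, actions, state):
    """Infer a posterior over goals from observed actions.

    Starts from a uniform prior over candidate goals, then multiplies
    in the (LM-approximated) likelihood of each observed action and
    renormalizes after every observation.
    """
    log_post = {g: -math.log(len(goals)) for g in goals}  # uniform prior
    for action in actions:
        for g in goals:
            log_post[g] += lm_score(state, g, action)
        # Normalize in log space for numerical stability.
        log_z = math.log(sum(math.exp(v) for v in log_post.values()))
        log_post = {g: v - log_z for g, v in log_post.items()}
    return {g: math.exp(v) for g, v in log_post.items()}

# Example: which goal best explains these observed actions?
goals = ["get the apple", "get the book"]
actions = ["walk to the kitchen", "open the fridge near the apple"]
print(bayesian_inverse_planning(goals, actions, state="agent is in the hallway"))
```

In the paper’s setting the candidate goals, states, and actions come from multimodal observations (video and text) mapped into a unified representation; the sketch above only shows how an LM-scored likelihood can drive the Bayesian update.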
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is about teaching machines to understand people’s thoughts and feelings. Right now, the tests we give machines use only one kind of data at a time, like videos or text. But humans can figure out what someone is thinking based on all kinds of information. The researchers created a new test called MMToM-QA that checks whether machines can do this too. They also came up with a special way for the machine to reason about people’s thoughts, combining language models with other techniques. When they tested it, they found that even very smart machines aren’t very good at understanding people’s minds yet, but their new method shows promise.

Keywords

  • Artificial intelligence
  • GPT