Summary of MMInA: Benchmarking Multihop Multimodal Internet Agents, by Ziniu Zhang et al.
MMInA: Benchmarking Multihop Multimodal Internet Agents
by Ziniu Zhang, Shulin Tian, Liangyu Chen, Ziwei Liu
First submitted to arXiv on: 15 Apr 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The proposed MMInA benchmark evaluates the performance of embodied agents in completing complex user tasks across multiple websites. This multihop and multimodal benchmark assesses an agent’s ability to extract information from web pages, perform actions on multiple sites, and reason about long-range goals. The dataset includes 1,050 human-written tasks covering various domains such as shopping and travel, each of which requires the agent to autonomously navigate through websites. The evaluation protocol measures an agent’s progress in completing multihop tasks, and experiments show that state-of-the-art web agents struggle to solve these tasks. To improve performance, a memory augmentation approach is proposed, which significantly enhances both single-hop and multihop web browsing abilities. |
Low | GrooveSquid.com (original content) | The MMInA benchmark helps evaluate the ability of autonomous embodied agents to complete complex user tasks across multiple websites. The benchmark has three appealing properties: it operates on evolving real-world websites, features naturally compositional tasks that require information from or actions on multiple websites, and assesses an agent’s progress in completing multihop tasks (a minimal sketch of such hop-wise scoring appears after this table). This makes the benchmark realistic and representative of natural user tasks. The dataset includes many human-written tasks covering various domains, and the evaluation protocol helps measure how well agents solve these tasks. |
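
The hop-wise evaluation idea described in the summaries (crediting an agent for each website it completes before failing) can be made concrete with a small sketch. The Python snippet below is a minimal illustration under stated assumptions: the `Hop` structure, the `check` predicates, and the stop-at-first-failure scoring are hypothetical simplifications for exposition, not the paper’s actual protocol.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hop:
    """One website visit within a multihop task (illustrative structure)."""
    website: str
    goal: str
    check: Callable[[List[str]], bool]  # hypothetical: did the trajectory satisfy this hop's goal?


def hopwise_progress(hops: List[Hop], trajectory: List[str]) -> float:
    """Fraction of consecutive hops completed before the first failure.

    Captures the idea of measuring partial progress on multihop tasks;
    the exact metric used in MMInA may differ.
    """
    completed = 0
    for hop in hops:
        if hop.check(trajectory):
            completed += 1
        else:
            break  # later hops depend on earlier ones, so stop at the first failure
    return completed / len(hops) if hops else 0.0


# Hypothetical usage: two hops, only the first is satisfied -> progress = 0.5
hops = [
    Hop("shopping site", "find the cheapest backpack", lambda t: "backpack_found" in t),
    Hop("travel site", "book a matching trip", lambda t: "trip_booked" in t),
]
print(hopwise_progress(hops, trajectory=["backpack_found"]))  # 0.5
```

A whole-task success rate would only count tasks where every hop succeeds; a hop-wise score like this one also rewards partial progress, which is why the summaries describe the protocol as assessing an agent’s progress on long, compositional tasks rather than only final success.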