
Summary of SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation, by Jingxuan Chen et al.


SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation

by Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao

First submitted to arXiv on: 19 Oct 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper presents SPA-Bench, a comprehensive benchmark for evaluating multimodal large language model (MLLM)-based smartphone agents. The benchmark features a diverse set of tasks that simulate real-world conditions, covering system and third-party apps in both English and Chinese. It also provides a plug-and-play framework for integrating more than ten MLLM-based agents with Android devices. The evaluation pipeline assesses agent performance across multiple dimensions using seven metrics related to task completion and resource consumption. Experiments reveal challenges such as interpreting mobile user interfaces, action grounding, memory retention, and execution costs.

Low Difficulty Summary (written by GrooveSquid.com, original content)
Smartphone agents help users control their devices efficiently, and MLLM-based approaches have emerged as key contenders. Comparing these agents fairly is essential but challenging. The paper presents a comprehensive benchmark for evaluating MLLM-based smartphone agents. It includes diverse tasks that simulate real-world conditions and provides a framework for integrating multiple agents with Android devices. The evaluation pipeline assesses agent performance using seven metrics related to task completion and resource consumption.

Keywords

» Artificial intelligence  » Grounding  » Large language model