Summary of SPA-Bench: A Comprehensive Benchmark for Smartphone Agent Evaluation, by Jingxuan Chen et al.
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
by Jingxuan Chen, Derek Yuen, Bin Xie, Yuhao Yang, Gongwei Chen, Zhihao Wu, Li Yixing, Xurui Zhou, Weiwen Liu, Shuai Wang, Kaiwen Zhou, Rui Shao, Liqiang Nie, Yasheng Wang, Jianye Hao, Jun Wang, Kun Shao
First submitted to arXiv on: 19 Oct 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper presents a comprehensive benchmark, named SPA-Bench, to evaluate multimodal large language model (MLLM)-based smartphone agents. The benchmark features a diverse set of tasks simulating real-world conditions, covering system and third-party apps in both English and Chinese. It also provides a plug-and-play framework for integrating more than ten MLLM-based agents with Android devices. The evaluation pipeline assesses agent performance across multiple dimensions using seven metrics related to task completion and resource consumption (see the illustrative sketch below the table). Experiments reveal challenges such as interpreting mobile user interfaces, action grounding, memory retention, and execution costs. |
Low | GrooveSquid.com (original content) | Smartphone agents are important for helping users control their devices efficiently, and MLLM-based approaches have emerged as key contenders. Fairly comparing these agents is essential but challenging. The paper presents a comprehensive benchmark to evaluate MLLM-based smartphone agents. It includes diverse tasks simulating real-world conditions and provides a framework for integrating multiple agents with Android devices. The evaluation pipeline assesses agent performance using seven metrics related to task completion and resource consumption. |
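To make the "plug-and-play framework plus multi-metric evaluation pipeline" description more concrete, here is a minimal Python sketch. It is an illustration only: the class names, method signatures, and metric names (`SmartphoneAgent`, `act`, `summarize`, `success_rate`, etc.) are assumptions for this summary and do not reflect the actual SPA-Bench code or API; it simply shows how an agent interface and task-completion / resource-consumption metrics could fit together.

```python
# Hypothetical sketch only -- NOT the SPA-Bench API.
from dataclasses import dataclass
from typing import List


@dataclass
class EpisodeResult:
    """Outcome of one benchmark task episode run by an agent (assumed schema)."""
    task_id: str
    completed: bool       # did the agent finish the task?
    steps: int            # number of UI actions taken
    seconds: float        # wall-clock execution time
    api_cost_usd: float   # (M)LLM API spend for the episode


class SmartphoneAgent:
    """Minimal interface a harness might expect from a plug-and-play agent."""

    def act(self, screenshot_path: str, instruction: str) -> str:
        """Return the next UI action (e.g. 'tap 120 540') for the current screen."""
        raise NotImplementedError


def summarize(results: List[EpisodeResult]) -> dict:
    """Aggregate simple task-completion and resource-consumption metrics."""
    n = len(results)
    return {
        "success_rate": sum(r.completed for r in results) / n,
        "avg_steps": sum(r.steps for r in results) / n,
        "avg_time_s": sum(r.seconds for r in results) / n,
        "total_cost_usd": sum(r.api_cost_usd for r in results),
    }


if __name__ == "__main__":
    demo = [
        EpisodeResult("open_settings", True, 4, 21.3, 0.02),
        EpisodeResult("send_message", False, 12, 75.0, 0.08),
    ]
    print(summarize(demo))  # {'success_rate': 0.5, 'avg_steps': 8.0, ...}
```

In the benchmark itself, agents interact with real Android devices and the seven metrics are defined by the paper; the sketch above only mirrors the general shape of such an evaluation pipeline.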
Keywords
» Artificial intelligence » Grounding » Large language model