Summary of Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale, by Rogerio Bonatti et al.
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale
by Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui
First submitted to arXiv on: 12 Sep 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper introduces Windows Agent Arena, a framework for measuring how well large language models (LLMs) perform as computer agents. The arena provides a realistic environment in which agents operate freely within a real Windows operating system, using its applications and tools to solve tasks. The authors adapt the OSWorld framework to create diverse Windows tasks that require planning, screen understanding, and tool usage. They also introduce a new multi-modal agent called Navi, which achieves a 19.5% success rate in the Windows domain, compared with 74.5% for an unassisted human. The authors provide extensive quantitative and qualitative analysis of Navi’s performance and discuss opportunities for future research. |
Low | GrooveSquid.com (original content) | The paper creates a new environment called Windows Agent Arena that lets large language models work as computer agents on a real Windows operating system. This helps measure how well the models handle tasks that require planning, understanding screens, and using tools. The authors also create a new agent called Navi, which succeeds on 19.5% of the tasks, compared with 74.5% for an unassisted human. |
Keywords
» Artificial intelligence » Multi-modal