
Summary of VideoGUI: A Benchmark for GUI Automation from Instructional Videos, by Kevin Qinghong Lin et al.


VideoGUI: A Benchmark for GUI Automation from Instructional Videos

by Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen Wu, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou

First submitted to arXiv on: 14 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
This paper proposes VideoGUI, a novel benchmark for evaluating Graphical User Interface (GUI) automation assistants on complex, visual-centric GUI tasks. Whereas existing task formulations focus primarily on simple language-only instructions, VideoGUI covers professional and novel software, such as Adobe Photoshop and Stable Diffusion WebUI, and complex activities like video editing. The benchmark evaluates GUI assistants hierarchically to pinpoint the level at which they fail: high-level planning, middle-level planning, or atomic action execution. For each level, the paper designs evaluation metrics across individual dimensions to provide clear signals. The authors also evaluate the state-of-the-art (SOTA) large multimodal model GPT-4o on VideoGUI and find that it performs poorly on visual-centric GUI tasks, especially in high-level planning.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about helping people be more productive by using computers better. Right now, most computer automation tools can only do simple things when told exactly what to do. The authors created a new way to test these tools called VideoGUI, which focuses on complex tasks like editing videos or using special software like Adobe Photoshop. They designed this benchmark to see how well the tools perform at different levels: planning out what to do, breaking down big tasks into smaller ones, and actually doing the actions. The authors tested a top-performing AI model on VideoGUI and found that it struggled with these complex tasks.
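
For readers who think in code, the hierarchical evaluation described in the summaries can be pictured roughly as follows. This is only an illustrative sketch, not the paper's implementation: the three level names come from the summary, while the `LevelResult` type, the `first_failing_level` helper, the 0.5 threshold, and the example scores are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Level names follow the summary, ordered from most abstract to most concrete.
LEVELS = (
    "high_level_planning",
    "middle_level_planning",
    "atomic_action_execution",
)


@dataclass
class LevelResult:
    """Hypothetical container: one score per evaluation level."""
    level: str
    score: float  # assumed to be normalized to the range [0, 1]


def first_failing_level(
    results: Iterable[LevelResult], threshold: float = 0.5
) -> Optional[str]:
    """Walk the levels top-down and return the first one scoring below threshold.

    Returns None when every level meets the threshold. A level with no
    reported score is treated as failing. The threshold is an assumption
    made for illustration only.
    """
    by_level = {r.level: r.score for r in results}
    for level in LEVELS:
        if by_level.get(level, 0.0) < threshold:
            return level
    return None


# Made-up scores, purely to show how the helper is called.
example = [
    LevelResult("high_level_planning", 0.2),
    LevelResult("middle_level_planning", 0.6),
    LevelResult("atomic_action_execution", 0.7),
]
print(first_failing_level(example))
```

Checking the levels top-down mirrors the idea in the summary that a failure can be localized to planning or to execution, rather than being reported as a single pass/fail outcome.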

Keywords

» Artificial intelligence  » Diffusion