
Summary of ShowUI: One Vision-Language-Action Model for GUI Visual Agent, by Kevin Qinghong Lin et al.


ShowUI: One Vision-Language-Action Model for GUI Visual Agent

by Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

First submitted to arXiv on: 26 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper presents ShowUI, a vision-language-action model designed to build Graphical User Interface (GUI) assistants that improve human workflow productivity. The model features three innovations: UI-Guided Visual Token Selection, which cuts redundant visual tokens from screenshots; Interleaved Vision-Language-Action Streaming, which manages visual-action history during navigation and pairs multi-turn query-action sequences per screenshot to improve training efficiency; and small-scale, high-quality GUI instruction-following datasets. ShowUI achieves a strong 75.1% accuracy in zero-shot screenshot grounding and reduces redundant visual tokens during training by 33%. Navigation experiments across web, mobile, and online environments underscore the effectiveness and potential of the model.

Low Difficulty Summary (original content by GrooveSquid.com)
The paper creates a new kind of computer assistant that can help people with tasks on their computers or phones. This assistant is special because it understands what’s happening on the screen, not just what’s being said. It uses three new ideas to make this work: picking out only the useful visual tokens from a screenshot, streaming vision, language, and actions together, and training on a small but high-quality dataset. The result is a model that can perform tasks correctly, even if it hasn’t seen them before. This could be very helpful in many areas of life.
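
To make the UI-Guided Visual Token Selection idea from the summaries above more concrete, here is a minimal sketch of how redundant screenshot patches might be collapsed into fewer visual tokens. It is an illustration only, assuming that patches with near-identical average color are redundant; the patch size, grouping rule, and function name select_ui_tokens are hypothetical and not taken from the authors' implementation.

```python
# Minimal sketch of UI-guided visual token selection (illustrative assumptions,
# not the paper's code): patches that look the same contribute only one token.
import numpy as np

def select_ui_tokens(screenshot: np.ndarray, patch: int = 28) -> np.ndarray:
    """Return indices of patches to keep after dropping redundant ones.

    screenshot: H x W x 3 uint8 array with H and W divisible by `patch`.
    """
    h, w, _ = screenshot.shape
    rows, cols = h // patch, w // patch
    keep, seen = [], set()
    for idx in range(rows * cols):
        r, c = divmod(idx, cols)
        tile = screenshot[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch]
        # Summarize each patch by its quantized mean color; identical summaries
        # are treated as the same group and contribute only one token.
        key = tuple((tile.reshape(-1, 3).mean(axis=0) // 8).astype(int))
        if key not in seen:
            seen.add(key)
            keep.append(idx)
    return np.array(keep)

if __name__ == "__main__":
    # Synthetic "screenshot": a flat background with one colored button region.
    img = np.full((280, 560, 3), 240, dtype=np.uint8)
    img[84:140, 140:280] = (30, 100, 220)
    kept = select_ui_tokens(img)
    total = (280 // 28) * (560 // 28)
    print(f"kept {len(kept)} of {total} patch tokens")
```

On this synthetic screenshot, only two representative patches survive out of 200, mirroring the kind of redundant-token reduction the summaries describe for UI screenshots dominated by flat regions.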

Keywords

» Artificial intelligence  » Grounding  » Token  » Zero shot