Summary of ShowUI: One Vision-Language-Action Model for GUI Visual Agent, by Kevin Qinghong Lin et al.
ShowUI: One Vision-Language-Action Model for GUI Visual Agent
by Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou
First submitted to arXiv on: 26 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Human-Computer Interaction (cs.HC)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper presents ShowUI, a vision-language-action model designed to boost human workflow productivity by serving as a Graphical User Interface (GUI) assistant. The model features three innovations: UI-guided visual token selection, interleaved vision-language-action streaming, and small-scale high-quality GUI instruction-following datasets. The token selection cuts computational cost by pruning visually redundant screenshot tokens, while the interleaved streaming manages visual-action history in navigation and pairs multi-turn query-action sequences per screenshot to improve training efficiency (a rough sketch of the token-selection idea follows this table). ShowUI achieves a strong 75.1% accuracy in zero-shot screenshot grounding, and its token selection removes 33% of redundant visual tokens during training. Navigation experiments across web, mobile, and online environments underscore the model’s effectiveness and potential. |
Low | GrooveSquid.com (original content) | The paper creates a new kind of computer assistant that can help people with tasks on their computers or phones. This assistant is special because it understands what’s happening on the screen, not just what’s being said. It uses three new ideas to make this work: keeping only the useful visual pieces of a screenshot, streaming vision, language, and actions together, and building a small but high-quality set of training examples. The result is a model that can perform tasks correctly even if it hasn’t seen them before. This could be very helpful in many areas of life. |
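The sketch referenced in the medium summary is below. It is a minimal, self-contained Python illustration of the general idea behind UI-guided visual token selection, under simplifying assumptions: the screenshot is split into fixed-size patches, neighbouring patches with near-identical mean colour are merged with a union-find, and only a few representative tokens are kept from each large uniform component. The function names, the patch size of 28, the colour tolerance, and the "keep one representative" rule are illustrative choices, not the authors' implementation.

```python
import numpy as np

PATCH = 28        # illustrative patch size, not the authors' setting
COLOR_TOL = 2.0   # max per-channel colour gap to treat patches as "the same"


def patch_colors(image: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Split an H x W x 3 screenshot into a grid of mean-RGB patch descriptors."""
    h, w, _ = image.shape
    gh, gw = h // patch, w // patch
    grid = image[: gh * patch, : gw * patch].reshape(gh, patch, gw, patch, 3)
    return grid.mean(axis=(1, 3))  # shape (gh, gw, 3)


def redundant_components(colors: np.ndarray, tol: float = COLOR_TOL) -> np.ndarray:
    """Union-find over 4-connected neighbours whose mean colours nearly match.

    Large components correspond to visually uniform regions (backgrounds,
    blank panels) whose visual tokens carry mostly redundant information.
    """
    gh, gw, _ = colors.shape
    parent = np.arange(gh * gw)

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i in range(gh):
        for j in range(gw):
            idx = i * gw + j
            if j + 1 < gw and np.abs(colors[i, j] - colors[i, j + 1]).max() < tol:
                union(idx, idx + 1)   # merge with right neighbour
            if i + 1 < gh and np.abs(colors[i, j] - colors[i + 1, j]).max() < tol:
                union(idx, idx + gw)  # merge with bottom neighbour
    return np.array([find(x) for x in range(gh * gw)]).reshape(gh, gw)


def select_tokens(components: np.ndarray, keep_per_component: int = 1) -> np.ndarray:
    """Boolean keep-mask over patch tokens.

    Small components (likely informative UI elements) keep all their tokens;
    large uniform components keep only a few random representatives.
    """
    flat = components.ravel()
    keep = np.zeros(flat.shape, dtype=bool)
    rng = np.random.default_rng(0)
    for comp in np.unique(flat):
        members = np.flatnonzero(flat == comp)
        if len(members) <= 4:
            keep[members] = True
        else:
            keep[rng.choice(members, size=keep_per_component, replace=False)] = True
    return keep.reshape(components.shape)


if __name__ == "__main__":
    # Synthetic screenshot: flat light background with one dark "button" region.
    screenshot = np.full((728, 1260, 3), 240, dtype=np.uint8)
    screenshot[300:360, 500:700] = 30
    mask = select_tokens(redundant_components(patch_colors(screenshot)))
    print(f"kept {mask.sum()} of {mask.size} visual tokens")
```

Union-find is a natural fit for this kind of grouping because near-identical neighbouring patches chain together into one component in near-linear time, which keeps a per-screenshot "one token per uniform region" rule cheap to apply.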
Keywords
» Artificial intelligence » Grounding » Token » Zero shot