
Summary of Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents, by Boyu Gou et al.


by Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, Yu Su

First submitted to arXiv on: 7 Oct 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper addresses the development of graphical user interface (GUI) agents, which are transitioning from controlled simulations to complex real-world applications across many platforms, a shift driven by multimodal large language models (MLLMs). The effectiveness of such agents hinges on robust grounding capabilities. The authors advocate a human-like embodiment in which GUI agents perceive the environment entirely visually and act directly through pixel-level operations. They propose a simple recipe that uses web-based synthetic data and adapts the LLaVA architecture to train visual grounding models. They collect a large dataset containing 10M GUI elements and their referring expressions over 1.3M screenshots and use it to train UGround, a strong universal visual grounding model. Across six benchmarks, UGround outperforms existing visual grounding models for GUI agents by up to 20% absolute.
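As a rough illustration of what such a grounding model does, the sketch below shows the input/output contract described in the summary: a screenshot plus a natural-language referring expression goes in, and pixel coordinates of the target GUI element come out. The class and method names are hypothetical assumptions for illustration only, not UGround's actual code or API, and the body is a placeholder rather than real model inference.

```python
# Illustrative sketch only: the names below are assumptions, not UGround's API.
# It captures the contract described above:
#   (screenshot, referring expression) -> pixel coordinates of the element.
from dataclasses import dataclass

from PIL import Image


@dataclass
class GroundingResult:
    x: int  # horizontal pixel coordinate of the target GUI element
    y: int  # vertical pixel coordinate of the target GUI element


class VisualGrounder:
    """Hypothetical wrapper around a LLaVA-style visual grounding model."""

    def ground(self, screenshot: Image.Image, expression: str) -> GroundingResult:
        # A real implementation would run the multimodal LLM on the screenshot
        # plus the referring expression (e.g. "the search box at the top") and
        # decode predicted pixel coordinates from its output. As a stand-in,
        # this placeholder simply returns the center of the screenshot.
        width, height = screenshot.size
        return GroundingResult(x=width // 2, y=height // 2)
```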
Low Difficulty Summary (written by GrooveSquid.com, original content)
GUI agents are being transformed by multimodal large language models (MLLMs), which let them move from controlled simulations to real-world applications. But how do they “see” what’s on the screen? The authors propose that GUI agents perceive the environment purely visually and act directly through pixel-level operations. To support this, they use a simple recipe to train visual grounding models that accurately map referring expressions of GUI elements to those elements’ coordinates on the screen. They collect a big dataset containing millions of GUI elements and their referring expressions, and show that the resulting approach works well on six different benchmarks.
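To make the “see the screen, then act at the pixel level” idea concrete, here is a minimal, hypothetical single agent step. It assumes a grounder with the interface sketched above, and uses pyautogui only as a generic way to capture the screen and issue a mouse click; none of this is the paper's actual agent code.

```python
# Hypothetical single agent step: capture the screen, ground a referring
# expression to pixel coordinates, and click there. Assumes a VisualGrounder
# like the sketch above; pyautogui stands in for any pixel-level actuator.
import pyautogui


def act_on(grounder, expression: str) -> None:
    screenshot = pyautogui.screenshot()               # perceive the environment visually
    result = grounder.ground(screenshot, expression)  # referring expression -> (x, y)
    pyautogui.click(x=result.x, y=result.y)           # pixel-level operation, no DOM or accessibility tree needed


# Example (conceptual): act_on(VisualGrounder(), "the search box at the top")
```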

Keywords

» Artificial intelligence  » Grounding  » Synthetic data