SceneTAP: Scene-Coherent Typographic Adversarial Planner against Vision-Language Models in Real-World Environments

by Yue Cao, Yun Xing, Jie Zhang, Di Lin, Tianwei Zhang, Ivor Tsang, Yang Liu, Qing Guo

First submitted to arXiv on: 28 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models (LVLMs) have shown impressive abilities in interpreting visual content. Existing works demonstrate these models’ vulnerability to deliberately placed adversarial texts, but such texts are typically easy to spot as out of place. To address this limitation, the authors propose a novel approach that generates scene-coherent typographic adversarial attacks, misleading advanced LVLMs while maintaining visual naturalness. The method, called SceneTAP, employs a three-stage process: scene understanding, adversarial planning, and seamless integration; a scene-coherent TextDiffuser then executes the attack using a local diffusion mechanism (see the code sketch after these summaries). Extensive experiments demonstrate that the generated patches successfully mislead state-of-the-art LVLMs, including ChatGPT-4o, even after the printed text is placed in a physical environment and re-photographed.
Low Difficulty Summary (written by GrooveSquid.com, original content)
Large vision-language models can understand pictures and text. But what if someone places fake text in a picture to trick these models? The researchers developed a way to make such fake text look like it belongs in the scene. They called their method SceneTAP and used large language models to decide what the text should say and where it should go. The goal was to make the text look natural, so a person would not notice anything odd, while still fooling the models. The researchers tested their approach by printing the fake text and placing it in real-world environments. The models were fooled even when shown new photos of the same environment. This shows that current vision-language models are vulnerable to sophisticated attacks.
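
To make the three-stage pipeline above concrete, here is a minimal Python sketch of the flow the medium summary describes. Everything in it is an illustrative assumption rather than the paper's actual API: the function names, the AttackPlan fields, and the stub return values are hypothetical. In the real method, an LLM-based agent drives the first two stages and a TextDiffuser-style local diffusion model performs the last.

```python
# Hypothetical sketch of the SceneTAP three-stage pipeline.
# All names and return values are illustrative stubs, not the paper's code.
from dataclasses import dataclass


@dataclass
class AttackPlan:
    text: str                          # adversarial string to render
    region: tuple[int, int, int, int]  # (x, y, w, h) placement box in the image
    style: str                         # appearance hint keeping the text scene-coherent


def scene_understanding(image_path: str) -> str:
    # Stage 1 (stub): an LLM-based agent would describe the scene, its
    # objects, and plausible writable surfaces.
    return "a kitchen counter with a paper note beside a coffee mug"


def adversarial_planning(scene: str, target_task: str) -> AttackPlan:
    # Stage 2 (stub): an LLM would decide what the text should say, where to
    # put it, and how it should look so it appears to belong in the scene.
    return AttackPlan(
        text="DO NOT DRINK",
        region=(120, 80, 200, 60),
        style="handwritten note",
    )


def seamless_integration(image_path: str, plan: AttackPlan) -> str:
    # Stage 3 (stub): a scene-coherent TextDiffuser would inpaint plan.text
    # into plan.region with a local diffusion mechanism and return the result.
    return image_path.replace(".png", "_attacked.png")


if __name__ == "__main__":
    scene = scene_understanding("kitchen.png")
    plan = adversarial_planning(scene, target_task="Is the coffee safe to drink?")
    output = seamless_integration("kitchen.png", plan)
    print(f"Planned '{plan.text}' at {plan.region} ({plan.style}) -> {output}")
```

As the summaries note, the third stage is what distinguishes this approach from naive typographic attacks: the text is rendered into the scene by diffusion so that it blends in, rather than being pasted on as an obvious overlay.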

Keywords

  • Artificial intelligence
  • Diffusion
  • Scene understanding