Summary of A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models, by Haonan Zheng et al.
A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models
by Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li
First submitted to arXiv on: 25 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper introduces Feature Guidance Attack (FGA), a novel method that leverages text representations to guide the perturbation of clean images, generating adversarial images. This approach is orthogonal to unimodal attack strategies and enables the direct application of unimodal research findings to multimodal scenarios. The authors also propose Feature Guidance with Text Attack (FGA-T), which attacks both modalities simultaneously, achieving superior attack effects against Vision-Language Pre-training (VLP) models. FGA-T demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and black-box/white-box settings, serving as a unified baseline for exploring VLP model robustness. An illustrative code sketch of the core idea follows this table. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper shows how text information can be used to quietly change pictures so that smart computer models misread them. The authors created a new method called Feature Guidance Attack (FGA) that turns normal images into tricky "adversarial" ones. What makes it special is that it works alongside older attack tricks built for image-only models, so earlier research can be reused here. The authors also combined it with a text attack to make the overall attack even stronger. They tested the method on different datasets and tasks, and it worked well in all cases. This research helps us understand how powerful models like VLP models can be attacked or fooled. |
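
To make the medium-difficulty description more concrete, here is a minimal, hypothetical sketch of a "feature guidance" style image attack. It assumes a Hugging Face CLIP checkpoint as a stand-in victim VLP model and a PGD-style update that pushes the adversarial image's embedding away from its paired caption's text embedding; the model name, loss, perturbation budget, and the placeholder image/caption are all assumptions for illustration, not the paper's exact method.

```python
import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Victim model: a public CLIP checkpoint stands in for the paper's VLP targets (an assumption).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Placeholder clean image and matching caption (stand-ins for a real image-text pair).
image = Image.fromarray(np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8))
caption = "a photo of a dog"

inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
pixel_values = inputs["pixel_values"]

with torch.no_grad():
    # Text features act as the fixed guidance signal for the image perturbation.
    text_feat = F.normalize(
        model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        ),
        dim=-1,
    )

# Perturbation budget and step size, applied in the processor's normalized pixel space
# for simplicity (a real attack would also project back to valid images).
eps, alpha, steps = 8 / 255, 2 / 255, 10
delta = torch.zeros_like(pixel_values, requires_grad=True)

for _ in range(steps):
    img_feat = F.normalize(
        model.get_image_features(pixel_values=pixel_values + delta), dim=-1
    )
    # Push the image embedding away from the paired text embedding to break cross-modal alignment.
    loss = -F.cosine_similarity(img_feat, text_feat).mean()
    loss.backward()
    with torch.no_grad():
        delta += alpha * delta.grad.sign()  # PGD-style signed-gradient ascent step
        delta.clamp_(-eps, eps)             # stay within the L-infinity budget
    delta.grad.zero_()

adv_pixel_values = pixel_values + delta.detach()
print("perturbation L-inf norm:", delta.abs().max().item())
```

An FGA-T-style extension would additionally perturb the caption (for example, via word substitutions on the tokenized text) and optimize both modalities jointly; that text-side attack is omitted from this sketch.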