Summary of Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach, by Jiwei Guan et al.
Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach
by Jiwei Guan, Tianyu Ding, Longbing Cao, Lei Pan, Chen Wang, Xi Zheng
First submitted to arXiv on: 24 Aug 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | Recent vision-language pretraining (VLP) transformers have shown exceptional performance across various multimodal tasks, but their adversarial robustness has not been thoroughly investigated. This paper examines the vulnerability of recent VLP transformers and proposes a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces perturbations in both the visual and textual modalities under white-box settings. JMTFA targets attention relevance scores to disrupt important features within each modality, generating adversarial samples that lead to erroneous model predictions (an illustrative toy sketch of a joint image-and-text attack appears after this table). Experimental results demonstrate higher attack success rates than existing baselines on vision-language understanding and reasoning downstream tasks. Notably, the paper finds that the textual modality strongly influences the complex fusion process inside VLP transformers, and that there is no apparent relationship between model size and adversarial robustness under the proposed attack. These findings highlight a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems. |
Low | GrooveSquid.com (original content) | This paper looks at how well vision-language pretraining (VLP) transformers work when faced with tricky questions or fake data. VLP transformers are really good at understanding images and text together, but they can be tricked into making mistakes if the information is manipulated in certain ways. The researchers created a new way to test these transformers by adding fake details to both images and text. They found that this approach was very effective at making the transformers make mistakes. This study shows that we need to be careful when using VLP transformers, especially if they’re going to be used for important tasks. |
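
For readers who want a concrete picture of what a joint image-and-text white-box attack can look like, here is a minimal, self-contained PyTorch sketch. It is not the authors' JMTFA implementation: the toy fusion model, the PGD-style loss-gradient objective, and every hyperparameter below are illustrative assumptions, and JMTFA's attention-relevance-guided feature objective is not reproduced here.

```python
# Toy sketch of a joint image+text white-box attack (NOT the paper's JMTFA).
# It perturbs image pixels and continuous text embeddings with PGD-style steps
# that increase the classification loss of a small fusion model.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyFusionModel(nn.Module):
    """Toy stand-in for a VLP transformer: image patches + text embeddings -> class logits."""
    def __init__(self, dim=64, n_classes=3, vocab=100):
        super().__init__()
        self.img_proj = nn.Linear(3 * 16 * 16, dim)          # flattened 16x16 RGB patches -> tokens
        self.txt_embed = nn.Embedding(vocab, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, img_patches, txt_embeds):
        img_tokens = self.img_proj(img_patches)               # (B, num_patches, dim)
        tokens = torch.cat([img_tokens, txt_embeds], dim=1)   # concatenate both modalities
        fused, _ = self.attn(tokens, tokens, tokens)          # joint self-attention over modalities
        return self.head(fused.mean(dim=1))                   # mean-pool, then classify

model = ToyFusionModel().eval()

# One toy "sample": 4 image patches, 5 text tokens, and a ground-truth label.
img = torch.rand(1, 4, 3 * 16 * 16)
txt_ids = torch.randint(0, 100, (1, 5))
label = torch.tensor([0])
txt = model.txt_embed(txt_ids).detach()                       # attack the continuous text embeddings

img_adv, txt_adv = img.clone(), txt.clone()
eps_img, eps_txt, alpha, steps = 8 / 255, 0.05, 0.01, 20      # illustrative budgets and step size

for _ in range(steps):
    img_adv.requires_grad_(True)
    txt_adv.requires_grad_(True)
    loss = nn.functional.cross_entropy(model(img_adv, txt_adv), label)
    g_img, g_txt = torch.autograd.grad(loss, [img_adv, txt_adv])
    # Ascend the loss in both modalities, then project back into each epsilon ball.
    img_adv = (img + (img_adv + alpha * g_img.sign() - img).clamp(-eps_img, eps_img)).clamp(0, 1).detach()
    txt_adv = (txt + (txt_adv + alpha * g_txt.sign() - txt).clamp(-eps_txt, eps_txt)).detach()

print("clean prediction:      ", model(img, txt).argmax(-1).item())
print("adversarial prediction:", model(img_adv, txt_adv).argmax(-1).item())
```

The sketch only illustrates the summary's central idea that perturbing both modalities at once gives an attacker two coupled levers on the cross-modal fusion process; the paper's actual method additionally uses attention relevance scores to decide which features to disrupt, which is not modeled here.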
Keywords
» Artificial intelligence » Attention » Language understanding » Pretraining » Transformer