


Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

by Jiwei Guan, Tianyu Ding, Longbing Cao, Lei Pan, Chen Wang, Xi Zheng

First submitted to arXiv on: 24 Aug 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Recent vision-language pretraining (VLP) transformers have shown exceptional performance across various multimodal tasks, but their adversarial robustness has not been thoroughly investigated. This paper examines the vulnerability of recent VLP transformers and proposes a novel Joint Multimodal Transformer Feature Attack (JMTFA) that concurrently introduces perturbations in both the visual and textual modalities under white-box settings. JMTFA targets attention relevance scores to disrupt important features within each modality, generating adversarial samples that lead to erroneous model predictions. Experiments show higher attack success rates than existing baselines on vision-language understanding and reasoning downstream tasks. Notably, the paper finds that the textual modality strongly influences the complex fusion process inside VLP transformers, and that model size shows no apparent relationship to adversarial robustness under the proposed attacks. These findings highlight a new dimension of adversarial robustness and underscore potential risks in the reliable deployment of multimodal AI systems.

Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper looks at how well vision-language pretraining (VLP) transformers hold up when their inputs are manipulated. VLP transformers are very good at understanding images and text together, but they can be tricked into making mistakes if the inputs are altered in subtle ways. The researchers created a new way to test these models by adding carefully crafted fake details to both the images and the text, and found that this approach was very effective at making the models err. The study shows that we need to be careful when using VLP transformers, especially if they are going to be used for important tasks.
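
To make the idea of a joint multimodal attack concrete, below is a minimal sketch of a white-box, gradient-based loop that perturbs an image tensor and a continuous text-embedding tensor at the same time. It is an illustration only, not the authors' JMTFA implementation: the model interface, perturbation budgets, and step sizes are assumptions, and the sketch simply maximizes the task loss rather than JMTFA's attention-relevance objective.

    # Hypothetical sketch of a joint image/text-embedding perturbation loop
    # (illustration only; not the authors' JMTFA code).
    import torch

    def joint_pgd_attack(model, image, text_emb, label,
                         eps_img=8 / 255, eps_txt=0.05,
                         step_img=2 / 255, step_txt=0.01, steps=10):
        """Perturb image pixels and text embeddings together (white-box)."""
        loss_fn = torch.nn.CrossEntropyLoss()
        adv_img = image.clone().detach()
        adv_txt = text_emb.clone().detach()

        for _ in range(steps):
            adv_img.requires_grad_(True)
            adv_txt.requires_grad_(True)

            # Assumed interface: the model takes (image, text embeddings)
            # and returns task logits.
            logits = model(adv_img, adv_txt)
            loss = loss_fn(logits, label)
            grad_img, grad_txt = torch.autograd.grad(loss, (adv_img, adv_txt))

            with torch.no_grad():
                # Gradient-ascent step on both modalities at once.
                adv_img = adv_img + step_img * grad_img.sign()
                adv_txt = adv_txt + step_txt * grad_txt.sign()
                # Project back into each modality's perturbation budget.
                adv_img = image + torch.clamp(adv_img - image, -eps_img, eps_img)
                adv_img = adv_img.clamp(0.0, 1.0)
                adv_txt = text_emb + torch.clamp(adv_txt - text_emb, -eps_txt, eps_txt)

        return adv_img.detach(), adv_txt.detach()

The main point the sketch captures is that each modality gets its own perturbation budget and both are updated from the same backward pass, which is what distinguishes a joint multimodal attack from perturbing the image alone.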

Keywords

» Artificial intelligence  » Attention  » Language understanding  » Pretraining  » Transformer