Summary of Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis, by Cristian-Alexandru Botocan et al.
Sparse vs Contiguous Adversarial Pixel Perturbations in Multimodal Models: An Empirical Analysis
by Cristian-Alexandru Botocan, Raphael Meier, Ljiljana Dolamic
First submitted to arXiv on: 25 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | High Difficulty Summary Read the original abstract here |
| Medium | GrooveSquid.com (original content) | Medium Difficulty Summary The proposed research assesses the robustness of multimodal models against adversarial examples, which is crucial for ensuring user safety. L0-norm perturbation attacks are applied to preprocessed input images in a black-box setup, covering both targeted and untargeted misclassification. The study evaluates four multimodal models and two unimodal DNNs, varying the spatial positioning of the perturbed pixels (sparse vs. contiguous). The results show that unimodal DNNs are more robust than multimodal models, and that multimodal models with CNN-based image encoders are the most vulnerable to these attacks. A minimal illustrative sketch of such a sparse pixel attack appears after this table. |
| Low | GrooveSquid.com (original content) | Low Difficulty Summary This research is important because it helps ensure that multimodal models are safe for users. It does this by crafting special kinds of attacks on these models. The study uses a type of attack called an L0-norm perturbation attack and tests it on different types of models, including some that can both recognize images and understand text. The results show that the models that only look at pictures deal with these attacks better than the ones that look at both pictures and words. |
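
The summary above describes the attack only at a high level, so here is a minimal, hedged sketch of what an untargeted, black-box L0-norm (sparse) pixel perturbation could look like. The `predict` callable, the pixel budget `num_pixels`, and the random-search loop are illustrative assumptions, not the paper's actual attack algorithm.

```python
# Minimal sketch of an untargeted, black-box, sparse (L0) pixel attack.
# Assumption: `predict` is any black-box classifier that returns a label
# for an HxWx3 uint8 image; the paper's real attack may search differently.
import numpy as np

def sparse_pixel_attack(image, predict, num_pixels=10, max_queries=1000, seed=None):
    """Randomly perturb `num_pixels` pixels per trial until the label changes."""
    rng = np.random.default_rng(seed)
    original_label = predict(image)
    h, w, c = image.shape
    for _ in range(max_queries):
        candidate = image.copy()
        # Pick `num_pixels` pixel locations (the L0 budget) and overwrite them
        # with random colors; all other pixels stay untouched.
        ys = rng.integers(0, h, size=num_pixels)
        xs = rng.integers(0, w, size=num_pixels)
        candidate[ys, xs] = rng.integers(0, 256, size=(num_pixels, c), dtype=image.dtype)
        if predict(candidate) != original_label:
            return candidate  # untargeted misclassification achieved
    return None  # attack failed within the query budget
```

A contiguous variant would place the same pixel budget in a single patch of adjacent pixels rather than scattering it, which is the other spatial arrangement the paper compares.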
Keywords
* Artificial intelligence * CNN * Encoder