Summary of Towards Robust Vision Transformer via Masked Adaptive Ensemble, by Fudong Lin et al.
Towards Robust Vision Transformer via Masked Adaptive Ensemble
by Fudong Lin, Jiadong Lou, Xu Yuan, Nian-Feng Tzeng
First submitted to arXiv on: 22 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | A novel Vision Transformer (ViT) architecture is proposed to improve robustness against adversarial attacks while preserving high standard accuracy. Adversarial training (AT) injects adversarial examples into the training data, but it incurs a trade-off between standard accuracy and robustness. The proposed architecture comprises a detector and a classifier bridged by an adaptive ensemble, motivated by the finding that Guided Backpropagation aids the detection of adversarial examples. A novel Multi-head Self-Attention (MSA) mechanism strengthens the detector's ability to sniff out adversarial examples, while the classifier uses two encoders to extract visual representations from clean images and adversarial examples, respectively. The adaptive ensemble then adjusts the proportion of these two representations for the final classification. This design yields a better trade-off between standard accuracy and robustness: experiments show that the architecture achieves 90.3% standard accuracy and 49.8% adversarial robustness on CIFAR-10. |
| Low | GrooveSquid.com (original content) | A new kind of computer vision model called the Vision Transformer (ViT) is being developed to be more resistant to attacks that try to trick it. Right now, ViTs can be easily fooled by fake data, but this new model tries to fix that problem. The researchers found a way to make the model better at detecting whether its input data is fake. They also created a special kind of attention mechanism that helps the model focus on the right parts of the image. The new model is both very good at recognizing normal images and very good at resisting attacks. It's like having a superpower! |
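The adaptive-ensemble idea from the medium summary can be sketched in a few lines: the detector scores how likely an input is adversarial, and that score sets the mixing proportion between the two encoders' representations. This is an illustrative sketch only, not the authors' implementation; the function name, feature shapes, and linear blending rule are all assumptions.

```python
def adaptive_ensemble(z_clean, z_adv, p_adv):
    """Blend two feature vectors by the detector's adversarial score.

    z_clean: representation from the encoder trained on clean images
    z_adv:   representation from the encoder trained on adversarial examples
    p_adv:   detector's estimated probability that the input is adversarial

    Hypothetical blending rule: a convex combination weighted by p_adv,
    so clean-looking inputs lean on the clean encoder and suspicious
    inputs lean on the adversarial encoder.
    """
    if not 0.0 <= p_adv <= 1.0:
        raise ValueError("p_adv must be a probability in [0, 1]")
    return [(1.0 - p_adv) * c + p_adv * a for c, a in zip(z_clean, z_adv)]


# An input the detector considers mostly clean (p_adv = 0.1) keeps
# 90% of the clean encoder's features in the blended representation.
blended = adaptive_ensemble([1.0, 0.0], [0.0, 1.0], p_adv=0.1)
```

The blended representation would then be passed to the classification head; the point is simply that the mixing weight is input-dependent rather than fixed, which is what lets the architecture trade off standard accuracy against robustness per example.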
Keywords
» Artificial intelligence » Attention » Backpropagation » Classification » Self attention » Vision transformer » ViT