
Summary of On the Surprising Effectiveness of Attention Transfer for Vision Transformers, by Alexander C. Li et al.


On the Surprising Effectiveness of Attention Transfer for Vision Transformers

by Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen

First submitted to arxiv on: 14 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Neural and Evolutionary Computing (cs.NE)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract, available on arXiv.

Medium Difficulty Summary (original content by GrooveSquid.com)
This research paper investigates whether Vision Transformers (ViT) need the representations learned during pre-training to achieve strong downstream performance. The conventional wisdom holds that pre-training improves downstream performance because the model learns useful features. This paper finds, however, that the pre-trained features and representations themselves are not essential: using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. The authors introduce a simple method called attention transfer, in which only the attention patterns of a pre-trained teacher ViT are transferred to a student, either by copying or by distilling the attention maps. Because attention transfer lets the student learn its own features, ensembling the student with a fine-tuned teacher further improves accuracy on ImageNet. The study systematically explores when these findings hold, including distribution shift settings where attention transfer underperforms fine-tuning. The authors hope this exploration provides a better understanding of what pre-training accomplishes and offers a useful alternative to the standard practice of fine-tuning.
Low Difficulty Summary (original content by GrooveSquid.com)
Pre-training Vision Transformers (ViT) is thought to help models learn useful representations. But do we really need those features? This study found that just using the attention patterns from pre-training is enough for models to learn good features from scratch and do well on a task. The authors showed this by creating a simple way to transfer the attention patterns from one model to another, called attention transfer. This lets the second model learn its own features instead of relying on the first model's features. The study also looked at how well this approach works in different situations and found that it can fall behind regular fine-tuning when there is a big change between the training and testing data.
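To make the idea concrete, here is a minimal NumPy sketch of the two variants the summaries describe: "attention copy", where the student's token mixing uses the teacher's attention map directly while the student supplies its own value features, and "attention distillation", where the student computes its own attention map and is penalized for diverging from the teacher's. The single-head setup, shapes, and variable names are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention; returns output and the map
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d))
    return attn @ v, attn

rng = np.random.default_rng(0)
n_tokens, d = 4, 8

# "Teacher": a pre-trained model's queries/keys define the attention map,
# i.e. how information should flow between tokens.
q_t = rng.normal(size=(n_tokens, d))
k_t = rng.normal(size=(n_tokens, d))
_, teacher_attn = attention(q_t, k_t, np.zeros((n_tokens, d)))

# Attention copy: the student learns its own value features, but token
# mixing is routed by the copied teacher attention map.
v_s = rng.normal(size=(n_tokens, d))
student_out = teacher_attn @ v_s

# Attention distillation: the student computes its own attention map and
# a divergence loss (here row-wise KL, summed) pulls it toward the teacher's.
q_s = rng.normal(size=(n_tokens, d))
k_s = rng.normal(size=(n_tokens, d))
_, student_attn = attention(q_s, k_s, v_s)
kl_loss = np.sum(teacher_attn * np.log(teacher_attn / student_attn))
```

In a real training loop the student's value (and output) projections would be optimized by the downstream task loss, with the copied or distilled attention maps constraining only the inter-token routing.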

Keywords

» Artificial intelligence  » Attention  » Fine tuning  » ViT