
Summary of ScaleKD: Strong Vision Transformers Could Be Excellent Teachers, by Jiawei Fan et al.


ScaleKD: Strong Vision Transformers Could Be Excellent Teachers

by Jiawei Fan, Chao Li, Xiaolong Liu, Anbang Yao

First submitted to arXiv on: 11 Nov 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here
Medium Difficulty Summary (original content by GrooveSquid.com)
This paper investigates whether well-pretrained vision transformer (ViT) models can serve as teachers that advance cross-architecture knowledge distillation (KD) research. The authors highlight the importance of bridging differences in feature computing paradigms, model scales, and knowledge densities between teacher and student. They propose a simple and effective KD method called ScaleKD, which combines three components: a cross attention projector, dual-view feature mimicking, and teacher parameter perception. The method trains student backbones spanning a variety of architectures on image classification datasets and achieves state-of-the-art distillation performance. For example, with a well-pretrained Swin-L as the teacher, ScaleKD obtains top-1 accuracies ranging from 75.15% to 85.53% for different student models trained on ImageNet-1K from scratch. The method also shows scalable properties: scaling up the teacher model or its pre-training dataset leads to increasing gains in the student models. Finally, ScaleKD transfers well to the downstream MS-COCO and ADE20K datasets.
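As a rough, hypothetical illustration of how a cross attention projector and a feature-mimicking distillation loss might be wired up in PyTorch, the sketch below uses assumed module names, token counts, and a simple MSE-based loss. It is not the authors' ScaleKD implementation (which is more involved and also includes teacher parameter perception); it only sketches the general idea of projecting student features into the teacher's feature space and matching them from two views.

```python
# Illustrative sketch only: names, shapes, and the loss form are assumptions,
# not the authors' ScaleKD code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttentionProjector(nn.Module):
    """Project student features into the teacher's feature space by letting
    learnable queries (one per teacher token) attend over student tokens."""

    def __init__(self, student_dim, teacher_dim, num_teacher_tokens, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_teacher_tokens, teacher_dim))
        self.kv_proj = nn.Linear(student_dim, teacher_dim)
        self.attn = nn.MultiheadAttention(teacher_dim, num_heads, batch_first=True)

    def forward(self, student_tokens):
        # student_tokens: (B, N_s, student_dim), e.g. flattened CNN feature maps
        kv = self.kv_proj(student_tokens)                       # (B, N_s, teacher_dim)
        q = self.queries.unsqueeze(0).expand(kv.size(0), -1, -1)
        projected, _ = self.attn(q, kv, kv)                     # (B, N_t, teacher_dim)
        return projected


def feature_mimicking_loss(projected_student, teacher_tokens):
    """Match features both token-wise and after global pooling; a simple
    stand-in for dual-view feature mimicking, not the paper's exact loss."""
    direct = F.mse_loss(projected_student, teacher_tokens)
    pooled = F.mse_loss(projected_student.mean(dim=1), teacher_tokens.mean(dim=1))
    return direct + pooled


if __name__ == "__main__":
    B, N_s, N_t = 2, 196, 197              # assumed token counts for illustration
    student_dim, teacher_dim = 512, 1024
    projector = CrossAttentionProjector(student_dim, teacher_dim, N_t)
    student_feats = torch.randn(B, N_s, student_dim)
    teacher_feats = torch.randn(B, N_t, teacher_dim)   # from a frozen ViT teacher
    loss = feature_mimicking_loss(projector(student_feats), teacher_feats)
    print(loss.item())
```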
Low Difficulty Summary (original content by GrooveSquid.com)
This paper explores how to use really good “teacher” models to help train other models. The authors want to know whether these teacher models can improve training for many different types of models. They propose a new way to do this, called ScaleKD, which helps student models learn faster and better. The method combines three ideas: using attention to match the student’s features to the teacher’s, mimicking the teacher’s features from two different views, and letting the student be aware of the teacher’s parameters. The authors show that their method can train student models that are really good at recognizing images. They also find that the stronger the teacher model, the bigger the improvement in the student models.

Keywords

* Artificial intelligence  * Attention  * Cross attention  * Distillation  * Image classification  * Knowledge distillation  * Teacher model  * Vision transformer  * ViT