Summary of DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets, by Harsh Rangwani et al.


DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

by Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

First submitted to arXiv on: 3 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces DeiT-LT, a scheme for training Vision Transformers (ViTs) from scratch on long-tailed datasets. Unlike Convolutional Neural Networks (CNNs), ViTs lack strong built-in inductive biases and therefore typically require large amounts of pre-training data. DeiT-LT performs efficient distillation from a CNN teacher via out-of-distribution (strongly augmented) images and re-weights the distillation loss to emphasize tail classes. This induces local, CNN-like features in the early ViT blocks, improving generalization on tail classes. To mitigate overfitting, the authors additionally distill from a flat CNN teacher, which leads the student to learn low-rank, generalizable features across all ViT blocks. DeiT-LT thus learns features for both majority and minority classes through distinct classification and distillation tokens within the same ViT architecture. The approach is evaluated on datasets ranging from the small-scale CIFAR-10 LT to the large-scale iNaturalist-2018.
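To make the loss re-weighting concrete, below is a minimal PyTorch sketch of a hard-label distillation loss (the style used in DeiT) whose per-sample terms are scaled by inverse class frequency, so that samples the teacher assigns to tail classes count more. The function name and the exact inverse-frequency weighting are illustrative assumptions, not the paper's precise formulation, whose re-weighting schedule may differ.

```python
import torch
import torch.nn.functional as F

def reweighted_hard_distillation_loss(dist_logits, teacher_logits, class_counts):
    """Hard-label distillation re-weighted toward tail classes.

    dist_logits:    (B, C) student logits from the ViT's distillation token
    teacher_logits: (B, C) logits from the frozen CNN teacher
    class_counts:   (C,)   number of training images per class

    Assumption: inverse class-frequency weights normalized to mean 1;
    the paper's exact weighting scheme may differ.
    """
    # Hard distillation (DeiT-style): the target is the teacher's argmax class.
    teacher_labels = teacher_logits.argmax(dim=-1)

    # Inverse-frequency weights: rare (tail) classes get larger weights.
    weights = class_counts.float().reciprocal()
    weights = weights * (weights.numel() / weights.sum())

    # Per-sample cross-entropy against the teacher's labels, then re-weight.
    per_sample = F.cross_entropy(dist_logits, teacher_labels, reduction="none")
    return (weights[teacher_labels] * per_sample).mean()

# Example: 3 classes with a long-tailed split of 5000 / 500 / 50 images.
counts = torch.tensor([5000, 500, 50])
student_logits = torch.randn(8, 3)  # from the distillation (DIST) token
cnn_logits = torch.randn(8, 3)      # from the CNN teacher
loss = reweighted_hard_distillation_loss(student_logits, cnn_logits, counts)
```

In DeiT-style training, this term would be combined with a standard cross-entropy loss on the classification (CLS) token against the ground-truth labels.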
Low Difficulty Summary (original content by GrooveSquid.com)
The paper shows how to train Vision Transformers (ViTs) from scratch without needing enormous amounts of data. This is important because many computer vision tasks require lots of training data, which can be hard to collect, and real-world datasets are often imbalanced, with only a few examples for rare classes. The authors propose training the ViT to imitate a CNN teacher on heavily augmented images, while adjusting the training loss so the model pays more attention to rare, harder-to-predict classes. They also use a special kind of teacher network that helps prevent overfitting. The approach is tested on datasets of different sizes, from small to large.

Keywords

* Artificial intelligence  * CNN  * Distillation  * Generalization  * Overfitting  * ViT