Summary of DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets, by Harsh Rangwani et al.


DeiT-LT Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets

by Harsh Rangwani, Pradipto Mondal, Mayank Mishra, Ashish Ramayee Asokan, R. Venkatesh Babu

First submitted to arXiv on: 3 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper introduces DeiT-LT, a scheme for training Vision Transformers (ViTs) from scratch on long-tailed datasets. Unlike Convolutional Neural Networks (CNNs), ViTs lack strong built-in inductive biases and therefore typically require large amounts of pre-training data. DeiT-LT performs efficient distillation from a CNN teacher via out-of-distribution (strongly augmented) images and re-weights the distillation loss to emphasize tail classes. This induces local, CNN-like features in the early ViT blocks, improving generalization on tail classes. To mitigate overfitting, the authors additionally distill from a flat CNN teacher, which leads the student to learn low-rank, generalizable features across all ViT blocks. DeiT-LT thus learns features for both majority and minority classes through distinct classification and distillation tokens within the same ViT architecture. The approach is evaluated on datasets ranging from the small-scale CIFAR-10 LT to the large-scale iNaturalist-2018.
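To make the loss re-weighting concrete, below is a minimal PyTorch sketch of a hard-label distillation loss (the style used in DeiT) whose per-sample terms are scaled by inverse class frequency, so that samples the teacher assigns to tail classes count more. The function name and the exact inverse-frequency weighting are illustrative assumptions, not the paper's precise formulation, whose re-weighting schedule may differ.

```python
import torch
import torch.nn.functional as F

def reweighted_hard_distillation_loss(dist_logits, teacher_logits, class_counts):
    """Hard-label distillation re-weighted toward tail classes.

    dist_logits:    (B, C) student logits from the ViT's distillation token
    teacher_logits: (B, C) logits from the frozen CNN teacher
    class_counts:   (C,)   number of training images per class

    Assumption: inverse class-frequency weights normalized to mean 1;
    the paper's exact weighting scheme may differ.
    """
    # Hard distillation (DeiT-style): the target is the teacher's argmax class.
    teacher_labels = teacher_logits.argmax(dim=-1)

    # Inverse-frequency weights: rare (tail) classes get larger weights.
    weights = class_counts.float().reciprocal()
    weights = weights * (weights.numel() / weights.sum())

    # Per-sample cross-entropy against the teacher's labels, then re-weight.
    per_sample = F.cross_entropy(dist_logits, teacher_labels, reduction="none")
    return (weights[teacher_labels] * per_sample).mean()

# Example: 3 classes with a long-tailed split of 5000 / 500 / 50 images.
counts = torch.tensor([5000, 500, 50])
student_logits = torch.randn(8, 3)  # from the distillation (DIST) token
cnn_logits = torch.randn(8, 3)      # from the CNN teacher
loss = reweighted_hard_distillation_loss(student_logits, cnn_logits, counts)
```

In DeiT-style training, this term would be combined with a standard cross-entropy loss on the classification (CLS) token against the ground-truth labels.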
Low Difficulty Summary (original content by GrooveSquid.com)
The paper shows how to train Vision Transformers (ViTs) from scratch without needing enormous amounts of data. This is important because many computer vision tasks require lots of training data, which can be hard to collect, and real-world datasets are often imbalanced, with only a few examples for rare classes. The authors propose training the ViT to imitate a CNN teacher on heavily augmented images, while adjusting the training loss so the model pays more attention to rare, harder-to-predict classes. They also use a special kind of teacher network that helps prevent overfitting. The approach is tested on datasets of different sizes, from small to large.

Keywords

* Artificial intelligence  * CNN  * Distillation  * Generalization  * Overfitting  * ViT