Summary of Pre-training of Lightweight Vision Transformers on Small Datasets with Minimally Scaled Images, by Jen Hong Tan
First submitted to arXiv on: 6 Feb 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here.
Medium | GrooveSquid.com (original content) | The paper investigates whether a lightweight Vision Transformer (ViT) can outperform convolutional neural networks (CNNs) such as ResNet on small image datasets. It finds that a pure ViT, pre-trained with a masked autoencoder technique and minimal image scaling, can indeed achieve superior performance. Experiments on CIFAR-10 and CIFAR-100 use ViT models with fewer than 3.65 million parameters and a multiply-accumulate (MAC) count below 0.27G, and show that this lightweight transformer-based architecture reaches state-of-the-art performance without scaling up the datasets' images. A minimal code sketch of this pre-training setup follows the table.
Low | GrooveSquid.com (original content) | A team of researchers looked at how well a special kind of computer program, called a Vision Transformer, works with small images. They wanted to know whether it could match or beat the programs commonly used for image recognition. The answer is yes: by training the Vision Transformer in a special way on small image datasets, the researchers got great results without using a lot of computing power.
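To make the approach concrete, below is a minimal, self-contained PyTorch sketch of masked-autoencoder (MAE) pre-training for a tiny ViT on 32×32 inputs, i.e. CIFAR-sized images with no upscaling. All hyperparameters here (4×4 patches, a 192-dim 6-layer encoder, 75% mask ratio, a small 2-layer decoder, zero-initialized learned positional embeddings) are illustrative assumptions chosen to stay under the paper's ~3.65M-parameter budget, not the authors' exact configuration.

```python
# Minimal MAE-style pre-training sketch for a tiny ViT on 32x32 images.
# Hyperparameters are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    def __init__(self, img=32, patch=4, dim=192, depth=6, heads=3,
                 dec_dim=96, dec_depth=2, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.n = (img // patch) ** 2                      # 64 tokens for 32x32
        self.embed = nn.Linear(3 * patch * patch, dim)    # patchify + project
        self.pos = nn.Parameter(torch.zeros(1, self.n, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                               batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        self.enc_to_dec = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.dec_pos = nn.Parameter(torch.zeros(1, self.n, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, heads, dec_dim * 4,
                                               batch_first=True, norm_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, dec_depth)
        self.head = nn.Linear(dec_dim, 3 * patch * patch)  # reconstruct pixels

    def patchify(self, x):
        # (B, 3, H, W) -> (B, n_patches, 3 * patch * patch)
        p = self.patch
        B, C, H, W = x.shape
        x = x.unfold(2, p, p).unfold(3, p, p)             # B,C,H/p,W/p,p,p
        return x.permute(0, 2, 3, 1, 4, 5).reshape(B, self.n, C * p * p)

    def forward(self, imgs):
        patches = self.patchify(imgs)
        tokens = self.embed(patches) + self.pos
        B, N, _ = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        # Random per-sample shuffle; the first `keep` patches stay visible.
        ids = torch.rand(B, N, device=imgs.device).argsort(dim=1)
        vis_ids = ids[:, :keep]
        vis = torch.gather(tokens, 1,
                           vis_ids.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        latent = self.encoder(vis)                        # encode visible only
        # Decoder input: encoded visible tokens + mask tokens, then unshuffle.
        dec = self.enc_to_dec(latent)
        full = torch.cat([dec, self.mask_token.expand(B, N - keep, -1)], dim=1)
        restore = ids.argsort(dim=1)
        full = torch.gather(full, 1,
                            restore.unsqueeze(-1).expand(-1, -1, full.size(-1)))
        recon = self.head(self.decoder(full + self.dec_pos))
        # MSE loss on masked patches only, as in MAE.
        mask = torch.ones(B, N, device=imgs.device)
        mask.scatter_(1, vis_ids, 0.0)                    # 1 = masked patch
        loss = ((recon - patches) ** 2).mean(-1)
        return (loss * mask).sum() / mask.sum()

model = TinyMAE()
print(sum(p.numel() for p in model.parameters()) / 1e6, "M params")
opt = torch.optim.AdamW(model.parameters(), lr=1.5e-4)
imgs = torch.rand(8, 3, 32, 32)   # stand-in batch; use CIFAR-10 in practice
opt.zero_grad()
loss = model(imgs)
loss.backward()
opt.step()
print("reconstruction loss:", loss.item())
```

The key property of this setup is that the encoder only ever processes the ~25% of patches left visible, which is what keeps MAE pre-training cheap. For real use, replace the random batch with a torchvision CIFAR-10 loader, train for many epochs, then discard the decoder and fine-tune the encoder with a classification head.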
Keywords
- Artificial intelligence
- Encoder
- ResNet
- Transformer
- Vision Transformer
- ViT