Summary of GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets, by Dongjing Shan et al.
GvT: A Graph-based Vision Transformer with Talking-Heads Utilizing Sparsity, Trained from Scratch on Small Datasets
by Dongjing Shan, Guiqiang Chen
First submitted to arXiv on: 7 Apr 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
| --- | --- | --- |
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper proposes a novel Graph-based Vision Transformer (GvT) to close the performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets. GvT combines graph convolutional projection with graph-pooling: queries and keys are computed through a graph convolutional projection based on spatial adjacency matrices, and dot-product attention is then applied to generate the values. To overcome the low-rank bottleneck in attention heads, the paper employs talking-heads attention based on bilinear pooled features and sparse selection of attention tensors (see the sketch after the table). Additionally, graph-pooling is applied between intermediate blocks to reduce the number of tokens and aggregate semantic information more effectively. Experimental results show that GvT produces outcomes comparable or superior to deep convolutional networks and surpasses vision transformers without pre-training on large datasets. |
| Low | GrooveSquid.com (original content) | The paper tries to make Vision Transformers work better when there isn't much data. The authors come up with a new approach called the Graph-based Vision Transformer (GvT), which uses special kinds of graph calculations to help the model learn. GvT has two main parts: it works out what matters based on how things are connected, and then uses that information to figure out what's important. The idea is that this helps the model learn well without needing lots of data. When they tested it, GvT worked as well as or better than big convolutional models, and even beat other vision transformers that hadn't been pre-trained on large datasets! |
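
For readers who want a more concrete picture, here is a minimal PyTorch-style sketch of the two ideas named in the medium summary: a graph convolutional projection for queries and keys, and a talking-heads-style mixing of attention maps across heads. This is not the authors' implementation; the class name, the `adj` adjacency input, and the simple linear head-mixing are illustrative assumptions, and the paper's bilinear pooling, sparse selection of attention tensors, and graph-pooling between blocks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GvTStyleAttention(nn.Module):
    """Hypothetical sketch of a GvT-style attention block.

    Queries/keys come from a graph convolutional projection (tokens are
    aggregated over a spatial adjacency matrix before the linear maps),
    followed by dot-product attention with a talking-heads-style linear
    mixing of attention logits across heads.
    """

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.h = num_heads
        self.dh = dim // num_heads
        self.scale = self.dh ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)
        # Talking-heads: learned mixing of attention logits across heads,
        # intended to ease the low-rank bottleneck of independent heads.
        self.head_mix = nn.Linear(num_heads, num_heads, bias=False)

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        # (batch, tokens, dim) -> (batch, heads, tokens, head_dim)
        b, n, _ = t.shape
        return t.view(b, n, self.h, self.dh).transpose(1, 2)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (batch, tokens, dim) token embeddings
        # adj: (tokens, tokens) row-normalized spatial adjacency matrix
        #      (assumed input, built from the image-grid neighborhood)
        x_agg = adj @ x                       # graph convolutional aggregation
        q = self._split(self.q_proj(x_agg))   # queries from aggregated tokens
        k = self._split(self.k_proj(x_agg))   # keys from aggregated tokens
        v = self._split(self.v_proj(x))       # values from the raw tokens
        logits = (q @ k.transpose(-2, -1)) * self.scale   # (b, h, n, n)
        # Mix logits across the head dimension before softmax.
        logits = self.head_mix(logits.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape)
        return self.out_proj(out)


# Usage sketch: 2 images, a 4x4 grid of 16 tokens, embedding dim 64.
# torch.eye is a placeholder adjacency; a real one would connect each
# token to its grid neighbors and be row-normalized.
tokens = torch.randn(2, 16, 64)
adj = torch.eye(16)
block = GvTStyleAttention(dim=64, num_heads=4)
out = block(tokens, adj)  # -> shape (2, 16, 64)
```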
Keywords
» Artificial intelligence » Attention » Dot product » Token » Vision transformer