Summary of Efficient Visual Transformer by Learnable Token Merging, By Yancheng Wang et al.
Efficient Visual Transformer by Learnable Token Merging
by Yancheng Wang, Yingzhen Yang
First submitted to arXiv on: 21 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a novel, compact transformer block called LTM-Transformer, which performs token merging in a learnable scheme. LTM-Transformer reduces the FLOPs and inference time of visual transformers while maintaining or improving prediction accuracy. Replacing the transformer blocks in popular networks such as MobileViT, EfficientViT, ViT-S/16, and Swin-T with LTM blocks yields compact and efficient visual transformers with comparable or better accuracy. The design is motivated by reducing the Information Bottleneck (IB): a novel upper bound on the IB loss is derived, and the mask module in each LTM block generates token-merging masks that reduce this bound. Experimental results on computer vision tasks demonstrate the effectiveness of the proposed approach. |
Low | GrooveSquid.com (original content) | The paper is about a new way to build transformers, which are important tools in machine learning. Transformers help computers understand language and images. The new approach, called LTM-Transformer, makes transformers smaller and faster while keeping them just as accurate. It works by merging similar pieces of information together, so the model needs less computation. This is useful because it lets computers do tasks like recognizing images more efficiently. |
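To make the token-merging idea concrete, here is a minimal sketch of soft, learnable token merging in NumPy. This is not the paper's LTM block (which derives its mask from an IB-loss bound); it only illustrates the general mechanism the summary describes: a learnable mask maps N input tokens to a smaller set of M merged tokens. The function name, shapes, and random logits below are illustrative assumptions.

```python
import numpy as np

def merge_tokens(tokens, merge_logits):
    """Merge N input tokens into M output tokens with a soft mask.

    tokens:       (N, D) array of token embeddings.
    merge_logits: (M, N) learnable logits; a softmax over the N axis
                  makes each output token a convex combination of inputs.
    """
    # Numerically stable softmax over the input-token axis; rows sum to 1.
    shifted = merge_logits - merge_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    mask = exp / exp.sum(axis=1, keepdims=True)  # (M, N)
    return mask @ tokens                          # (M, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))  # e.g. 14x14 patch tokens, dim 64
logits = rng.standard_normal((49, 196))  # merge 196 tokens down to 49
merged = merge_tokens(tokens, logits)
print(merged.shape)  # (49, 64)
```

In a real network the merge logits would be produced by a small mask module conditioned on the tokens and trained end to end, so the model learns which tokens to combine; shrinking the token count this way is what cuts FLOPs in subsequent attention layers.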
Keywords
» Artificial intelligence » Inference » Machine learning » Mask » Token » Transformer » ViT