Summary of Efficient Visual Transformer by Learnable Token Merging, By Yancheng Wang et al.
Efficient Visual Transformer by Learnable Token Merging
by Yancheng Wang, Yingzhen Yang
First submitted to arXiv on: 21 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a novel, compact transformer block called LTM-Transformer, which performs token merging in a learnable scheme. LTM-Transformer reduces the FLOPs and inference time of visual transformers while maintaining or improving prediction accuracy. Replacing the transformer blocks in popular networks such as MobileViT, EfficientViT, ViT-S/16, and Swin-T with LTM blocks yields compact and efficient visual transformers with comparable or better accuracy. The design is motivated by reducing the Information Bottleneck (IB): a novel upper bound on the IB loss is derived, and the mask module in each LTM block generates token-merging masks that reduce this bound. Experimental results on computer vision tasks demonstrate the effectiveness of the proposed approach. |
Low | GrooveSquid.com (original content) | The paper is about a new way to build transformers, which are important tools in machine learning. Transformers help computers understand language and images. The new approach, called LTM-Transformer, makes transformers smaller and faster while keeping them just as accurate. It works by merging similar pieces of information together, so the model needs less computation. This is useful because it lets computers do tasks like recognizing images more efficiently. |
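To make the token-merging idea concrete, here is a minimal sketch of soft, learnable token merging in NumPy. This is not the paper's LTM block (which derives its mask from an IB-loss bound); it only illustrates the general mechanism the summary describes: a learnable mask maps N input tokens to a smaller set of M merged tokens. The function name, shapes, and random logits below are illustrative assumptions.

```python
import numpy as np

def merge_tokens(tokens, merge_logits):
    """Merge N input tokens into M output tokens with a soft mask.

    tokens:       (N, D) array of token embeddings.
    merge_logits: (M, N) learnable logits; a softmax over the N axis
                  makes each output token a convex combination of inputs.
    """
    # Numerically stable softmax over the input-token axis; rows sum to 1.
    shifted = merge_logits - merge_logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    mask = exp / exp.sum(axis=1, keepdims=True)  # (M, N)
    return mask @ tokens                          # (M, D)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((196, 64))  # e.g. 14x14 patch tokens, dim 64
logits = rng.standard_normal((49, 196))  # merge 196 tokens down to 49
merged = merge_tokens(tokens, logits)
print(merged.shape)  # (49, 64)
```

In a real network the merge logits would be produced by a small mask module conditioned on the tokens and trained end to end, so the model learns which tokens to combine; shrinking the token count this way is what cuts FLOPs in subsequent attention layers.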
Keywords
» Artificial intelligence » Inference » Machine learning » Mask » Token » Transformer » ViT