Summary of Accelerating Transformers with Spectrum-Preserving Token Merging, by Hoai-Chau Tran et al.
Accelerating Transformers with Spectrum-Preserving Token Merging
by Hoai-Chau Tran, Duy M. H. Nguyen, Duy M. Nguyen, Trung-Tin Nguyen, Ngan Le, Pengtao Xie, Daniel Sonntag, James Y. Zou, Binh T. Nguyen, Mathias Niepert
First submitted to arXiv on: 25 May 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The Transformer architecture is a crucial component of state-of-the-art models for vision and language tasks such as GPT and LLaVa, and increasing its throughput is an important goal. Recent strategies merge token representations within the Transformer to reduce computational and memory requirements while maintaining accuracy. However, existing methods such as Bipartite Soft Matching (BSM) suffer from drawbacks like sensitivity to the token-splitting strategy and damage to informative tokens in later layers. This paper presents PiToMe, a novel paradigm that prioritizes the preservation of informative tokens using an energy score metric: large clusters of similar tokens are identified as high-energy candidates for merging, while smaller, more unique clusters are considered low-energy and preserved (a simplified sketch of this idea appears after the table). Experimental results show that PiToMe saves 40-60% of the base models' FLOPs while achieving superior off-the-shelf performance on image classification (ViT-MAE-H), image-text retrieval (CLIP on Flickr30k), and visual question answering (LLaVa-7B). Furthermore, PiToMe is shown theoretically to preserve intrinsic spectral properties of the original token space under mild conditions. |
| Low | GrooveSquid.com (original content) | This paper is about making a computer program called the Transformer run more efficiently. The Transformer is used in many state-of-the-art models for tasks like image recognition and language processing. To make it run faster, researchers have tried merging similar pieces of the input (tokens) inside the model. However, this approach has some problems, such as damaging important information or being sensitive to how the tokens are split up. This paper presents a new method called PiToMe that prioritizes preserving important information while reducing computational requirements. The results show that PiToMe can make the model run faster while keeping strong accuracy on tasks like image classification, image-text retrieval, and visual question answering. |
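
For readers who want a concrete picture of the energy-score idea from the medium summary, here is a minimal, hypothetical PyTorch sketch. The function name `merge_tokens_by_energy`, the ReLU-with-margin energy definition, and the merge-by-averaging rule are illustrative assumptions made for this summary; they approximate the spirit of PiToMe rather than reproduce the authors' actual algorithm or code.

```python
# Illustrative sketch only: an energy-score-based token merging step in the
# spirit of PiToMe. The energy definition and the merge-by-averaging rule are
# simplifying assumptions, not the paper's exact method.
import torch
import torch.nn.functional as F


def merge_tokens_by_energy(x: torch.Tensor, r: int, margin: float = 0.9) -> torch.Tensor:
    """x: (batch, num_tokens, dim). Merge the r highest-energy tokens into
    their most similar kept token; returns (batch, num_tokens - r, dim)."""
    b, n, d = x.shape

    # Pairwise cosine similarity between all tokens.
    x_norm = F.normalize(x, dim=-1)                      # (b, n, d)
    sim = x_norm @ x_norm.transpose(1, 2)                # (b, n, n)

    # Energy score: tokens sitting in large clusters of near-duplicates
    # (many similarities above the margin) get high energy -> merge candidates.
    energy = F.relu(sim - margin).mean(dim=-1)           # (b, n)

    # Split tokens: the r highest-energy tokens are merged, the rest are kept.
    merge_idx = energy.topk(r, dim=-1).indices           # (b, r)
    keep_mask = torch.ones(b, n, dtype=torch.bool, device=x.device)
    batch_idx = torch.arange(b, device=x.device).unsqueeze(1)
    keep_mask[batch_idx, merge_idx] = False

    merged_batches = []
    for i in range(b):
        keep = x[i, keep_mask[i]]                        # (n - r, d)
        drop = x[i, ~keep_mask[i]]                       # (r, d)

        # Assign each high-energy token to its most similar kept token and
        # fold it in by averaging (a simple stand-in for soft matching).
        assign = (F.normalize(drop, dim=-1) @ F.normalize(keep, dim=-1).T).argmax(dim=-1)
        out = keep.clone()
        counts = torch.ones(keep.size(0), device=x.device)
        out.index_add_(0, assign, drop)
        counts.index_add_(0, assign, torch.ones(drop.size(0), device=x.device))
        merged_batches.append(out / counts.unsqueeze(-1))

    return torch.stack(merged_batches)                   # (b, n - r, d)


if __name__ == "__main__":
    tokens = torch.randn(2, 197, 768)                    # e.g. a ViT-B token sequence
    reduced = merge_tokens_by_energy(tokens, r=50)
    print(reduced.shape)                                 # torch.Size([2, 147, 768])
```

In the actual method, merging is applied inside the Transformer layers and the high-energy tokens are combined via a soft-matching step; the per-batch averaging loop above is only a compact stand-in to keep the sketch short.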
Keywords
» Artificial intelligence » GPT » Image classification » MAE » Token » Transformer » ViT