Summary of Patch Ranking: Efficient CLIP by Learning to Rank Local Patches, by Cheng-En Wu et al.
Patch Ranking: Efficient CLIP by Learning to Rank Local Patches
by Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado
First submitted to arXiv on: 22 Sep 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium-difficulty and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract, available on its arXiv listing. |
Medium | GrooveSquid.com (original content) | The paper proposes a new approach to pruning patch tokens in CLIP (Contrastive Language-Image Pre-training) models, addressing the high computational cost of their Vision Transformer (ViT) backbones. The proposed method, called “Golden Ranking,” uses a greedy search to identify the subset of patch tokens that best preserves performance. To compensate for potential accuracy losses from token pruning, learnable visual tokens are introduced to restore and enhance model performance. The study investigates pruning tokens within ViT backbones and reduces the number of patch tokens by 40% with only a 0.3% average accuracy loss across seven datasets (a toy sketch of the greedy selection idea follows this table). This work lays the groundwork for building more computationally efficient multimodal models without sacrificing performance, a key challenge in deploying advanced vision-language models. |
Low | GrooveSquid.com (original content) | Contrastive image-text pre-trained models are very good at many tasks, but they can be slow because they need a lot of computing power. People have tried to speed them up by removing some of the information they process, but this has not worked well for every kind of task. This paper proposes two new ideas: finding the most important image patches and adding special learnable “visual tokens” that help the model make up for what it loses. With these ideas, the researchers removed 40% of the image patches the model processes while losing almost no accuracy. |
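To make the “Golden Ranking” idea from the medium summary more concrete, here is a minimal sketch of greedy patch-token selection: starting from an empty set, it repeatedly adds the patch token whose inclusion most increases image-text similarity, until a target budget is reached (e.g. keeping 60% of tokens, i.e. pruning 40%). Everything here is illustrative rather than the authors’ implementation: the mean-pooling “encoder,” the function names (`golden_ranking`, `encode_with_subset`), the random embeddings, and the keep ratio are assumptions made for the sketch, whereas the actual method prunes tokens inside a CLIP ViT backbone and additionally learns visual tokens to compensate for pruning.

```python
# Toy sketch of greedy patch-token selection in the spirit of "Golden Ranking".
# NOT the paper's implementation: the encoder, scoring, and keep ratio are assumptions.
import torch


def encode_with_subset(patch_tokens: torch.Tensor, keep_idx: list) -> torch.Tensor:
    """Toy 'encoder': mean-pool the kept patch tokens into a unit-norm image embedding."""
    pooled = patch_tokens[keep_idx].mean(dim=0)
    return pooled / pooled.norm()


def golden_ranking(patch_tokens: torch.Tensor,
                   text_embedding: torch.Tensor,
                   keep_ratio: float = 0.6) -> list:
    """Greedy forward selection of patch tokens that maximize cosine similarity
    between the pooled image embedding and a unit-norm text embedding."""
    num_tokens = patch_tokens.shape[0]
    budget = max(1, int(round(keep_ratio * num_tokens)))
    kept = []
    remaining = set(range(num_tokens))

    while len(kept) < budget:
        best_idx, best_score = None, float("-inf")
        for idx in remaining:
            # Score the candidate subset by its image-text similarity.
            score = encode_with_subset(patch_tokens, kept + [idx]) @ text_embedding
            if score > best_score:
                best_idx, best_score = idx, score.item()
        kept.append(best_idx)
        remaining.remove(best_idx)
    return kept


if __name__ == "__main__":
    torch.manual_seed(0)
    patches = torch.randn(49, 512)          # e.g. a 7x7 grid of patch tokens
    text = torch.randn(512)
    text = text / text.norm()
    keep = golden_ranking(patches, text, keep_ratio=0.6)  # keep ~60%, prune ~40%
    print(f"kept {len(keep)} of {patches.shape[0]} patch tokens:", sorted(keep))
```

Greedy forward selection is only an approximation of an exhaustive subset search, but it keeps the example short and mirrors the ranking intuition described in the summaries above.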
Keywords
» Artificial intelligence » Pruning » Token » Vision Transformer » ViT