
Summary of Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection, by Wei Ye et al.


Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

by Wei Ye, Chaoya Jiang, Haiyang Xu, Chenhao Ye, Chenliang Li, Ming Yan, Shikun Zhang, Songfang Huang, Fei Huang

First submitted to arXiv on: 11 Jan 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper, each written at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read whichever version suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)
Vision Transformers (ViTs) are increasingly used in large-scale Vision and Language Pre-training (VLP) models for their effectiveness. However, these models still face computational inefficiencies due to lengthy visual sequences. To address this challenge, we propose an efficient VLP approach called TRIPS, which progressively reduces the visual sequence using a text-guided patch-selection layer. This layer dynamically computes text-dependent visual attention, identifying attentive image tokens and fusing inattentive ones in an end-to-end fashion. TRIPS adds no extra parameters and generalizes to most ViT-based VLP models. We incorporate TRIPS into three representative VLP models, covering single-stream, dual-stream, and generative paradigms, and conduct extensive experiments on five widely used multi-modal benchmark datasets. Our results show that TRIPS delivers a 40% speedup while maintaining competitive or superior performance on downstream tasks.

Low Difficulty Summary (written by GrooveSquid.com; original content)
Imagine trying to understand a picture by carefully studying every tiny piece of it, even the parts that don't matter. That's what large-scale Vision and Language Pre-training (VLP) models have been doing with images, and it takes too much time! To solve this problem, we created an efficient VLP approach called TRIPS. It keeps only the parts of each image that are relevant to the text, such as a caption, and merges the unimportant parts together. This makes training and using these models much faster. We tested TRIPS with three different model designs and five different datasets. Our results show that it's about 40% faster without sacrificing performance!

Keywords

» Artificial intelligence  » Attention  » Multi-modal  » ViT