
Summary of Too Large; Data Reduction for Vision-Language Pre-Training, by Alex Jinpeng Wang et al.


Too Large; Data Reduction for Vision-Language Pre-Training

by Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou

First submitted to arxiv on: 31 May 2023

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: None



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (the paper's original abstract, written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper tackles two problems in large-scale Vision-Language Pre-Training (VLP) datasets: severe image-text misalignment and high redundancy. The authors propose TL;DR, an efficient algorithm that compresses an existing large VLP dataset into a small, high-quality set. It works in two steps: a codebook-based encoder-decoder captioner first selects representative samples, and new captions are then generated for the selected samples to reduce text-image misalignment while preserving their uniqueness. The resulting compressed set can serve as an alternative pre-training dataset and significantly speeds up the time-consuming pre-training process. For instance, TL;DR compresses CC3M to 24% and YFCC15M to 16.7% of their original size. Experiments with three popular VLP models on seven downstream tasks show that training on the compressed data can match or even outperform training on the full-scale datasets. A rough code sketch of this two-step pipeline appears after the summaries below.
Low Difficulty Summary (original content by GrooveSquid.com)
This paper is all about making it easier to use huge amounts of pictures and words together. Right now, there are big problems when we try to match pictures with the right words, and there’s also a lot of repetition in these huge datasets. To fix this, the authors created a new way to shrink these large datasets down into smaller ones that still have high-quality information. This makes it faster and more efficient to train computers to understand pictures and words. The new method is called TL;DR, and it works by selecting important samples and rewriting captions to make them match up better. The results are very promising – the authors were able to shrink some datasets down to 24% or 16.7% of their original size while still getting good results.
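To make the two-step recipe above more concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes precomputed joint image-text embeddings, stands in a plain k-means codebook for the paper's codebook-based encoder-decoder captioner, and uses a placeholder `recaption` function where the paper generates new captions. All function names and parameters are illustrative.

```python
# Minimal sketch of a TL;DR-style two-step data reduction pipeline.
# NOTE: this is NOT the authors' code. It assumes precomputed joint
# image-text embeddings, uses plain k-means as a stand-in codebook,
# and `recaption` is a placeholder for the paper's caption generator.

import numpy as np

def build_codebook(embeddings: np.ndarray, num_codes: int, iters: int = 20, seed: int = 0):
    """Learn a small codebook over sample embeddings with plain k-means."""
    rng = np.random.default_rng(seed)
    codebook = embeddings[rng.choice(len(embeddings), num_codes, replace=False)]
    for _ in range(iters):
        # Assign each sample to its nearest code, then move each code
        # to the mean of its assigned samples.
        dists = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_codes):
            members = embeddings[assign == k]
            if len(members):
                codebook[k] = members.mean(axis=0)
    # Final assignment against the last codebook.
    dists = np.linalg.norm(embeddings[:, None, :] - codebook[None, :, :], axis=-1)
    return codebook, dists.argmin(axis=1)

def select_representatives(embeddings: np.ndarray, codebook: np.ndarray,
                           assign: np.ndarray, per_code: int):
    """Step 1: keep only the samples closest to each code (most representative)."""
    keep = []
    for k in range(len(codebook)):
        idx = np.flatnonzero(assign == k)
        if len(idx) == 0:
            continue
        d = np.linalg.norm(embeddings[idx] - codebook[k], axis=-1)
        keep.extend(idx[np.argsort(d)[:per_code]].tolist())
    return sorted(keep)

def recaption(sample_id: int) -> str:
    """Step 2 placeholder: a captioner would rewrite the caption here so it
    matches the image better while staying distinctive."""
    return f"generated caption for sample {sample_id}"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(3000, 64))   # stand-in for joint image-text embeddings
    codebook, assign = build_codebook(embeddings, num_codes=50)
    kept = select_representatives(embeddings, codebook, assign, per_code=15)
    reduced = {i: recaption(i) for i in kept}
    print(f"kept {len(kept)} of {len(embeddings)} samples "
          f"({100 * len(kept) / len(embeddings):.1f}%)")
```

In this toy configuration (50 codes, at most 15 samples kept per code), roughly a quarter of the samples survive, which is in the same ballpark as the 24% compression ratio the paper reports for CC3M. The real method's selection and re-captioning are model-based rather than this simple k-means stand-in.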

Keywords

* Artificial intelligence
* Encoder decoder
* Pretraining