
Summary of The Neglected Tails in Vision-Language Models, by Shubham Parashar et al.


The Neglected Tails in Vision-Language Models

by Shubham Parashar, Zhiqiu Lin, Tian Liu, Xiangjue Dong, Yanan Li, Deva Ramanan, James Caverlee, Shu Kong

First submitted to arXiv on: 23 Jan 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors): the paper’s original abstract, available on arXiv.

Medium Difficulty Summary (written by GrooveSquid.com, original content)
Vision-language models (VLMs) excel at zero-shot recognition, but their performance varies greatly across different visual concepts. The paper uses large language models (LLMs) to estimate how often each concept appears in the text of VLMs’ large-scale pretraining datasets, and finds that popular datasets such as LAION exhibit a long-tailed concept distribution that biases VLM performance toward common concepts. The study also shows that downstream applications of VLMs often fail to recognize or generate images of rare concepts. To mitigate this imbalance, the paper proposes REtrieval-Augmented Learning (REAL), which replaces a concept’s original name with its most frequent synonym found in the pretraining texts when building prompts, and trains a linear classifier on a small, class-balanced set of retrieved pretraining data. REAL outperforms the previous zero-shot state of the art (SOTA) while using 400x less storage and 10,000x less training time.
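
As a rough illustration of the two ideas described above (prompting with frequent synonyms, plus a linear classifier trained on a small balanced set of retrieved data), here is a minimal Python sketch. It assumes some CLIP-like VLM with frozen text and image encoders; embed_text, embed_image, the synonym table, and the retrieved examples are hypothetical placeholders (random features stand in for real embeddings), not the authors’ released code.

```python
# Minimal sketch of REAL's two ingredients; all names and data are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

DIM = 512  # embedding size of a typical CLIP-like model

def _random_unit_rows(n, seed):
    """Random unit-norm rows that stand in for real VLM embeddings."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, DIM))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def embed_text(texts):
    """Placeholder for a frozen VLM text encoder."""
    return _random_unit_rows(len(texts), seed=0)

def embed_image(images):
    """Placeholder for a frozen VLM image encoder."""
    return _random_unit_rows(len(images), seed=1)

# Hypothetical synonym table: for each concept, the synonym that appears most
# often in the VLM's pretraining text (the paper identifies such synonyms with LLMs).
frequent_synonym = {"night heron": "heron", "tailed frog": "frog"}
class_names = list(frequent_synonym)

# Ingredient 1: build zero-shot prompts from the frequent synonym instead of
# the rare canonical name, then classify by cosine similarity.
prompts = [f"a photo of a {frequent_synonym[c]}" for c in class_names]
text_feats = embed_text(prompts)

def zero_shot_predict(images):
    return (embed_image(images) @ text_feats.T).argmax(axis=1)

# Ingredient 2: retrieve a small, class-balanced set of pretraining images for
# each concept and fit a linear classifier on the frozen image features.
retrieved_images = [f"retrieved_{i}.jpg" for i in range(100)]    # hypothetical
retrieved_labels = np.repeat(np.arange(len(class_names)), 50)    # 50 per class
linear_head = LogisticRegression(max_iter=1000)
linear_head.fit(embed_image(retrieved_images), retrieved_labels)

test_images = ["query_a.jpg", "query_b.jpg"]
print("prompt-based:", zero_shot_predict(test_images))
print("linear probe:", linear_head.predict(embed_image(test_images)))
```

The sketch only shows the data flow; the paper’s actual method retrieves the balanced examples from the VLM’s pretraining data (e.g., LAION) and finds the frequent synonyms with an LLM.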
Low Difficulty Summary (written by GrooveSquid.com, original content)
This paper is about how well computers can recognize pictures they were never explicitly taught to recognize. The problem is that these computer models don’t do equally well on all types of pictures: some are easy to recognize, but others are hard or even impossible. The researchers found out why this happens by looking at the large datasets used to train the models. They discovered that some picture categories appear much more often than others in those datasets, so the models are naturally better at recognizing the common types of pictures and worse at the rare ones. To fix this problem, they came up with a new way to use the models called REtrieval-Augmented Learning (REAL). REAL describes each picture category using more common words, which makes the model perform much better.

Keywords

  • Artificial intelligence
  • Pretraining
  • Zero shot