Summary of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment, by Xin Xiao et al.

Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment

by Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo

First submitted to arXiv on: 28 May 2024

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

Summary difficulty	Written by	Summary
High	Paper authors	High Difficulty Summary Read the original abstract here
Medium	GrooveSquid.com (original content)	Medium Difficulty Summary Existing approaches to image-text modality alignment in Vision Language Models (VLMs) treat every text token equally, which can lead to sub-optimal cross-modal alignment because it over-emphasizes tokens that are only weakly correlated with the image. To address this issue, the authors propose a re-weighting strategy called Contrastive ALignment (CAL). CAL measures the difference in prediction logits on each text token under contrasting image inputs and uses that difference to give visually correlated tokens more weight during training. CAL is shown to consistently improve different types of VLMs across various benchmark datasets, with minimal additional computational overhead compared to alternative data scaling strategies. The authors also release their code on GitHub.
Low	GrooveSquid.com (original content)	Low Difficulty Summary Imagine you’re trying to understand a picture better by reading the words that go with it. Right now, AI models that learn from pictures and text treat every word as equally important, even though only some of the words actually describe what’s in the picture. This paper suggests a new way to train these models by checking how strongly each word depends on the picture and paying more attention to the words that do. They call this new way Contrastive ALignment (CAL). CAL works better than the old way and can be used with different types of AI models that understand pictures and words.
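The re-weighting idea described in the medium difficulty summary can be illustrated with a small sketch. This is not the authors' released code. It assumes the contrast is between logits computed with and without the image, that each token's weight is the clamped logit difference on its ground-truth token, and that the loss is normalized by the sum of weights. The function names, the clamp bounds `lo` and `hi`, and the toy numbers are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_weights(logits_with_image, logits_without_image, targets,
                        lo=1.0, hi=5.0):
    """Per-token weights from the logit difference on the target token.

    Assumption: a token whose target logit rises when the image is present
    is treated as visually correlated. The bounds lo and hi are illustrative.
    """
    weights = []
    for with_img, without_img, t in zip(
            logits_with_image, logits_without_image, targets):
        delta = with_img[t] - without_img[t]
        weights.append(min(max(delta, lo), hi))  # clamp to [lo, hi]
    return weights


def weighted_nll(logits, targets, weights):
    """Weighted negative log-likelihood, normalized by the total weight."""
    total = 0.0
    for row, t, w in zip(logits, targets, weights):
        total += -w * log_softmax(row)[t]
    return total / sum(weights)


# Toy example: 2 text tokens, vocabulary of 3.
# Token 0 is predicted equally well with or without the image.
# Token 1 is only predicted well when the image is present.
with_image = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.1]]
without_image = [[2.0, 0.5, 0.1], [0.2, 0.0, 0.1]]
targets = [0, 1]

weights = contrastive_weights(with_image, without_image, targets)
weighted_loss = weighted_nll(with_image, targets, weights)
uniform_loss = weighted_nll(with_image, targets, [1.0, 1.0])
print(weights, weighted_loss, uniform_loss)
```

In this toy case the second token receives a larger weight than the first, so it contributes more to the training loss than it would under equal weighting.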

Keywords

* Artificial intelligence * Alignment * Logits * Token
