Summary of Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment, By Xin Xiao et al.
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
by Xin Xiao, Bohong Wu, Jiacong Wang, Chunyuan Li, Xun Zhou, Haoyuan Guo
First submitted to arXiv on: 28 May 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary The high difficulty version is the paper's original abstract, available on arXiv. |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary Existing approaches to image-text modality alignment in Vision Language Models (VLMs) weight every text token equally, which can lead to sub-optimal cross-modal alignment. To address this, the authors propose a re-weighting strategy called Contrastive ALignment (CAL), which prioritizes visually correlated tokens during training by using the difference in each text token's prediction logits under contrasting image inputs. CAL is shown to consistently improve different types of VLMs across various benchmark datasets, with minimal additional computational overhead compared to alternative data scaling strategies. The authors also release their code on GitHub. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Imagine you’re trying to understand a picture better by reading the words that go with it. Right now, AI models treat every word as equally important during training, even though only some words actually describe what is in the picture. This paper suggests a new way to train them: check how strongly each word depends on the picture, and pay more attention to the words that do. The authors call this Contrastive ALignment (CAL). CAL works better than the old way and can be used with different types of AI models that understand pictures and words. |
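
The re-weighting idea in the medium difficulty summary can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the clipping bounds `floor` and `ceiling`, and the choice to normalize by the sum of the weights are all assumptions made for this example. The summary only states that tokens are re-weighted by the difference in their prediction logits under contrasting image inputs.

```python
import numpy as np


def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def contrastive_token_weights(logits_with_image, logits_contrast, labels,
                              floor=1.0, ceiling=5.0):
    # Hypothetical helper: one weight per text token.
    # Both logit arrays have shape (num_tokens, vocab_size).
    idx = np.arange(len(labels))
    # Logit of the ground-truth token under each image input.
    with_image = logits_with_image[idx, labels]
    contrast = logits_contrast[idx, labels]
    # A token whose logit rises when the real image is shown is treated as
    # visually correlated. The clipping bounds are assumed, not from the source.
    return np.clip(with_image - contrast, floor, ceiling)


def weighted_nll(logits_with_image, labels, weights):
    # Token-level negative log-likelihood, averaged with the given weights.
    idx = np.arange(len(labels))
    nll = -log_softmax(logits_with_image)[idx, labels]
    return float((weights * nll).sum() / weights.sum())


# Toy example: two text tokens, vocabulary of three entries.
logits_with_image = np.array([[2.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
logits_contrast = np.array([[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
labels = np.array([0, 1])

weights = contrastive_token_weights(logits_with_image, logits_contrast, labels)
loss = weighted_nll(logits_with_image, labels, weights)
print(weights, loss)
```

In this toy case the first token's logit is unchanged by the image, so it keeps the floor weight of 1, while the second token's logit rises by 4 when the image is present and so receives weight 4. Passing an array of ones as `weights` recovers the usual equal-weight loss that the summary describes as the existing approach.
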
Keywords
» Artificial intelligence » Alignment » Logits » Token