Summary of Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder, by Siting Li et al.


Exploring How Generative MLLMs Perceive More Than CLIP with the Same Vision Encoder

by Siting Li, Pang Wei Koh, Simon Shaolei Du

First submitted to arXiv on: 7 Nov 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)

GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper explores the limitations of Contrastive Language-Image Pre-training (CLIP) models on visual reasoning tasks that require grounding compositionality, understanding spatial relationships, or capturing fine-grained details. While the vision encoder might initially seem to be to blame, the authors show that Generative Multimodal Large Language Models (MLLMs), another branch of Vision-Language Models (VLMs), achieve significantly higher accuracy on these tasks while using the same vision encoder and weights. The MLLMs' success is attributed to design choices such as the use of patch tokens, position embeddings, and prompt-based weighting. The study also reveals that enhancing the training data alone or applying a stronger text encoder does not suffice to solve these tasks, and that additional text tokens offer little benefit. Interestingly, when converted into CLIP-like encoders through contrastive fine-tuning (see the code sketch after the summaries), these MLLMs still outperform CLIP under the same evaluation protocol. The research highlights the importance of VLM architectural choices and suggests directions for improving the performance of CLIP-like contrastive VLMs.

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper is about how some AI models are good at understanding pictures and text together, but not as good at specific tasks that require understanding details in images. The researchers test a few different types of models and find that one type, called MLLMs, does better than the others on these tasks. The reason is the way MLLMs are designed to process both words and images. The study shows that just adding more training data or using a stronger text model doesn't make the weaker models work much better. And even when the MLLMs are converted back into the simpler CLIP-style form, they still get better results. This research helps us understand what makes some AI models better than others at certain tasks.
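
The medium summary above mentions converting MLLMs into CLIP-like encoders through contrastive fine-tuning. For readers unfamiliar with that objective, below is a minimal PyTorch sketch of a standard CLIP-style symmetric contrastive loss. It is illustrative only, not code from the paper, and the function name, embedding dimensions, and temperature value are assumptions.

```python
# Minimal sketch (not the authors' code) of the CLIP-style contrastive
# objective that "contrastive fine-tuning" refers to: paired image and text
# embeddings are pulled together on the diagonal and pushed apart elsewhere.
# Function name, shapes, and temperature are hypothetical choices.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # L2-normalize so the dot products below are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # The matching pair for each image/text sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Average the image-to-text and text-to-image cross-entropy terms.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)   # e.g. pooled vision-encoder features
    txt = torch.randn(8, 512)   # e.g. pooled text-encoder features
    print(clip_contrastive_loss(img, txt).item())
```

The symmetric form (image-to-text plus text-to-image cross-entropy) is what makes the resulting encoder usable for retrieval in either direction, which is how a fine-tuned MLLM can then be evaluated head-to-head against CLIP.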

Keywords

» Artificial intelligence  » Encoder  » Fine tuning  » Grounding  » Prompt