Summary of Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling, by Georgios Pantazopoulos et al.
Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling
by Georgios Pantazopoulos, Malvina Nikandrou, Alessandro Suglia, Oliver Lemon, Arash Eshghi
First submitted to arXiv on: 9 Sep 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This study replaces the Transformer with Mamba, a structured state space model (SSM), as the language backbone of Visual Language Models (VLMs) and compares the two under controlled conditions (a minimal sketch of this backbone swap follows the table). Testing models of up to 3B parameters, the authors find that Mamba-based VLMs outperform Transformer-based VLMs in captioning, question answering, and reading comprehension. However, Transformers perform better on visual grounding, and the gap widens with scale. To explain this, the authors propose two hypotheses: that task-agnostic visual encoding limits how the hidden state is updated, and that visual grounding is difficult for Mamba when framed as in-context multimodal retrieval. Their results show only minimal gains from task-aware encoding on grounding, while Transformers significantly outperform Mamba at in-context multimodal retrieval. Overall, Mamba performs well on tasks that rely on a summary of the image, but struggles when explicit information must be retrieved from the context. |
Low | GrooveSquid.com (original content) | This study looks at how well two different models understand pictures and generate text. The first model is called Mamba, which is a newer kind of model. The authors tested both models on lots of data and found that the Mamba model did better at tasks like writing captions and answering questions. However, when it came to visual tasks like pointing out specific objects in pictures, the other model (called a Transformer) did better. The authors think this might be because the Mamba model is good at understanding the main points of a picture, but struggles when it needs to find specific details. |
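
To make the comparison concrete, here is a minimal, hypothetical PyTorch sketch of the kind of backbone swap the medium summary describes: image patch features are projected into the language model's embedding space, prepended to the text tokens, and the joint sequence is processed by either a stack of Mamba (SSM) blocks or a stack of Transformer layers. This is not the paper's implementation; the class name `ToyVLM`, all dimensions, and the simple linear connector are illustrative, and the `Mamba` block comes from the third-party `mamba-ssm` package (which requires a CUDA GPU).

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm; kernels need a CUDA device


class ToyVLM(nn.Module):
    """Toy VLM sketch: visual tokens are prepended to text tokens and the joint
    sequence is run through an interchangeable backbone (Mamba or Transformer)."""

    def __init__(self, vocab_size=32000, d_model=512, n_layers=4,
                 patch_dim=768, backbone="mamba"):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Connector: project pre-extracted image patch features into the LM space.
        self.vision_proj = nn.Linear(patch_dim, d_model)
        if backbone == "mamba":
            # Structured state space (SSM) blocks.
            self.blocks = nn.ModuleList(
                [Mamba(d_model=d_model) for _ in range(n_layers)]
            )
        else:
            # Self-attention blocks of comparable width (causal masking omitted
            # for brevity; the models compared in the paper are autoregressive).
            self.blocks = nn.ModuleList(
                [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                 for _ in range(n_layers)]
            )
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, token_ids):
        # patch_feats: (batch, num_patches, patch_dim); token_ids: (batch, seq_len)
        vis = self.vision_proj(patch_feats)
        txt = self.text_embed(token_ids)
        x = torch.cat([vis, txt], dim=1)  # image tokens first, then text tokens
        for block in self.blocks:
            x = block(x)
        return self.lm_head(x)  # per-position vocabulary logits
```

In this toy setup the only difference between the two variants is `ToyVLM(backbone="mamba")` versus `ToyVLM(backbone="transformer")`, which gestures at the controlled comparison described above, where everything but the sequence-mixing backbone is held fixed.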
Keywords
» Artificial intelligence » Grounding » Question answering