Summary of ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models, by Rohan Wadhawan et al.
ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models
by Rohan Wadhawan, Hritik Bansal, Kai-Wei Chang, Nanyun Peng
First submitted to arXiv on: 24 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | Many real-world tasks require an agent to reason jointly over text and visual objects, a capability referred to as context-sensitive text-rich visual reasoning. The paper introduces ConTextual, a novel dataset of human-crafted instructions that require context-sensitive reasoning over text-rich images. The authors evaluate 14 foundation models, including GPT-4V, Gemini-Pro-Vision, and LLaVA-Next, and establish a human performance baseline. They observe a significant performance gap of 30.8% between the current best-performing Large Multimodal Model, GPT-4V, and humans. The models struggle to interpret time-related data and infographics, but are proficient at comprehending abstract visual contexts such as memes and quotes. Factors contributing to the poor performance include imprecise visual perception and hallucinations. |
| Low | GrooveSquid.com (original content) | This paper creates a special kind of dataset that helps computers understand how text and pictures work together. It’s called ConTextual, and it contains many examples where humans have written instructions for understanding images full of text. The researchers tested 14 different computer models to see which ones were best at following these instructions. They found that the current best model was still much worse than a human, although it did better in some areas, like recognizing memes. The paper also finds that computers struggle to understand things like clocks and graphs. |
Keywords
* Artificial intelligence
* Gemini
* GPT