Summary of Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types, by Neelabh Sinha et al.
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
by Neelabh Sinha, Vinija Jain, Aman Chadha
First submitted to arXiv on: 14 Sep 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a novel end-to-end framework for evaluating Vision-Language Models (VLMs) in practical settings across a range of applications. It introduces VQA360, a comprehensive dataset annotated with task types, application domains, and knowledge types. The authors also develop GoEval, a multimodal evaluation metric that achieves a 56.71% correlation with human judgments. Experiments with state-of-the-art VLMs reveal that no single model excels universally, underscoring the importance of choosing the right model for a given application. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform the rest, while open-source models such as InternVL-2-8B and CogVLM-2-Llama-3-19B demonstrate competitive strengths and offer additional advantages. |
Low | GrooveSquid.com (original content) | The paper helps solve the problem of evaluating Vision-Language Models (VLMs) for practical use. It makes choosing the right model for a task easier by creating a new dataset called VQA360 and an evaluation tool called GoEval. The researchers tested many different models and found that no single model is best for all tasks. They also found that proprietary models generally perform better, but open-source models can still be very good choices. |
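The core takeaway of the summaries above, that the best VLM depends on a query's task type, application domain, and knowledge type, can be pictured as a per-category model-selection step. The sketch below is illustrative only, not the authors' released code: the `records` list, the `goeval_score` field, and the category labels are hypothetical placeholders standing in for VQA360-style annotations and GoEval scores.

```python
from collections import defaultdict

# Hypothetical evaluation records, one per (model, question) pair.
# Each question is assumed to carry task-type, domain, and knowledge-type
# annotations (as in VQA360) and each answer a GoEval-style score.
records = [
    {"model": "Gemini-1.5-Pro", "task": "recognition", "domain": "medical",
     "knowledge": "commonsense", "goeval_score": 0.82},
    {"model": "InternVL-2-8B", "task": "recognition", "domain": "medical",
     "knowledge": "commonsense", "goeval_score": 0.79},
    # ... more records ...
]

def best_model_per_category(records):
    """Average scores per (task, domain, knowledge) bucket and return the
    top-scoring model for each bucket."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for r in records:
        bucket = (r["task"], r["domain"], r["knowledge"])
        sums[bucket][r["model"]] += r["goeval_score"]
        counts[bucket][r["model"]] += 1

    best = {}
    for bucket, model_sums in sums.items():
        averages = {m: model_sums[m] / counts[bucket][m] for m in model_sums}
        best[bucket] = max(averages, key=averages.get)
    return best

print(best_model_per_category(records))
```

This mirrors the paper's conclusion as described in the summaries: rather than picking one model globally, the recommendation is conditioned on the annotated categories of the queries an application will actually see.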
Keywords
» Artificial intelligence » Gemini » GPT » Llama