Summary of ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling, by Siming Yan et al.
ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling
by Siming Yan, Min Bai, Weifeng Chen, Xiong Zhou, Qixing Huang, Li Erran Li
First submitted to arXiv on: 9 Feb 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary | 
|---|---|---|
| High | Paper authors | Read the original abstract here | 
| Medium | GrooveSquid.com (original content) | By leveraging the capabilities of large language models (LLMs) combined with image perception, recent large vision language models (LVLMs) have demonstrated exceptional visual reasoning abilities. However, the generated text often lacks accurate grounding in the visual input, leading to errors such as hallucinating nonexistent scene elements, missing significant parts of the scene, and inferring incorrect attributes or relationships between objects. To address these issues, this paper introduces ViGoR (Visual Grounding Through Fine-Grained Reward Modeling), a novel framework that utilizes fine-grained reward modeling to enhance the visual grounding of LVLMs over pre-trained baselines. This improvement is achieved efficiently using human evaluations and automated methods. The effectiveness of ViGoR is demonstrated through various evaluation methods and benchmarks, making it a valuable contribution to the field. | 
| Low | GrooveSquid.com (original content) | Imagine having computers that can understand images and describe them in words. Recent advancements have made this possible, but there’s still room for improvement. Sometimes, these computers make mistakes when describing what they see, like inventing things that aren’t really there or leaving out important details. To fix this problem, the authors of this paper created a new way to help computers better understand images and describe them accurately. They called it ViGoR. This method uses rewards to guide the computer’s learning process, allowing it to improve without needing as much human supervision. The authors tested their approach on many images and showed that it works well. | 
Keywords
* Artificial intelligence
* Grounding




