Summary of Aligning Modalities in Vision Large Language Models via Preference Fine-tuning, by Yiyang Zhou et al.
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
by Yiyang Zhou, Chenhang Cui, Rafael Rafailov, Chelsea Finn, Huaxiu Yao
First submitted to arXiv on: 18 Feb 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | Recent advancements in Vision Large Language Models (VLLMs) have led to significant progress on various tasks. These models merge pre-trained vision models with large language models and require joint training on image-language pairs to align the learned representations. However, this procedure can cause hallucinations, where the model gives answers that don’t accurately reflect the image, even when the core LLM is factual and the vision backbone has complete representations. To tackle this issue, the authors frame it as an alignment problem and propose POVID, which generates the needed preference (feedback) data with AI models rather than human annotators. Ground-truth instructions serve as the preferred responses, and dispreferred responses are produced in two stages: by prompting GPT-4V to inject plausible hallucinations into correct answers, and by distorting images to trigger the model’s inherent hallucination behavior (an illustrative code sketch of this pairing follows the table). Because the automated pipeline relies on neither human data collection nor a perfect expert model, it is scalable. Experiments show that POVID reduces hallucinations while improving performance across standard benchmarks, outperforming prior approaches.
Low | GrooveSquid.com (original content) | Imagine trying to teach a computer to understand images and text together. It sounds like a great idea, but sometimes the computer gets confused and gives answers that don’t match the picture. This problem is called “hallucination”. To fix it, researchers created a new way to train these computers, called POVID. They used real, correct answers as examples of what the computer should say, and then generated intentionally flawed answers designed to trigger the computer’s tendency to make things up. By teaching the computer to prefer the correct answers over the flawed ones, they reduced hallucinations and made it better at understanding images and text.
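To make the two-stage idea above more concrete, here is a minimal, hypothetical Python sketch of how such preference data might be assembled and scored with a DPO-style preference loss. It is illustrative only and not taken from the paper’s code: the noise level, the beta value, the example texts, and the placeholder log-probabilities are all assumptions; a real run would score each response with the fine-tuned VLLM and a frozen reference copy of it.

```python
import math
import numpy as np
from PIL import Image

def distort_image(img: Image.Image, noise_std: float = 25.0) -> Image.Image:
    """Add Gaussian pixel noise so the model is more likely to hallucinate
    when describing the image. noise_std=25.0 is a made-up setting."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, noise_std, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def build_preference_pair(image_id, instruction, ground_truth, hallucinated):
    """Package one (preferred, dispreferred) example for preference fine-tuning."""
    return {
        "image": image_id,
        "prompt": instruction,
        "chosen": ground_truth,    # ground-truth answer kept as the preferred response
        "rejected": hallucinated,  # e.g. a GPT-4V-injected hallucination, or the
                                   # model's own answer on a distorted image
    }

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective for a single pair (beta=0.1 is hypothetical)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# --- toy usage ---------------------------------------------------------------
# Distort a dummy 8x8 RGB image (stands in for a real training image).
dummy = Image.fromarray(np.zeros((8, 8, 3), dtype=np.uint8))
noisy = distort_image(dummy)

# Build one preference pair with invented texts.
pair = build_preference_pair(
    image_id="example_0001",
    instruction="Describe the image.",
    ground_truth="A dog is sitting on a red couch.",
    hallucinated="A dog and a cat are sitting on a red couch.",
)

# Placeholder log-probabilities; real values would come from the policy model and
# a frozen reference model scoring each response given the image and prompt.
print(round(dpo_loss(-12.0, -10.5, -11.8, -11.9), 4))
```

The key point the sketch tries to capture is that the dispreferred side is produced automatically (hallucination injection or image distortion), so no human labeling is needed; the preference loss then pushes the model toward the ground-truth answer and away from the hallucinated one.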
Keywords
- Artificial intelligence
- Alignment
- GPT
- Hallucination
- Prompting