
Summary of Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models, by Shengzhi Li et al.


Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

by Shengzhi Li, Rongyu Lin, Shichao Pei

First submitted to arxiv on: 16 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
In this research paper, the authors address the degradation of multi-modal large language models (MLLMs) after they are trained on visual-question-answering (VQA) datasets. These VQA datasets lack the diversity and complexity of the original text instruction datasets, which leads to a decline in the model’s language capabilities. To mitigate this effect, the authors collect a lightweight VQA preference dataset and investigate several alignment algorithms, including supervised fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. The results show that DPO is the most effective, surpassing even the base language model’s instruction-following capability with a score of 6.73 on MT-Bench, compared to Vicuna’s 6.57 and LLaVA’s 5.99. This improvement in textual instruction-following correlates with a boost in visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with only a minimal alignment tax on visual knowledge benchmarks. Overall, the authors propose a distillation-based multi-modal alignment model that restores and boosts an MLLM’s language capability after visual instruction tuning.
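Since the summary hinges on Direct Preference Optimization, the sketch below illustrates the standard DPO loss on paired preference data. It is a minimal, illustrative example only: the function name, tensor names (policy_chosen_logps, ref_chosen_logps, etc.), and the beta value are assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed log-probabilities of a response
    under either the trainable policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for the preferred and dispreferred responses.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps

    # DPO minimizes -log(sigmoid(beta * (chosen_margin - rejected_margin))).
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage: random log-probabilities stand in for real model outputs.
if __name__ == "__main__":
    batch = 4
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(loss.item())
```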
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper is about fixing a problem in multi-modal language models (MLLMs). When these models are trained on visual question-answering datasets, they become worse at following plain text instructions. The authors collect a small amount of preference data and test different methods to see which one works best. They find that one method, called Direct Preference Optimization (DPO), is very effective at making the model better at following instructions. This improvement also helps the model perform better on visual tasks, like answering questions about images. Overall, this research could help make MLLMs more useful in real-world applications.

Keywords

  • Artificial intelligence
  • Alignment
  • Distillation
  • Fine tuning
  • Instruction tuning
  • Language model
  • Multi modal
  • Optimization
  • Question answering
  • Supervised