
Summary of Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models, by Shengzhi Li et al.


Multi-modal Preference Alignment Remedies Degradation of Visual Instruction Tuning on Language Models

by Shengzhi Li, Rongyu Lin, Shichao Pei

First submitted to arxiv on: 16 Feb 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
In this research paper, the authors address the degradation of multi-modal large language models (MLLMs) after they are trained on visual-question-answering (VQA) datasets. These VQA datasets lack the diversity and complexity of the original text instruction datasets, which leads to a decline in the model’s language capabilities. To mitigate this effect, the authors collect a lightweight VQA preference dataset and investigate several alignment algorithms, including supervised fine-tuning, rejection sampling, Direct Preference Optimization (DPO), and SteerLM. The results show that DPO is the most effective, surpassing even the base language model’s instruction-following capability with a score of 6.73 on MT-Bench, compared to Vicuna’s 6.57 and LLaVA’s 5.99. This improvement in textual instruction-following correlates with a boost in visual instruction performance (+4.9% on MM-Vet, +6% on LLaVA-Bench), with only a minimal alignment tax on visual knowledge benchmarks. Overall, the authors propose a distillation-based multi-modal alignment model that restores and boosts an MLLM’s language capability after visual instruction tuning.
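Since the summary hinges on Direct Preference Optimization, the sketch below illustrates the standard DPO loss on paired preference data. It is a minimal, illustrative example only: the function name, tensor names (policy_chosen_logps, ref_chosen_logps, etc.), and the beta value are assumptions for exposition, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective over a batch of (chosen, rejected) response pairs.

    Each argument is a 1-D tensor of summed log-probabilities of a response
    under either the trainable policy or the frozen reference model.
    """
    # Log-ratio of policy to reference for the preferred and dispreferred responses.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps

    # DPO minimizes -log(sigmoid(beta * (chosen_margin - rejected_margin))).
    logits = beta * (chosen_margin - rejected_margin)
    return -F.logsigmoid(logits).mean()

# Toy usage: random log-probabilities stand in for real model outputs.
if __name__ == "__main__":
    batch = 4
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(loss.item())
```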
Low Difficulty Summary (written by GrooveSquid.com, original content)
This research paper is about fixing a problem in multi-modal language models (MLLMs). When these models are trained on visual question-answering datasets, they become worse at following plain text instructions. The authors collect a small amount of preference data and test different methods to see which one works best. They find that one method, called Direct Preference Optimization (DPO), is very effective at making the model better at following instructions. This improvement also helps the model perform better on visual tasks, like answering questions about images. Overall, this research could help make MLLMs more useful in real-world applications.

Keywords

  • Artificial intelligence
  • Alignment
  • Distillation
  • Fine tuning
  • Instruction tuning
  • Language model
  • Multi modal
  • Optimization
  • Question answering
  • Supervised