
Summary of Wings: Learning Multimodal LLMs without Text-only Forgetting, by Yi-Kai Zhang et al.


Wings: Learning Multimodal LLMs without Text-only Forgetting

by Yi-Kai Zhang, Shiyin Lu, Yang Li, Yanqing Ma, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, De-Chuan Zhan, Han-Jia Ye

First submitted to arXiv on: 5 Jun 2024

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (original GrooveSquid.com content)
This paper presents Wings, a multimodal large language model (MLLM) that performs well in both text-only dialogue and multimodal comprehension. The authors highlight that MLLMs catastrophically forget text-only instructions once they are fine-tuned on multimodal inputs, and they attribute this to an attention shift from the text before an image to the text after it. To mitigate this, they introduce complementary visual and textual learners that operate alongside the main attention mechanism, allowing the model to balance its focus between visual and textual content (see the illustrative sketch after these summaries). For efficiency, the learners are built with a Low-Rank Residual Attention (LoRRA) architecture. Experimental results show that Wings outperforms equally scaled MLLMs on both text-only and visual question-answering tasks, with superior performance on a newly constructed Interleaved Image-Text (IIT) benchmark.
Low Difficulty Summary (original GrooveSquid.com content)
This research paper is about a new kind of artificial intelligence called Wings. It can understand and respond to both written text and images. The authors found that similar AI models forget what they learned from text alone once they are also trained on images. Their solution adds two new parts to the model: one for images and one for text. These help the AI focus equally on visual and textual information. The results show that Wings answers questions better than similarly sized AI models, whether the questions involve only text or a mix of text and images.
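
Illustrative sketch. The medium-difficulty summary describes complementary visual and textual learners that sit alongside the main attention and are built with low-rank residual attention. The sketch below is a rough, hypothetical illustration of how such learners and a per-token router might be wired into a transformer block; it is not the authors' implementation, and all names here (LowRankResidualAttention, WingsBlockSketch, router) and the routing scheme are assumptions rather than details taken from the paper.

```python
# Hypothetical sketch only (not the Wings authors' code): low-rank residual
# learners added alongside a main attention block, as described in the summary.
import torch
import torch.nn as nn


class LowRankResidualAttention(nn.Module):
    """LoRRA-style learner (assumed form): low-rank query/key/value projections
    whose output is added as a residual to the main attention stream."""

    def __init__(self, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.q = nn.Linear(hidden_dim, rank, bias=False)
        self.k = nn.Linear(hidden_dim, rank, bias=False)
        self.v = nn.Linear(hidden_dim, rank, bias=False)
        self.out = nn.Linear(rank, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # hidden: token states of the main stream; context: visual or textual tokens
        scores = self.q(hidden) @ self.k(context).transpose(-1, -2)
        attn = torch.softmax(scores / self.q.out_features ** 0.5, dim=-1)
        return self.out(attn @ self.v(context))


class WingsBlockSketch(nn.Module):
    """One block with complementary visual and textual learners. `main_block`
    stands in for the existing LLM attention layer, which is left unchanged;
    a learned router balances the two residual signals per token."""

    def __init__(self, main_block: nn.Module, hidden_dim: int, rank: int = 16):
        super().__init__()
        self.main_block = main_block
        self.visual_learner = LowRankResidualAttention(hidden_dim, rank)
        self.textual_learner = LowRankResidualAttention(hidden_dim, rank)
        self.router = nn.Linear(hidden_dim, 2)

    def forward(self, hidden, visual_tokens, text_tokens):
        main_out = self.main_block(hidden)
        weights = torch.softmax(self.router(hidden), dim=-1)  # per-token balance
        residual = (weights[..., :1] * self.visual_learner(hidden, visual_tokens)
                    + weights[..., 1:] * self.textual_learner(hidden, text_tokens))
        return main_out + residual
```

In this sketch the main attention path is left untouched, which reflects the summary's point that the learners operate alongside the existing mechanism rather than replacing it; the low-rank projections keep the added parameter count small.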

Keywords

» Artificial intelligence  » Attention  » Large language model  » Question answering