Summary of Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning, by Shibo Jie et al.
Memory-Space Visual Prompting for Efficient Vision-Language Fine-Tuning
by Shibo Jie, Yehui Tang, Ning Ding, Zhi-Hong Deng, Kai Han, Yunhe Wang
First submitted to arXiv on: 9 May 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | Large vision-language (VL) models are typically built in two steps: visual features from a pre-trained vision encoder are projected into the input space of a pre-trained language model as visual prompts, and the combined model is then adapted with parameter-efficient fine-tuning (PEFT). This paradigm is still inefficient, because the visual prompts lengthen the language model's input. The authors instead regard visual prompts as additional knowledge that helps the language model address tasks involving visual information, and propose Memory-Space Visual Prompting (MemVP): visual prompts are concatenated with the weights of the feed-forward networks (FFNs), injecting visual knowledge into the model's memory space rather than into its input sequence. Experiments across various VL tasks and language models show that MemVP significantly reduces training time and inference latency while outperforming previous PEFT methods.
Low | GrooveSquid.com (original content) | This paper is about making computer models that understand both pictures and words. Today, such models are built by feeding the output of a picture-processing model into a word-processing model, which is not very efficient. The researchers propose a new idea called MemVP, which adds the visual information directly to the language-processing part of the model instead. This helps the model handle tasks involving both pictures and words better and faster. In tests on different kinds of tasks and models, MemVP performed well and was more efficient than previous methods.
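The medium summary describes MemVP as concatenating visual prompts with the FFN weights of the language model. A minimal NumPy sketch of that idea is shown below; the function names and shapes are illustrative assumptions, not the paper's actual implementation, and the projection of visual features into key/value form is simplified away:

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # Standard transformer FFN: act(x @ W1 + b1) @ W2 + b2 (ReLU here for simplicity).
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

def memvp_ffn(x, W1, b1, W2, b2, vis_keys, vis_values):
    # Memory-space visual prompting (sketch): projected visual features are
    # appended as extra key/value slots of the FFN weight matrices, so visual
    # knowledge extends the FFN's "memory" instead of lengthening the input
    # sequence seen by the attention layers.
    W1_aug = np.concatenate([W1, vis_keys], axis=1)              # (d, h + m)
    b1_aug = np.concatenate([b1, np.zeros(vis_keys.shape[1])])   # (h + m,)
    W2_aug = np.concatenate([W2, vis_values], axis=0)            # (h + m, d)
    h = np.maximum(x @ W1_aug + b1_aug, 0.0)
    return h @ W2_aug + b2
```

Because only the appended visual slots (and a small projector producing them) would be image-dependent, the token sequence length — and hence attention cost — stays the same as in the text-only model, which is the efficiency argument the summary makes.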
Keywords
» Artificial intelligence » Fine tuning » Inference » Parameter efficient » Prompting