Summary of LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information, by Ke Wang et al.
LLaVA-Zip: Adaptive Visual Token Compression with Intrinsic Image Information
by Ke Wang, Hong Xuan
First submitted to arXiv on: 11 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper proposes Dynamic Feature Map Reduction (DFMR) to address visual token overload in multi-modal large language models (MLLMs) such as LLaVA-1.5. Visual tokens consume a significant portion of the model's maximum token limit, which increases computational demands and degrades performance when a prompt includes multiple images or videos. DFMR compresses visual tokens dynamically, using information intrinsic to each image, which frees up token capacity and improves model performance in these scenarios. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary Multi-modal large language models such as LLaVA have made great progress using instruction-following data, but they struggle with visual token overload: images take up much of the model's limited token budget. This is particularly challenging in academic environments with limited resources. The study addresses the problem by dynamically compressing visual tokens, making it possible to handle multi-image and video scenarios in resource-constrained settings. |
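
To make the idea of dynamic visual token compression concrete, here is a minimal Python sketch. It is not the authors' implementation: the summaries above do not spell out how DFMR chooses its compression level, so the within-window standard deviation used here, the `threshold` value, the candidate `factors`, and the function names are all illustrative assumptions. The 24 x 24 grid reflects the 576 visual tokens that LLaVA-1.5 typically produces per image.

```python
import numpy as np

def avg_pool(fmap: np.ndarray, s: int) -> np.ndarray:
    """Average-pool an (H, W, D) feature map with non-overlapping s x s windows."""
    h, w, d = fmap.shape
    assert h % s == 0 and w % s == 0, "pooling factor must divide the grid size"
    return fmap.reshape(h // s, s, w // s, s, d).mean(axis=(1, 3))

def window_variation(fmap: np.ndarray, s: int) -> float:
    """Mean standard deviation inside each s x s window.

    Hypothetical stand-in for an 'intrinsic image information' measure.
    """
    h, w, d = fmap.shape
    windows = fmap.reshape(h // s, s, w // s, s, d)
    return float(windows.std(axis=(1, 3)).mean())

def compress_tokens(fmap: np.ndarray, factors=(1, 2, 3, 4), threshold=0.1) -> np.ndarray:
    """Return visual tokens pooled by the largest factor that loses little detail.

    A window with low variation is assumed to carry little information,
    so averaging it into a single token is treated as safe (assumption).
    """
    d = fmap.shape[-1]
    for s in sorted(factors, reverse=True):
        if window_variation(fmap, s) <= threshold:
            return avg_pool(fmap, s).reshape(-1, d)
    # No candidate passed the threshold: keep every token.
    return fmap.reshape(-1, d)

# A uniform image compresses aggressively; a noisy one keeps all its tokens.
flat = np.ones((24, 24, 8))
noisy = np.random.default_rng(0).normal(size=(24, 24, 8))
print(compress_tokens(flat).shape[0], compress_tokens(noisy).shape[0])
```

In this sketch, an image with little local variation is pooled by a factor of 4 in each direction, going from 576 tokens to 36, while a highly detailed image keeps all 576. The tokens saved on simple images are what would leave room for additional images or video frames within the same token limit.
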
Keywords
» Artificial intelligence » Feature map » Multi modal » Token