Summary of MBQ: Modality-Balanced Quantization for Large Vision-Language Models, by Shiyao Li et al.
MBQ: Modality-Balanced Quantization for Large Vision-Language Models
by Shiyao Li, Yingchun Hu, Xuefei Ning, Xihui Liu, Ke Hong, Xiaotao Jia, Xiuhong Li, Yaqi Yan, Pei Ran, Guohao Dai, Shengen Yan, Huazhong Yang, Yu Wang
First submitted to arXiv on: 27 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to post-training quantization (PTQ) for vision-language models (VLMs), specifically addressing the difference in quantization sensitivity between language and vision tokens. Existing PTQ methods focus mainly on large language models and neglect the distinct characteristics of other modalities. The proposed Modality-Balanced Quantization (MBQ) method incorporates these per-modality sensitivities during calibration, minimizing a weighted reconstruction loss to obtain better quantization parameters. Experiments show that MBQ improves task accuracy by up to 4.4% under W3 quantization and 11.6% under W4A8 quantization for VLMs ranging from 7B to 70B parameters, outperforming state-of-the-art (SOTA) baselines. Additionally, the authors implement a W3 GPU kernel that fuses the dequantization and GEMV operators, achieving a 1.4x speedup on LLaVA-onevision-7B on an RTX 4090. |
| Low | GrooveSquid.com (original content) | This paper is about making big computer models smaller so they can work faster and use less memory. These models are used for things like recognizing pictures and understanding what people say. The problem is that they need a lot of processing power, which makes them slow and hard to run. To solve this, the researchers developed a new way to shrink these models while keeping them just as good at their jobs. They tested it on some big models and found that it worked well, making the models faster and more efficient. |
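To make the calibration idea in the medium summary concrete, here is a minimal sketch of a sensitivity-weighted calibration search. This is not the paper's implementation: the function names, the simple grid search over clipping scales, and the scalar per-modality sensitivity weights `s_vision` and `s_language` are all illustrative assumptions; it only shows the general shape of choosing quantization parameters by minimizing a reconstruction loss weighted separately over vision and language tokens.

```python
import numpy as np

def quantize_weight(w, scale, bits=3):
    """Symmetric uniform quantization of w to `bits` bits with clipping scale `scale`."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 3 for 3-bit symmetric
    step = scale / qmax                   # quantization step size
    q = np.clip(np.round(w / step), -qmax - 1, qmax)
    return q * step                       # dequantized (fake-quantized) weight

def modality_balanced_scale(w, x_vision, x_language,
                            s_vision, s_language, bits=3, grid=20):
    """Grid-search a clipping scale minimizing a sensitivity-weighted
    reconstruction loss over vision and language calibration tokens.
    s_vision / s_language are hypothetical per-modality sensitivity weights."""
    best_scale, best_loss = None, np.inf
    w_absmax = np.abs(w).max()
    for i in range(1, grid + 1):
        scale = w_absmax * i / grid
        w_q = quantize_weight(w, scale, bits)
        # Output reconstruction error, computed per modality
        err_v = np.mean((x_vision @ w.T - x_vision @ w_q.T) ** 2)
        err_l = np.mean((x_language @ w.T - x_language @ w_q.T) ** 2)
        loss = s_vision * err_v + s_language * err_l
        if loss < best_loss:
            best_loss, best_scale = loss, scale
    return best_scale
```

A plain LLM-style calibration would weight both token groups equally; the point of the modality-balanced variant is that when language tokens are more sensitive to quantization error, a larger `s_language` steers the chosen scale toward preserving their outputs.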
Keywords
- Artificial intelligence
- Quantization