Summary of Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model, by Qianhan Feng et al.
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
by Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen
First submitted to arXiv on: 2 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes Align-KD, a method for distilling cross-modal alignment knowledge from large Vision-Language Models (VLMs) into smaller models suitable for edge devices. Existing approaches either simplify VLM structures or apply knowledge distillation (KD) designed for single-modal LLMs, neglecting the cross-modal matching that is central to VLMs. Align-KD guides the student model to learn this matching by projecting vision tokens into the text embedding space based on text focus (see the sketch below the table). Trained under the guidance of a 7B teacher, the 1.7B MobileVLM V2 student achieves an average score improvement of 2.0 across six benchmarks. |
Low | GrooveSquid.com (original content) | Researchers are working to make powerful artificial intelligence (AI), such as AI assistants, available on mobile devices. To do this, they need to shrink these AI models without losing their ability to understand and reason. One way to achieve this is to distill knowledge from large AI models into smaller ones. This paper proposes a new method, called Align-KD, that helps small AI models learn how to match visual and text information correctly. The small model trained this way performs better on several tasks than models trained with existing methods. |
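The exact loss used by Align-KD is not given in this summary, but the core idea of matching text-to-vision alignment between teacher and student can be illustrated. Below is a minimal PyTorch sketch, not the authors' implementation: it assumes hypothetical tensors (vision tokens and text embeddings already projected into a shared text embedding space), builds a text-focused soft attention map over the vision tokens for both models, and pulls the student's map toward the teacher's with a KL divergence.

```python
# Minimal sketch (illustrative, not the authors' released code) of a
# cross-modal alignment-distillation loss. All tensor names are assumptions:
# `*_vis` are vision tokens already projected into the text embedding space,
# `*_txt` are text token embeddings. In practice the 1.7B student and 7B
# teacher have different hidden sizes, so a learned linear projection would
# first map student features into the shared space; omitted here for brevity.
import torch
import torch.nn.functional as F


def alignment_map(vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    """Soft alignment of each text token over the vision tokens.

    vis: (batch, num_vision_tokens, dim)
    txt: (batch, num_text_tokens, dim)
    returns: (batch, num_text_tokens, num_vision_tokens) attention map
    """
    scores = txt @ vis.transpose(1, 2)        # dot-product similarity
    scores = scores / (vis.shape[-1] ** 0.5)  # scale for numerical stability
    return F.softmax(scores, dim=-1)          # "text focus" over vision tokens


def align_kd_loss(student_vis, student_txt, teacher_vis, teacher_txt):
    """KL divergence between the student's and teacher's text-to-vision
    alignment maps, so the student mimics the teacher's cross-modal
    matching rather than only its output logits."""
    s_map = alignment_map(student_vis, student_txt)
    with torch.no_grad():  # the teacher is frozen during distillation
        t_map = alignment_map(teacher_vis, teacher_txt)
    return F.kl_div(s_map.clamp_min(1e-8).log(), t_map,
                    reduction="batchmean")


# Example shapes: batch of 2, 576 vision tokens, 32 text tokens, dim 2048.
loss = align_kd_loss(torch.randn(2, 576, 2048), torch.randn(2, 32, 2048),
                     torch.randn(2, 576, 2048), torch.randn(2, 32, 2048))
```

In a full training run, a loss of this kind would be added as an auxiliary term alongside the standard language-modeling and logit-distillation objectives.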
Keywords
» Artificial intelligence » Alignment » Embedding space » Knowledge distillation » Teacher model