
Summary of Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model, by Qianhan Feng et al.


Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

by Qianhan Feng, Wenshuo Li, Tong Lin, Xinghao Chen

First submitted to arXiv on: 2 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper proposes Align-KD, a method for distilling cross-modal alignment knowledge from large Vision-Language Models (VLMs) into smaller models that can run on edge devices. Existing approaches either simplify the VLM architecture or apply knowledge distillation (KD) designed for single-modal LLMs, overlooking the cross-modal matching that is central to VLMs. Align-KD guides the student model to learn this matching, i.e., how vision tokens are projected into the text embedding space based on the focus of the text. With a 1.7B MobileVLM V2 student distilled from a 7B teacher, the method improves the average score across six benchmarks by 2.0.
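To make the mechanism concrete, here is a minimal, hypothetical sketch of such an alignment-distillation loss in PyTorch. The function names, tensor shapes, softmax-based alignment map, and the choice of KL divergence are all illustrative assumptions, not the authors' released code; the paper's exact formulation may differ.

```python
# Hypothetical sketch of cross-modal alignment distillation in the spirit
# of Align-KD. All names and design choices here are assumptions.
import torch
import torch.nn.functional as F


def alignment_map(vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> torch.Tensor:
    """Soft text-to-vision alignment within one model's embedding space.

    vision_tokens: (batch, n_vis, dim) -- vision tokens already projected
                   into the model's text embedding space
    text_tokens:   (batch, n_txt, dim)
    returns:       (batch, n_txt, n_vis) row-stochastic alignment map
    """
    d = vision_tokens.shape[-1]
    scores = torch.einsum("btd,bvd->btv", text_tokens, vision_tokens) / d**0.5
    return scores.softmax(dim=-1)


def align_kd_loss(student_vis, student_txt, teacher_vis, teacher_txt):
    """Match the student's alignment map to the frozen teacher's.

    Each map is computed inside its own model's embedding space, so the
    teacher and student hidden sizes may differ; only the token counts
    (n_txt, n_vis) must agree for the same image-text input.
    """
    with torch.no_grad():
        t_map = alignment_map(teacher_vis, teacher_txt)
    s_map = alignment_map(student_vis, student_txt)
    return F.kl_div(s_map.clamp_min(1e-8).log(), t_map, reduction="batchmean")


# Toy usage: a small student (dim 512) distilled from a larger teacher
# (dim 1024) on the same batch; dimensions are arbitrary placeholders.
s_vis, s_txt = torch.randn(2, 16, 512), torch.randn(2, 8, 512)
t_vis, t_txt = torch.randn(2, 16, 1024), torch.randn(2, 8, 1024)
loss = align_kd_loss(s_vis, s_txt, t_vis, t_txt)
```

Because each alignment map is computed within its own model's embedding space, this kind of loss sidesteps the hidden-size mismatch between a 1.7B student and a 7B teacher: only the text-to-vision attention pattern, not the raw features, is transferred.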
Low Difficulty Summary (original content by GrooveSquid.com)
Researchers want to bring powerful artificial intelligence (AI), such as AI assistants, to mobile devices. To do this, they need to shrink AI models without losing their ability to understand and reason. One way is to distill knowledge from large AI models into smaller ones. This paper proposes Align-KD, a new method that helps small AI models learn to match visual and textual information correctly. The results are strong: the proposed method outperforms existing approaches on several tasks.

Keywords

» Artificial intelligence  » Alignment  » Embedding space  » Knowledge distillation  » Teacher model