Summary of MouSi: Poly-Visual-Expert Vision-Language Models, by Xiaoran Fan et al.
MouSi: Poly-Visual-Expert Vision-Language Models
by Xiaoran Fan, Tao Ji, Changhao Jiang, Shuo Li, Senjie Jin, Sirui Song, Junke Wang, Boyang Hong, Lu Chen, Guodong Zheng, Ming Zhang, Caishuang Huang, Rui Zheng, Zhiheng Xi, Yuhao Zhou, Shihan Dou, Junjie Ye, Hang Yan, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang
First submitted to arXiv on: 30 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper tackles challenges faced by large vision-language models (VLMs) in accurately interpreting complex visual information and contextual data. The authors identify two key issues: a single visual component may lack sufficient capability, and excessively long visual token sequences hinder the model’s performance. To address these limitations, they propose an ensemble-of-experts technique that combines the strengths of individual visual encoders specialized for tasks such as image-text matching, OCR, and image segmentation. A fusion network unifies the processing of the experts’ outputs and bridges the gap between the image encoders and pre-trained large language models (LLMs). The authors also explore different positional encoding schemes to alleviate position overflow and length limitations. Experimental results show that VLMs with multiple experts outperform those with isolated visual encoders, yielding a significant performance boost. The training code is open-sourced on the project website. (A minimal illustrative sketch of the fusion idea follows this table.) |
| Low | GrooveSquid.com (original content) | This paper solves problems with big AI models that understand pictures and words. These models struggle when they need to understand very long descriptions of images or complex visual information. To fix this, the researchers propose a new way to combine many smaller image-processing experts into one strong model. They also develop ways to make the model’s internal representation more efficient so it can handle longer descriptions. By combining these ideas, the authors show that their model performs much better than previous models. All the code and resources are available online. |
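To make the ensemble-of-experts and fusion ideas above concrete, here is a minimal sketch of how outputs from several visual experts might be pooled and projected into an LLM’s embedding space. This is an illustrative assumption, not the authors’ open-sourced implementation: the class name `PolyExpertFusion`, the cross-attention-style pooling, and all dimensions are hypothetical.

```python
import torch
import torch.nn as nn

class PolyExpertFusion(nn.Module):
    """Hypothetical fusion network: pools each visual expert's output to a
    fixed token budget and projects it into the LLM embedding space."""

    def __init__(self, expert_dims, llm_dim, tokens_per_expert=64):
        super().__init__()
        # One projection per expert (e.g. image-text matching, OCR, segmentation).
        self.projections = nn.ModuleList(
            nn.Linear(d, llm_dim) for d in expert_dims
        )
        # Learned queries pool each expert's variable-length output down to a
        # fixed number of tokens, curbing excessively long visual sequences.
        self.queries = nn.ParameterList(
            nn.Parameter(torch.randn(tokens_per_expert, d)) for d in expert_dims
        )

    def forward(self, expert_features):
        # expert_features: one (batch, seq_len_i, dim_i) tensor per expert.
        fused = []
        for feats, q, proj in zip(expert_features, self.queries, self.projections):
            # Cross-attention-style pooling: queries attend over the features.
            scores = q @ feats.transpose(1, 2) / feats.shape[-1] ** 0.5
            pooled = torch.softmax(scores, dim=-1) @ feats  # (batch, T, dim_i)
            fused.append(proj(pooled))                      # (batch, T, llm_dim)
        # Concatenate the experts' token streams along the sequence axis.
        return torch.cat(fused, dim=1)

# Example: two experts with different feature widths, a batch of 2 images.
fusion = PolyExpertFusion(expert_dims=[768, 1024], llm_dim=4096)
feats = [torch.randn(2, 196, 768), torch.randn(2, 256, 1024)]
visual_tokens = fusion(feats)  # (2, 128, 4096), ready to prepend to text tokens
```

The positional-encoding point can be illustrated the same way. One simple scheme (again an assumption; the paper’s actual schemes may differ) gives an entire run of visual tokens a single shared position id, so long visual sequences do not overflow the LLM’s position budget.

```python
def build_position_ids(num_visual_tokens: int, num_text_tokens: int) -> torch.Tensor:
    # All visual tokens share position 0; text tokens continue from 1.
    visual_ids = torch.zeros(num_visual_tokens, dtype=torch.long)
    text_ids = torch.arange(1, num_text_tokens + 1, dtype=torch.long)
    return torch.cat([visual_ids, text_ids])

build_position_ids(128, 5)  # tensor([0, 0, ..., 0, 1, 2, 3, 4, 5])
```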
Keywords
- Artificial intelligence
- Image segmentation
- Positional encoding