M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
by Qingpei Guo, Furong Xu, Hanxiao Zhang, Wang Ren, Ziping Ma, Lin Ju, Jian Wang, Jingdong Chen, Ming Yang
First submitted to arxiv on: 29 Jan 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | The paper’s original abstract |
Medium | GrooveSquid.com (original content) | This paper introduces BM-6B, a novel bilingual dataset of over 6 billion Chinese and English image-text pairs, built to train multimodal foundation models that understand images well in both languages. To handle training at this scale, the authors propose a grouped aggregation approach for contrastive loss computation, which reduces communication overhead and GPU memory demands by 60%. Trained on BM-6B with this method, the resulting bilingual image-text foundation models, dubbed M^2-Encoders (pronounced “M-Square”), gain improved fine-grained understanding and set new benchmarks for multimodal retrieval and classification in both languages. The largest model, M^2-Encoder-10B, achieves zero-shot top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN, surpassing previous state-of-the-art methods by 2.2% and 21.1%, respectively. |
Low | GrooveSquid.com (original content) | This paper creates a big dataset with lots of pictures and words in both Chinese and English. They want to make computers better at understanding these images. To handle this huge amount of data, they came up with a new way of doing things that makes it faster and more efficient. This leads to computers being able to understand images even better. The result is a special kind of computer model that can recognize pictures in both languages really well. It’s so good that it beats previous records by a lot! |
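The core efficiency idea, grouped aggregation for contrastive loss, can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's implementation; the function names, group count, and temperature are assumptions. Instead of materializing one full-batch similarity matrix for a CLIP-style contrastive loss, the batch is split into groups and each group computes its loss locally, so the matrix shrinks from (N, N) to (N/g, N/g), cutting memory and (in a distributed setting) communication.

```python
import numpy as np

def info_nce_loss(img, txt, tau=0.07):
    """Symmetric CLIP-style InfoNCE loss over one group of embeddings."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # (n, n) similarity matrix for this group
    labels = np.arange(len(img))          # matching pairs sit on the diagonal

    def xent(l):
        # cross-entropy of the diagonal targets, with a max-shift for stability
        l = l - l.max(axis=1, keepdims=True)
        p = np.exp(l) / np.exp(l).sum(axis=1, keepdims=True)
        return -np.log(p[labels, labels]).mean()

    # average the image->text and text->image directions
    return 0.5 * (xent(logits) + xent(logits.T))

def grouped_contrastive_loss(img, txt, num_groups=4, tau=0.07):
    """Split the batch into groups and average per-group InfoNCE losses.

    Each group only needs an (N/g, N/g) similarity matrix instead of the
    full (N, N) one -- this is the memory/communication saving being
    illustrated (a sketch of the idea, not the authors' exact scheme).
    """
    losses = [info_nce_loss(i, t, tau)
              for i, t in zip(np.array_split(img, num_groups),
                              np.array_split(txt, num_groups))]
    return float(np.mean(losses))
```

As a sanity check, near-identical image/text embedding pairs should score a much lower loss than randomly mismatched ones, since the diagonal of each group's similarity matrix dominates.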
Keywords
» Artificial intelligence » Classification » Contrastive loss » Encoder » Zero shot