MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
by Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
First submitted to arXiv on: 31 Jul 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com aims to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | MoMa is a novel modality-aware mixture-of-experts (MoE) architecture for pre-training mixed-modal, early-fusion language models. It processes images and text in arbitrary sequences by dividing the expert modules in each MoE layer into modality-specific groups, so that each group handles only tokens of its own modality while learned routing within the group preserves per-token adaptivity. This modality-specific parameter allocation yields substantial pre-training efficiency gains: the MoMa 1.4B model achieves 3.7x overall FLOPs savings compared to a compute-equivalent dense baseline. Combining MoMa with mixture-of-depths (MoD) improves pre-training FLOPs savings further, although the combination hurts causal-inference performance because it becomes more sensitive to router accuracy. A minimal routing sketch follows this table.
Low | GrooveSquid.com (original content) | A new way of training language models is presented that handles images and text together more efficiently. The model is divided into parts that each look at only one type of data (images or text). The results show that this approach needs far less computing power than a comparable standard model, which could lead to more capable and efficient AI systems.
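The modality-specific expert grouping described in the medium summary can be illustrated with a short sketch. The code below is not the authors' implementation: the class names, dimensions, and the simple top-1 per-group router are assumptions chosen for clarity (the paper's actual routing scheme may differ), but it shows the core idea of sending each token only to experts of its own modality.

```python
# Minimal sketch of modality-aware MoE routing (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ModalityAwareMoE(nn.Module):
    """
    Routes each token only to experts of its own modality group
    (text experts for text tokens, image experts for image tokens),
    with a learned top-1 router inside each group.
    """
    def __init__(self, d_model: int = 64, d_hidden: int = 256,
                 n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()
        self.groups = nn.ModuleDict({
            "text": nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_text_experts)),
            "image": nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_image_experts)),
        })
        self.routers = nn.ModuleDict({
            "text": nn.Linear(d_model, n_text_experts),
            "image": nn.Linear(d_model, n_image_experts),
        })

    def forward(self, tokens: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        """
        tokens:   (n_tokens, d_model) flattened mixed-modal sequence
        modality: (n_tokens,) with 0 = text token, 1 = image token
        """
        out = torch.zeros_like(tokens)
        for mod_id, name in enumerate(["text", "image"]):
            mask = modality == mod_id
            if not mask.any():
                continue
            x = tokens[mask]                          # tokens of this modality only
            weights = F.softmax(self.routers[name](x), dim=-1)
            top_w, top_idx = weights.max(dim=-1)      # top-1 expert per token
            y = torch.zeros_like(x)
            for e, expert in enumerate(self.groups[name]):
                sel = top_idx == e
                if sel.any():
                    y[sel] = top_w[sel].unsqueeze(-1) * expert(x[sel])
            out[mask] = y
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoE()
    tokens = torch.randn(10, 64)              # 10 interleaved tokens
    modality = torch.randint(0, 2, (10,))     # arbitrary text/image ordering
    print(layer(tokens, modality).shape)      # torch.Size([10, 64])
```

Because the token-to-group assignment is fixed by the token's modality, no learned decision is needed at that level; the learned router only chooses among experts within a group. This is what allows the modality-specific parameter allocation discussed in the summaries above.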
Keywords
* Artificial intelligence
* Inference
* Mixture of experts