MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
by Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
First submitted to arXiv on: 31 Jul 2024
Categories
- Main: Artificial Intelligence (cs.AI)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com aims to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | MoMa is a novel modality-aware mixture-of-experts (MoE) architecture for pre-training mixed-modal, early-fusion language models. It processes images and text in arbitrary sequences by dividing the expert modules in each MoE layer into modality-specific groups, so that each group handles only tokens of its own modality while learned routing within the group preserves per-token adaptivity. This modality-specific parameter allocation yields substantial pre-training efficiency gains: the MoMa 1.4B model achieves 3.7x overall FLOPs savings compared to a compute-equivalent dense baseline. Combining MoMa with mixture-of-depths (MoD) improves pre-training FLOPs savings further, although the combination hurts causal-inference performance because it becomes more sensitive to router accuracy. A minimal routing sketch follows this table.
Low | GrooveSquid.com (original content) | A new way of training language models is presented that handles images and text together more efficiently. The model is divided into parts that each look at only one type of data (images or text). The results show that this approach needs far less computing power than a comparable standard model, which could lead to more capable and efficient AI systems.
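The modality-specific expert grouping described in the medium summary can be illustrated with a short sketch. The code below is not the authors' implementation: the class names, dimensions, and the simple top-1 per-group router are assumptions chosen for clarity (the paper's actual routing scheme may differ), but it shows the core idea of sending each token only to experts of its own modality.

```python
# Minimal sketch of modality-aware MoE routing (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single feed-forward expert."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ModalityAwareMoE(nn.Module):
    """
    Routes each token only to experts of its own modality group
    (text experts for text tokens, image experts for image tokens),
    with a learned top-1 router inside each group.
    """
    def __init__(self, d_model: int = 64, d_hidden: int = 256,
                 n_text_experts: int = 4, n_image_experts: int = 4):
        super().__init__()
        self.groups = nn.ModuleDict({
            "text": nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_text_experts)),
            "image": nn.ModuleList(ExpertFFN(d_model, d_hidden) for _ in range(n_image_experts)),
        })
        self.routers = nn.ModuleDict({
            "text": nn.Linear(d_model, n_text_experts),
            "image": nn.Linear(d_model, n_image_experts),
        })

    def forward(self, tokens: torch.Tensor, modality: torch.Tensor) -> torch.Tensor:
        """
        tokens:   (n_tokens, d_model) flattened mixed-modal sequence
        modality: (n_tokens,) with 0 = text token, 1 = image token
        """
        out = torch.zeros_like(tokens)
        for mod_id, name in enumerate(["text", "image"]):
            mask = modality == mod_id
            if not mask.any():
                continue
            x = tokens[mask]                          # tokens of this modality only
            weights = F.softmax(self.routers[name](x), dim=-1)
            top_w, top_idx = weights.max(dim=-1)      # top-1 expert per token
            y = torch.zeros_like(x)
            for e, expert in enumerate(self.groups[name]):
                sel = top_idx == e
                if sel.any():
                    y[sel] = top_w[sel].unsqueeze(-1) * expert(x[sel])
            out[mask] = y
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoE()
    tokens = torch.randn(10, 64)              # 10 interleaved tokens
    modality = torch.randint(0, 2, (10,))     # arbitrary text/image ordering
    print(layer(tokens, modality).shape)      # torch.Size([10, 64])
```

Because the token-to-group assignment is fixed by the token's modality, no learned decision is needed at that level; the learned router only chooses among experts within a group. This is what allows the modality-specific parameter allocation discussed in the summaries above.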
Keywords
* Artificial intelligence
* Inference
* Mixture of experts