MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts

by Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan

First submitted to arXiv on: 31 Jul 2024

Categories

  • Main: Artificial Intelligence (cs.AI)
  • Secondary: Machine Learning (cs.LG)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
The high difficulty version is the paper's original abstract; read it on the paper's arXiv page.

Medium Difficulty Summary (original content by GrooveSquid.com)
A novel architecture called MoMa is introduced for pre-training mixed-modal, early-fusion language models. This modality-aware mixture-of-experts (MoE) design processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. The paper shows that this modality-specific parameter allocation yields substantial pre-training efficiency gains, with the MoMa 1.4B model achieving 3.7x overall FLOPs savings (2.6x for text and 5.2x for image processing) compared to a compute-equivalent dense baseline. Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings, although this combination hurts causal-inference performance due to increased sensitivity to router accuracy. (A minimal code sketch of the modality-aware routing idea follows these summaries.)

Low Difficulty Summary (original content by GrooveSquid.com)
A new way of training language models is presented that handles images and text together more efficiently. This is done by dividing the model into parts that each look only at one type of data (images or text). The results show that this approach saves a lot of computing power compared to traditional methods, which could lead to more capable and efficient AI systems.
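
To make the architecture described in the medium summary concrete, below is a minimal sketch of modality-aware expert routing: the feed-forward experts are split into a text group and an image group, and each token is routed only among the experts of its own modality. The class name, dimensions, and the simple top-1 token-choice router are illustrative assumptions made for this sketch, not the paper's implementation, whose routing details may differ.

```python
# Minimal, hypothetical sketch of modality-aware expert routing (not the
# paper's implementation). Experts are split into a text group and an image
# group; each token is routed only within the group matching its modality.
import torch
import torch.nn as nn


def make_expert(d_model: int, d_ff: int) -> nn.Module:
    """One feed-forward expert: Linear -> GELU -> Linear."""
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))


class ModalityAwareMoE(nn.Module):
    """Routes text tokens to text experts and image tokens to image experts."""

    def __init__(self, d_model: int, d_ff: int, n_text_experts: int, n_image_experts: int):
        super().__init__()
        self.text_experts = nn.ModuleList(
            [make_expert(d_model, d_ff) for _ in range(n_text_experts)]
        )
        self.image_experts = nn.ModuleList(
            [make_expert(d_model, d_ff) for _ in range(n_image_experts)]
        )
        # Each modality group gets its own learned router.
        self.text_router = nn.Linear(d_model, n_text_experts)
        self.image_router = nn.Linear(d_model, n_image_experts)

    def _route(self, x: torch.Tensor, router: nn.Linear, experts: nn.ModuleList) -> torch.Tensor:
        # Simple top-1 token-choice routing within one modality group
        # (an assumption made for brevity in this sketch).
        scores = router(x).softmax(dim=-1)       # (n_tokens, n_experts)
        weight, expert_idx = scores.max(dim=-1)  # (n_tokens,)
        out = torch.zeros_like(x)
        for e, expert in enumerate(experts):
            mask = expert_idx == e
            if mask.any():
                out[mask] = weight[mask, None] * expert(x[mask])
        return out

    def forward(self, x: torch.Tensor, is_image: torch.Tensor) -> torch.Tensor:
        # x: (n_tokens, d_model); is_image: (n_tokens,) boolean modality mask.
        out = torch.zeros_like(x)
        text_mask = ~is_image
        if text_mask.any():
            out[text_mask] = self._route(x[text_mask], self.text_router, self.text_experts)
        if is_image.any():
            out[is_image] = self._route(x[is_image], self.image_router, self.image_experts)
        return out


if __name__ == "__main__":
    layer = ModalityAwareMoE(d_model=64, d_ff=256, n_text_experts=4, n_image_experts=4)
    tokens = torch.randn(10, 64)
    is_image = torch.tensor([0, 0, 1, 1, 1, 0, 1, 0, 0, 1], dtype=torch.bool)
    print(layer(tokens, is_image).shape)  # -> torch.Size([10, 64])
```

Because each router only ever scores tokens of its own modality, expert capacity and parameters can be allocated per modality, which is the source of the efficiency gains the summaries describe.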

Keywords

* Artificial intelligence
* Inference
* Mixture of experts