Summary of Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization, by James Oldfield et al.
Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization
by James Oldfield, Markos Georgopoulos, Grigorios G. Chrysos, Christos Tzelepis, Yannis Panagakis, Mihalis A. Nicolaou, Jiankang Deng, Ioannis Patras
First submitted to arXiv on: 19 Feb 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The Mixture of Experts (MoE) paradigm has long been a powerful way to decompose dense layers into smaller, modular computations that are more amenable to human interpretation, debugging, and editing. However, scaling to the large number of experts needed for fine-grained specialization has been a major challenge because of the computational cost. To address this, we propose the Multilinear Mixture of Experts (μMoE) layer, focusing on vision models. μMoE layers enable scalable expert specialization by performing an implicit computation over prohibitively large weight tensors entirely in factorized form (see the sketch after this table). This avoids the high inference-time cost of dense MoEs without inheriting the training issues of sparse MoEs’ discrete expert routing. Our approach leads to experts that are more specialized at the class level, enabling manual bias correction in CelebA attribute classification. We also present qualitative results showing expert specialization when pre-training large GPT2 and MLP-Mixer models with μMoE blocks at every layer, while maintaining comparable accuracy. |
Low | GrooveSquid.com (original content) | The MoE paradigm is a way to break a complex computation down into smaller pieces that are easier to understand and work with. However, using enough of these small pieces for each one to become a real specialist is very expensive for computers. To fix this problem, we created a new type of MoE layer called μMoE that keeps the cost low. μMoE helps each part of the model focus on one kind of job, which makes it easier for people to see what the model is doing and to correct it when it makes unfair mistakes on certain tasks. |
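To make the medium summary's central idea concrete, below is a minimal sketch of a factorized mixture-of-experts forward pass with dense (soft) routing. It assumes a CP-style factorization of the expert weight tensor; the function name, the gating choice, and the rank are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def mu_moe_forward(x, gate, A, B, C):
    """Hypothetical sketch: factorized MoE forward pass with soft routing.

    x:    (batch, d_in)      input activations
    gate: (d_in, n_experts)  gating matrix producing soft expert coefficients
    A:    (n_experts, rank)  expert factor of the implicit weight tensor
    B:    (d_in, rank)       input factor
    C:    (d_out, rank)      output factor

    The dense weight tensor W[e, i, o] = sum_r A[e, r] * B[i, r] * C[o, r]
    is never materialized; the contraction is done factor by factor.
    """
    # Soft (dense) expert coefficients per example -- no discrete top-k routing.
    a = torch.softmax(x @ gate, dim=-1)   # (batch, n_experts)

    # Contract with each factor in turn, staying in factorized form throughout.
    h = (x @ B) * (a @ A)                 # (batch, rank)
    y = h @ C.T                           # (batch, d_out)
    return y

# Example usage with illustrative sizes.
batch, d_in, d_out, n_experts, rank = 4, 64, 32, 128, 16
x = torch.randn(batch, d_in)
out = mu_moe_forward(
    x,
    gate=torch.randn(d_in, n_experts),
    A=torch.randn(n_experts, rank),
    B=torch.randn(d_in, rank),
    C=torch.randn(d_out, rank),
)
print(out.shape)  # torch.Size([4, 32])
```

Because the factors are contracted one at a time, the full n_experts × d_in × d_out weight tensor never has to be stored or multiplied explicitly, which is what lets the number of experts grow without the usual blow-up in inference cost.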
Keywords
* Artificial intelligence
* Classification
* Inference
* Mixture of experts