Summary of Multimodal Variational Autoencoder: a Barycentric View, by Peijie Qiu et al.
Multimodal Variational Autoencoder: a Barycentric View
by Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras
First submitted to arXiv on: 29 Dec 2024
Categories
- Main: Machine Learning (cs.LG)
- Secondary: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper proposes a novel approach to learning generative models for multimodal representation learning, particularly when some modalities are missing. The primary goal is to learn modality-invariant and modality-specific representations that characterize information across multiple modalities. To achieve this, the authors provide an alternative theoretical formulation of multimodal VAEs through the lens of barycenters, showing that previous approaches such as product of experts (PoE) and mixture of experts (MoE) are specific instances of barycenters. They then generalize these two barycenters to a more flexible family by considering different choices of divergence, including the Wasserstein barycenter defined by the 2-Wasserstein distance. Empirical studies on three multimodal benchmarks demonstrate the effectiveness of the proposed method. |
Low | GrooveSquid.com (original content) | The paper is about learning how to understand things that have multiple ways of being described (like pictures and sounds). Right now, there are different ways to do this, but they’re not very good at capturing all the important details. The authors found a new way to do it using something called a “barycenter,” which helps connect different types of information together in a better way. They tested their method on three big sets of data and showed that it works really well. |
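To make the barycentric view above concrete, here is a minimal illustrative sketch (not the paper's implementation; all function names are hypothetical) of fusing two unimodal Gaussian posteriors N(mu_i, sigma_i^2) under three choices: PoE-style precision-weighted fusion, a moment-matched Gaussian for the MoE mixture, and the closed-form 1D 2-Wasserstein barycenter.

```python
# Illustrative sketch, assuming diagonal 1D Gaussian experts; not the
# authors' method, just the standard closed forms for each fusion rule.
import numpy as np

def poe(mus, sigmas, w):
    # Product of experts: precisions add (a KL-type barycenter).
    prec = np.sum(w / sigmas**2)
    var = 1.0 / prec
    mu = var * np.sum(w * mus / sigmas**2)
    return mu, np.sqrt(var)

def moe_moment_match(mus, sigmas, w):
    # Mixture of experts: the mixture itself is non-Gaussian; here we
    # return its moment-matched Gaussian (mixture mean and variance).
    mu = np.sum(w * mus)
    var = np.sum(w * (sigmas**2 + mus**2)) - mu**2
    return mu, np.sqrt(var)

def wasserstein_barycenter(mus, sigmas, w):
    # 2-Wasserstein barycenter of 1D Gaussians: weighted average of
    # means and of standard deviations (closed form in 1D).
    return np.sum(w * mus), np.sum(w * sigmas)

mus = np.array([0.0, 2.0])      # expert means
sigmas = np.array([1.0, 0.5])   # expert standard deviations
w = np.array([0.5, 0.5])        # barycentric weights
for name, fn in [("PoE", poe), ("MoE", moe_moment_match),
                 ("W2", wasserstein_barycenter)]:
    mu, sigma = fn(mus, sigmas, w)
    print(f"{name}: mean={mu:.3f}, std={sigma:.3f}")
```

Note how the three rules disagree: PoE concentrates mass toward the more confident expert, MoE spreads it across both, and the Wasserstein barycenter interpolates means and scales directly, which is the flexibility the barycentric framing exposes.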
Keywords
* Artificial intelligence
* Mixture of experts
* Representation learning