


Multimodal Variational Autoencoder: a Barycentric View

by Peijie Qiu, Wenhui Zhu, Sayantan Kumar, Xiwen Chen, Xiaotong Sun, Jin Yang, Abolfazl Razi, Yalin Wang, Aristeidis Sotiras

First submitted to arXiv on: 29 Dec 2024

Categories

  • Main: Machine Learning (cs.LG)
  • Secondary: Computer Vision and Pattern Recognition (cs.CV); Information Theory (cs.IT)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (written by GrooveSquid.com, original content)
The paper proposes a novel approach to learning generative models for multimodal representation learning, particularly when some modalities are missing. The goal is to learn both modality-invariant and modality-specific representations that capture information across modalities. To this end, the authors give an alternative theoretical formulation of multimodal VAEs through the lens of barycenters, showing that previous approaches such as the product of experts (PoE) and the mixture of experts (MoE) are specific instances of barycenters. They then generalize these two cases to a more flexible family by considering other divergences, in particular the Wasserstein barycenter defined by the 2-Wasserstein distance. Empirical studies on three multimodal benchmarks demonstrate the effectiveness of the proposed method. (A small worked sketch of these fusion rules follows the summaries below.)

Low Difficulty Summary (written by GrooveSquid.com, original content)
The paper is about learning how to understand things that have multiple ways of being described (like pictures and sounds). Right now, there are different ways to do this, but they’re not very good at capturing all the important details. The authors found a new way to do it using something called a “barycenter”, which helps connect different types of information together in a better way. They tested their method on three big sets of data and showed that it works really well.
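
To make the barycentric view a bit more concrete, below is a minimal NumPy sketch (not the authors' code) that contrasts three ways of fusing two diagonal-Gaussian unimodal posteriors: the PoE product, the MoE mixture, and the closed-form 2-Wasserstein barycenter. The modalities, latent dimension, means, standard deviations, and weights are all assumed for illustration.

    import numpy as np

    # Two unimodal Gaussian posteriors (diagonal covariance), e.g. one from an image
    # encoder and one from a text encoder. All numbers below are illustrative only.
    mu    = np.array([[0.0, 1.0], [2.0, -1.0]])   # (num_modalities, latent_dim) means
    sigma = np.array([[1.0, 0.5], [0.5, 2.0]])    # per-dimension standard deviations
    w     = np.array([0.5, 0.5])                  # barycenter weights, summing to 1

    # Product of experts (PoE): precision-weighted fusion; in the barycentric view this
    # arises as a KL-type barycenter. Classic PoE corresponds to unit weights; with
    # weights summing to 1 this is a tempered product of the two Gaussians.
    prec    = 1.0 / sigma**2
    poe_var = 1.0 / np.sum(w[:, None] * prec, axis=0)
    poe_mu  = poe_var * np.sum(w[:, None] * prec * mu, axis=0)

    # Mixture of experts (MoE): the barycenter under the forward KL divergence is the
    # weighted mixture; sampling draws one expert per sample.
    k = np.random.choice(len(w), p=w)
    moe_sample = mu[k] + sigma[k] * np.random.randn(mu.shape[1])

    # 2-Wasserstein barycenter of diagonal Gaussians (closed form): average the means
    # and average the standard deviations, dimension by dimension.
    w2_mu    = np.sum(w[:, None] * mu, axis=0)
    w2_sigma = np.sum(w[:, None] * sigma, axis=0)

    print("PoE fused posterior: mean", poe_mu, "std", np.sqrt(poe_var))
    print("MoE sample:         ", moe_sample)
    print("W2 barycenter:       mean", w2_mu, "std", w2_sigma)

Note that the 2-Wasserstein barycenter averages standard deviations directly rather than precisions, so it interpolates between the unimodal posteriors instead of concentrating like PoE; this gives some intuition for why a Wasserstein barycenter can offer a more flexible fusion rule than PoE or MoE.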

Keywords

* Artificial intelligence  * Mixture of experts  * Representation learning