Summary of Conditional Prompt Tuning For Multimodal Fusion, by Ruixiang Jiang et al.


Conditional Prompt Tuning for Multimodal Fusion

by Ruixiang Jiang, Lingbo Liu, Changwen Chen

First submitted to arxiv on: 28 Nov 2023

Categories

  • Main: Computation and Language (cs.CL)
  • Secondary: Artificial Intelligence (cs.AI)


GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
The paper presents a method for parameter-efficient multimodal fusion that leverages the representation of one modality to guide prompting in another. One modality is encoded first, and its representation serves as a prior to conditionally prompt all frozen layers of the other modality, yielding adaptive prompts that capture both global-level and instance-level features. A mixture of prompt experts (MoPE) dynamically routes each instance to the most suitable prompt experts for encoding, and a regularization term is added to avoid degenerate expert routing. The method shows improved expressiveness and scalability compared to vanilla prompting, achieving state-of-the-art results on three multimodal datasets while requiring only 0.7% of trainable parameters.
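The routing idea in the summary above can be sketched in a few lines: a small router scores a pool of learnable prompt experts from the conditioning modality's representation, and the instance-level prompt is their weighted mixture. This is a minimal illustrative sketch, not the authors' code; all names, dimensions, and the linear router are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (assumptions, not taken from the paper).
d_cond = 16      # dim of the conditioning modality's representation
d_model = 32     # hidden dim of the frozen encoder being prompted
prompt_len = 4   # number of prompt tokens per expert
num_experts = 3  # size of the prompt-expert pool

# Learnable pieces: a pool of prompt experts and a linear router.
experts = rng.normal(size=(num_experts, prompt_len, d_model))
router_w = rng.normal(size=(d_cond, num_experts))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def mope_prompt(cond_repr):
    """Mix prompt experts weighted by routing scores from the other modality."""
    scores = softmax(cond_repr @ router_w)            # (num_experts,)
    prompt = np.einsum("e,eld->ld", scores, experts)  # (prompt_len, d_model)
    return prompt, scores

cond = rng.normal(size=(d_cond,))
prompt, scores = mope_prompt(cond)
print(prompt.shape, scores.shape)  # (4, 32) (3,)
```

In a real model the mixed prompt would be prepended to the frozen layer's input tokens, and a regularizer on the batch-averaged routing scores (e.g. penalizing low-entropy, collapsed routing) would discourage all instances from selecting the same expert, in the spirit of the degeneracy-avoidance term the summary mentions.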
Low Difficulty Summary (original content by GrooveSquid.com)
The paper explores a new way to combine information from different kinds of input (like images and words) without training a whole new model for each combination. By using the information from one input type to guide how the other is encoded, the method gets better results while training only a tiny fraction of the model's parameters. This could be useful for things like image-to-text systems or machine translation.

Keywords

* Artificial intelligence  * Parameter efficient  * Prompt  * Prompting  * Regularization  * Translation