Conditional Prompt Tuning for Multimodal Fusion
by Ruixiang Jiang, Lingbo Liu, Changwen Chen
First submitted to arxiv on: 28 Nov 2023
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | The paper presents a method for parameter-efficient multimodal fusion that leverages the representation of one modality to guide prompting in the other. One modality is encoded first, and its representation serves as a prior that conditionally prompts all frozen layers of the other modality's encoder, yielding adaptive prompts that capture both global-level and instance-level features. A mixture of prompt experts (MoPE) dynamically routes each instance to the most suitable prompt experts for encoding, and a regularization term prevents degenerate expert routing. The method is more expressive and scalable than vanilla prompting, achieving state-of-the-art results on three multimodal datasets while training only 0.7% of the parameters.
Low | GrooveSquid.com (original content) | The paper explores a new way to combine information from different senses (like images and words) without training a whole new model for each combination. The authors show that using the information in one sense to guide what is attended to in the other yields better results while updating only a tiny fraction of the model's parameters. This could be useful for things like image-to-text systems or machine translation.
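To make the routing idea in the medium-difficulty summary concrete, here is a minimal NumPy sketch of mixture-of-prompt-experts routing. All sizes, names, and the softmax router are illustrative assumptions, not the paper's actual architecture: a conditioning vector from one modality's frozen encoder scores a pool of learnable prompt experts, and their score-weighted combination becomes the instance-adaptive prompt for the other modality.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): K experts, prompt length L, dim d
K, L, d = 4, 8, 16

prompt_experts = rng.normal(size=(K, L, d))  # learnable prompt experts
router_w = rng.normal(size=(d, K))           # learnable router weights

def mope_prompt(cond, temperature=1.0):
    """Route a conditioning vector (from the frozen encoder of the
    other modality) to a mixture of prompt experts."""
    logits = cond @ router_w / temperature
    scores = np.exp(logits - logits.max())
    scores /= scores.sum()                   # softmax over experts
    # Instance-adaptive prompt: score-weighted sum of expert prompts
    return np.einsum("k,kld->ld", scores, prompt_experts), scores

cond = rng.normal(size=d)                    # representation of modality A
prompt, scores = mope_prompt(cond)
# `prompt` (L x d) would be prepended to modality B's token sequence
# at each frozen layer; a regularizer on `scores` (e.g. penalizing
# collapsed routing) would keep all experts in use during training.
```

Because only the prompt experts and the router are trained while both encoders stay frozen, the trainable-parameter count stays tiny, which is the source of the 0.7% figure reported above.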
Keywords
* Artificial intelligence * Parameter efficient * Prompt * Prompting * Regularization * Translation