DREAM: Improving Video-Text Retrieval Through Relevance-Based Augmentation Using Large Foundation Models
by Yimu Wang, Shuai Yuan, Bo Xue, Xiangru Jian, Wei Pang, Mushi Wang, Ning Yu
First submitted to arXiv on: 7 Apr 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Computation and Language (cs.CL); Information Retrieval (cs.IR); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same paper and are written at different levels of difficulty: the medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read whichever version suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract on arXiv. |
| Medium | GrooveSquid.com (original content) | Recent advancements in video-text retrieval have been driven primarily by improvements in model architectures and training strategies. However, the representation learning capabilities of video-text retrieval models remain constrained by limited, low-quality training data annotations. To address this issue, we present DREAM, a novel ViDeoText Retrieval Paradigm with RElevance-based AugMentation that enhances video and text data using large foundation models to learn more generalized features. Our approach involves a simple augmentation method that generates self-similar data by randomly duplicating or dropping subwords and frames (sketched in the example after this table), as well as a more robust method based on textual paraphrasing and video stylization using large language models (LLMs) and visual generative models (VGMs). Additionally, we propose a relevance-based augmentation method in which LLMs and VGMs generate new relevant information and integrate it into the original data. Experimental results on several video-text retrieval benchmarks demonstrate the superiority of DREAM over existing methods. |
| Low | GrooveSquid.com (original content) | A team of researchers has found a way to improve how computers match videos with text. Today's systems struggle to learn from limited, low-quality training data, which caps how well they can connect videos and text. To fix this, the team created a new approach called DREAM that uses large language models and visual generative models to enhance video and text data, helping the models learn more general features. Tested on several benchmark datasets, DREAM outperformed existing methods. |
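To make the simplest of the three augmentations concrete, here is a minimal Python sketch of "self-similar" augmentation via randomly duplicating or dropping sequence elements, applied identically to a caption's subword tokens or a clip's frame indices. The function name, probabilities, and example inputs are illustrative assumptions, not the paper's actual implementation.

```python
import random

def augment_sequence(tokens, p_dup=0.1, p_drop=0.1, rng=None):
    """Randomly duplicate or drop elements of a sequence.

    The same operation works for the subword tokens of a caption
    or for the indices of sampled video frames (an assumed setup,
    not the paper's code).
    """
    rng = rng or random.Random()
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_drop:
            continue            # drop this subword/frame
        out.append(tok)
        if r > 1.0 - p_dup:
            out.append(tok)     # duplicate this subword/frame
    return out or list(tokens)  # never return an empty sequence

# Hypothetical inputs: a subword-tokenized caption and 16 frame indices.
caption = ["a", "dog", "chas", "##es", "a", "ball"]
frames = list(range(16))
print(augment_sequence(caption, rng=random.Random(0)))
print(augment_sequence(frames, rng=random.Random(0)))
```

Because the augmented sequence stays close to the original, such samples act as cheap positives for retrieval training; the paper's stronger LLM/VGM-based paraphrasing and stylization methods go further by rewriting the content itself.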
Keywords
* Artificial intelligence
* Representation learning