Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
by Ni Wang, Dongliang Liao, Xing Xu
First submitted to arXiv on: 23 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here. |
| Medium | GrooveSquid.com (original content) | The paper proposes the Multi-Scale Temporal Difference Transformer (MSTDT), a transformer variant for video-text retrieval that addresses the limitations of standard transformers in capturing local temporal information. The approach extracts inter-frame difference features and integrates them with frame features using a multi-scale temporal transformer. The architecture pairs a short-term multi-scale temporal difference transformer, which models local temporal information, with a long-term temporal transformer, which models global temporal information. A new loss function is also introduced to narrow the distance between similar samples. Combined with the CLIP backbone, the proposed method achieves state-of-the-art results in video-text retrieval. (A minimal code sketch follows this table.) |
| Low | GrooveSquid.com (original content) | The paper proposes a new way to understand videos using special computer models called transformers. These models are good at understanding text, but they are not as good at understanding the order of events in a video. To fix this, the researchers create a new kind of transformer that looks at how different parts of a video move and change over time. They also come up with a way to measure how well the model is doing, which helps it learn to recognize similar videos. |
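To make the two-branch idea concrete, below is a minimal, hypothetical PyTorch sketch under the assumptions implied by the summary: multi-scale inter-frame differences feed a short-term temporal transformer, raw frame features feed a long-term one, and the two are fused into a clip embedding. All class names, dimensions, the choice of scales, and the contrastive loss shown are illustrative stand-ins, not the paper's exact layers or its proposed loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalTransformer(nn.Module):
    """Thin wrapper around nn.TransformerEncoder for temporal modeling."""
    def __init__(self, dim=512, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):          # x: (batch, frames, dim)
        return self.encoder(x)


class MSTDTSketch(nn.Module):
    """Hypothetical sketch of the described design: a short-term branch on
    multi-scale inter-frame differences plus a long-term branch on raw frames."""
    def __init__(self, dim=512, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.short_term = TemporalTransformer(dim)
        self.long_term = TemporalTransformer(dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, frames):     # frames: (batch, T, dim), e.g. CLIP features
        # Multi-scale inter-frame differences: x[t] - x[t - s] for each scale s.
        diffs = []
        for s in self.scales:
            d = frames[:, s:] - frames[:, :-s]
            d = F.pad(d, (0, 0, s, 0))   # pad time axis so every scale keeps T steps
            diffs.append(d)
        diff_feats = torch.stack(diffs).mean(0)

        local_cues = self.short_term(diff_feats).mean(1)   # local temporal information
        global_cues = self.long_term(frames).mean(1)       # global temporal information
        return self.fuse(torch.cat([local_cues, global_cues], dim=-1))


def retrieval_loss(video_emb, text_emb, temperature=0.05):
    """Stand-in symmetric InfoNCE objective; the paper's actual loss, which
    additionally narrows the distance between similar samples, is not
    reproduced here."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(len(v), device=v.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2


# Toy usage with random CLIP-like features (embedding dim assumed to be 512):
model = MSTDTSketch()
video = model(torch.randn(4, 12, 512))   # 4 clips, 12 frames each
text = torch.randn(4, 512)               # matching text embeddings
loss = retrieval_loss(video, text)
```

The differencing step is the key design choice the summary highlights: subtracting temporally offset frame features at several scales exposes short-range motion that plain self-attention over frames tends to smooth over, while the long-term branch keeps the global view.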
Keywords
» Artificial intelligence » Loss function » Transformer