Summary of TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning, by Xiangyu Zeng et al.
TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning
by Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, Limin Wang
First submitted to arXiv on: 25 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Multimedia (cs.MM)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | TimeSuite, a collection of designs for adapting existing short-form video MLLMs to long video understanding, is proposed. It includes a simple framework for processing long video sequences, a high-quality video dataset for grounded tuning, and an instruction tuning task that incorporates grounding supervision into the traditional QA format. Specifically, the VideoChat-T model is built by applying token shuffling to compress long video token sequences and by introducing Temporal Adaptive Position Encoding (TAPE) to enhance the temporal awareness of the visual representation (a sketch of the token-shuffle idea follows the table). The TimePro dataset comprises 9 tasks with 349k high-quality grounded annotations, including a new instruction tuning task type, Temporal Grounded Caption, which predicts time stamps for detailed video descriptions. Experimental results show that TimeSuite is an effective solution for enhancing long video understanding, with gains on the Egoschema and VideoMME benchmarks. |
Low | GrooveSquid.com (original content) | Long-form videos are hard for machine learning models to understand. A new approach called TimeSuite is proposed to help. It includes a way to process long videos efficiently, a large dataset with accurate time annotations, and a special task that teaches the model about time. The main innovation is an instruction tuning task that asks the model to describe a video in detail while also predicting when events happen (a hypothetical annotation sample follows the table). This helps the model attend to the right parts of the video and make fewer mistakes. The results show that TimeSuite improves the performance of machine learning models on long-form videos. |
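The summary describes token shuffling only at a high level: adjacent visual tokens are merged so that a long video fits in the language model's context. Below is a minimal PyTorch sketch of that general idea, not the paper's exact design; the class name, the grouping ratio, and the linear projection are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class TokenShuffleCompressor(nn.Module):
    """Illustrative sketch: concatenate every `ratio` adjacent visual
    tokens along the channel axis, then project back to the model
    width, shrinking the sequence `ratio`-fold. Names and the use of a
    single linear projection are assumptions, not TimeSuite's design."""

    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        self.ratio = ratio
        self.proj = nn.Linear(dim * ratio, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); seq_len must be divisible by ratio
        b, n, d = tokens.shape
        assert n % self.ratio == 0, "pad the sequence to a multiple of ratio"
        grouped = tokens.reshape(b, n // self.ratio, d * self.ratio)
        return self.proj(grouped)  # (batch, seq_len // ratio, dim)

# Example: 1024 frame tokens compressed to 256 before entering the LLM
x = torch.randn(2, 1024, 768)
compressed = TokenShuffleCompressor(dim=768, ratio=4)(x)
print(compressed.shape)  # torch.Size([2, 256, 768])
```

The appeal of this style of compression is that it discards no tokens outright: neighboring tokens are folded into the channel dimension, so the projection can learn what to keep.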
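To make the Temporal Grounded Caption task concrete, here is a hypothetical annotation sample. This is not the released TimePro schema; the field names, file name, and time stamps are invented to show the shape of the task, in which a detailed caption is tied to predicted time spans.

```python
# Hypothetical example of a Temporal Grounded Caption training sample:
# each caption segment is grounded to a start/end time in seconds.
sample = {
    "video": "cooking_demo.mp4",          # made-up file name
    "task": "temporal_grounded_caption",
    "caption": [
        {"start": 0.0,  "end": 12.5, "text": "A chef rinses and chops vegetables."},
        {"start": 12.5, "end": 40.0, "text": "The vegetables are stir-fried in a wok."},
    ],
}
```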
Keywords
» Artificial intelligence » Grounding » Instruction tuning » Large language model » Machine learning » Token