Video-STaR: Self-Training Enables Video Instruction Tuning with Any Supervision
by Orr Zohar, Xiaohan Wang, Yonatan Bitton, Idan Szpektor, Serena Yeung-Levy
First submitted to arXiv on: 8 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | The paper’s original abstract (available on arXiv). |
| Medium | GrooveSquid.com (original content) | Large Vision Language Models (LVLMs) rely heavily on the size and quality of their training datasets. Existing video instruction tuning datasets lack diversity because they are derived by prompting large language models with video captions to generate question-answer pairs. Meanwhile, labeled video datasets with diverse labels exist, but integrating them into LVLMs is non-trivial. To address this challenge, we propose Video Self-Training with augmented Reasoning (Video-STaR), a novel approach that enables any labeled video dataset to be used for video instruction tuning. The method has an LVLM cycle between instruction generation and fine-tuning (see the sketch after this table), which improves general video understanding and adapts LVLMs to novel downstream tasks with existing supervision. Video-STaR-enhanced LVLMs show improved performance in general video QA and on downstream tasks, with a 20% gain in Kinetics700-QA accuracy and a 15% gain in action quality assessment on FineDiving. |
| Low | GrooveSquid.com (original content) | Imagine a machine learning model that can understand videos. To make this happen, we need to train the model with lots of labeled video data, but there isn’t enough data like this available yet. So we came up with a new way to train the model using the labeled video datasets that already exist. This method is called Video Self-Training with augmented Reasoning (Video-STaR). We tested our approach and found that it makes the model better at understanding videos and performing tasks on them. |
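The cycle described in the medium summary follows the general STaR self-training pattern: the model generates candidate answers, only answers consistent with the dataset's existing labels are kept, and the model is fine-tuned on the survivors. Below is a minimal sketch of that loop, assuming a simple substring check for label verification; all names (`model.generate_answer`, `model.fine_tune`, `video_star_cycle`, etc.) are illustrative placeholders, not the authors' actual code or API.

```python
# Hypothetical sketch of a Video-STaR-style self-training cycle.
# Assumes `model` exposes generate_answer() and fine_tune(); these
# are placeholders, not the paper's implementation.

def label_appears_in(answer: str, label: str) -> bool:
    """Weak verification: keep a generated answer only if it
    mentions the dataset's existing label (e.g. an action class)."""
    return label.lower() in answer.lower()

def video_star_cycle(model, labeled_videos, num_cycles=3):
    """labeled_videos: iterable of (video, question, label) triples."""
    for _ in range(num_cycles):
        accepted = []
        # 1. Instruction/answer generation with the current model.
        for video, question, label in labeled_videos:
            answer = model.generate_answer(video, question)
            # 2. Label verification filters out generations that
            #    contradict (or ignore) the existing supervision.
            if label_appears_in(answer, label):
                accepted.append((video, question, answer))
        # 3. Fine-tune on the verified instruction-answer pairs,
        #    then repeat the cycle with the improved model.
        model = model.fine_tune(accepted)
    return model
```

The key idea this sketch illustrates is that verification relies only on labels the dataset already has, so any labeled video dataset can be converted into instruction-tuning data without human-written answers.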
Keywords
» Artificial intelligence » Fine tuning » Instruction tuning » Machine learning » Prompting » Self training