Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
by Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma
First submitted to arXiv on: 12 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper presents a novel approach to extending Large Vision-Language Models (LVLMs) from images to videos. The authors exploit the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. They introduce a cost-effective video-LVLM architecture, propose innovative training strategies, and identify the types of video instruction data that are most effective. A weighted token sampler sharply compresses the visual tokens in each video frame (see the sketch after this table), and only 10% of the available video data is used across the various training phases, greatly reducing computational expense. The resulting model, FTFV-LVLM, is evaluated on both image and video benchmarks, showing exceptional results that validate its design and training approach. |
| Low | GrooveSquid.com (original content) | This paper helps us talk to computers better by creating a way to turn image-based models into video-based models. It does this by finding similarities between images and videos and using them to build a new model that is more efficient and effective. The new model, called FTFV-LVLM, uses far less video data than before and still performs well on both image and video tasks. |
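The summaries above describe the paper's weighted token sampler only at a high level, and this page does not include the authors' implementation. Purely as an illustration, here is a minimal sketch of what per-frame weighted token sampling could look like; the function name `weighted_token_sampler`, the `keep_ratio` parameter, and the use of top-k selection over importance scores are assumptions for this sketch, not the paper's actual method.

```python
import torch

def weighted_token_sampler(frame_tokens: torch.Tensor,
                           token_weights: torch.Tensor,
                           keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the highest-weight visual tokens in each frame.

    frame_tokens:  (num_frames, num_tokens, dim) visual tokens per frame.
    token_weights: (num_frames, num_tokens) importance score per token,
                   e.g. attention mass from the vision encoder (an assumption here).
    keep_ratio:    fraction of tokens retained per frame (assumed value).
    """
    num_frames, num_tokens, dim = frame_tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k highest-weight tokens in each frame.
    topk_idx = token_weights.topk(k, dim=1).indices         # (num_frames, k)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)
    # Gather the selected tokens, shrinking the token dimension to k.
    return frame_tokens.gather(1, gather_idx)               # (num_frames, k, dim)

# Example: 8 frames of 256 tokens each, keeping 25% -> 64 tokens per frame.
tokens = torch.randn(8, 256, 1024)
weights = torch.rand(8, 256)
print(weighted_token_sampler(tokens, weights).shape)  # torch.Size([8, 64, 1024])
```

The intuition behind the "fewer tokens" half of the title is that shrinking every frame from hundreds of visual tokens to a few dozen shortens the sequence the language model must attend over, so compute cost drops roughly in proportion to the retained fraction.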
Keywords
- Artificial intelligence
- Token