Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models
by Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma
First submitted to arXiv on: 12 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper presents a novel approach to extending Large Vision-Language Models (LVLMs) from images to videos. The authors exploit the visual commonalities between images and videos to efficiently evolve image-LVLMs into video-LVLMs. They introduce a cost-effective video-LVLM architecture, propose innovative training strategies, and identify the types of video instruction data that are most effective. A weighted token sampler sharply compresses the visual tokens in each video frame (see the sketch after this table), and only 10% of the available video data is used across the various training phases, greatly reducing computational expense. The resulting model, FTFV-LVLM, is evaluated on both image and video benchmarks, showing exceptional results that validate its design and training approach. |
| Low | GrooveSquid.com (original content) | This paper helps us talk to computers better by creating a way to turn image-based models into video-based models. It does this by finding similarities between images and videos and using them to build a new model that is more efficient and effective. The new model, called FTFV-LVLM, uses far less video data than before and still performs well on both image and video tasks. |
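The summaries above describe the paper's weighted token sampler only at a high level, and this page does not include the authors' implementation. Purely as an illustration, here is a minimal sketch of what per-frame weighted token sampling could look like; the function name `weighted_token_sampler`, the `keep_ratio` parameter, and the use of top-k selection over importance scores are assumptions for this sketch, not the paper's actual method.

```python
import torch

def weighted_token_sampler(frame_tokens: torch.Tensor,
                           token_weights: torch.Tensor,
                           keep_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the highest-weight visual tokens in each frame.

    frame_tokens:  (num_frames, num_tokens, dim) visual tokens per frame.
    token_weights: (num_frames, num_tokens) importance score per token,
                   e.g. attention mass from the vision encoder (an assumption here).
    keep_ratio:    fraction of tokens retained per frame (assumed value).
    """
    num_frames, num_tokens, dim = frame_tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k highest-weight tokens in each frame.
    topk_idx = token_weights.topk(k, dim=1).indices         # (num_frames, k)
    gather_idx = topk_idx.unsqueeze(-1).expand(-1, -1, dim)
    # Gather the selected tokens, shrinking the token dimension to k.
    return frame_tokens.gather(1, gather_idx)               # (num_frames, k, dim)

# Example: 8 frames of 256 tokens each, keeping 25% -> 64 tokens per frame.
tokens = torch.randn(8, 256, 1024)
weights = torch.rand(8, 256)
print(weighted_token_sampler(tokens, weights).shape)  # torch.Size([8, 64, 1024])
```

The intuition behind the "fewer tokens" half of the title is that shrinking every frame from hundreds of visual tokens to a few dozen shortens the sequence the language model must attend over, so compute cost drops roughly in proportion to the retained fraction.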
Keywords
- Artificial intelligence
- Token