

Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models

by Shimin Chen, Yitian Yuan, Shaoxiang Chen, Zequn Jie, Lin Ma

First submitted to arXiv on: 12 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper presents a novel approach to transitioning Large Vision-Language Models (LVLMs) from images to videos. The authors leverage the visual commonalities between images and videos to evolve image-LVLMs into video-LVLMs efficiently. They introduce a cost-effective video-LVLM architecture, innovative training strategies, and an analysis of which types of video instruction data are most effective. A weighted token sampler compresses the visual tokens of each video frame, and the model is trained with only about 10% of the available video data, substantially reducing computational cost. The resulting FTFV-LVLM model is evaluated on both image and video benchmarks, where its strong results validate the design and training approach.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper helps us talk to computers better by creating a way to turn image-based models into video-based models. It does this by finding similarities between images and videos and using that information to build a new model that is more efficient and effective. The new model, called FTFV-LVLM, uses less data than before and still performs well on both image and video tasks.

Keywords

  • Artificial intelligence
  • Token