Summary of Free Video-LLM: Prompt-Guided Visual Perception for Efficient Training-Free Video LLMs, by Kai Han et al.
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
by Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang
First submitted to arXiv on: 14 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper introduces Free Video-LLM, a prompt-guided visual perception framework that adapts pre-trained image LLMs to video understanding tasks without any additional training, addressing the complexity and computational demands of video LLMs. By decoupling the spatial and temporal dimensions and performing temporal frame sampling and spatial RoI (region-of-interest) cropping based on task-specific prompts, the method reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. The authors report results competitive with state-of-the-art video LLMs while using significantly fewer tokens. |
| Low | GrooveSquid.com (original content) | The paper presents a way to use computer models for understanding videos without needing to train them from scratch. This is helpful because training these models is time-consuming and requires a lot of computing power. Instead, the authors propose a new approach that takes advantage of pre-trained image models and adapts them for video tasks, making video understanding more efficient. |
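The token-reduction idea in the medium-difficulty summary can be sketched in a few lines. The code below is not from the paper; it is a minimal, hypothetical illustration of decoupled temporal frame sampling plus spatial RoI cropping, where the RoI is passed in directly (in Free Video-LLM it would be derived from the task prompt) and tokens are counted ViT-style as 14x14 patches.

```python
import numpy as np

def temporal_sample(frames: np.ndarray, num_keep: int) -> np.ndarray:
    """Uniformly sample `num_keep` frames from a (T, H, W, C) clip."""
    total = frames.shape[0]
    idx = np.linspace(0, total - 1, num_keep).astype(int)
    return frames[idx]

def spatial_roi_crop(frames: np.ndarray, roi: tuple) -> np.ndarray:
    """Crop every frame to a region of interest given as (top, left, h, w)."""
    top, left, h, w = roi
    return frames[:, top:top + h, left:left + w, :]

def reduce_visual_tokens(frames: np.ndarray, num_keep: int,
                         roi: tuple, patch: int = 14):
    """Apply temporal sampling then spatial cropping; return the kept
    frames and a rough visual-token count assuming ViT-style patches."""
    kept = spatial_roi_crop(temporal_sample(frames, num_keep), roi)
    t, h, w, _ = kept.shape
    return kept, t * (h // patch) * (w // patch)
```

For example, a 32-frame 224x224 clip yields 32 x 16 x 16 = 8192 patch tokens; keeping 4 frames and a 112x112 central crop leaves 4 x 8 x 8 = 256 tokens, a 32x reduction of the kind the summary describes.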
Keywords
» Artificial intelligence » Prompt » Question answering