Summary of Free Video-LLM: Prompt-Guided Visual Perception for Efficient Training-Free Video LLMs, by Kai Han et al.
Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs
by Kai Han, Jianyuan Guo, Yehui Tang, Wei He, Enhua Wu, Yunhe Wang
First submitted to arXiv on: 14 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper introduces Free Video-LLM, a prompt-guided visual perception framework that adapts pre-trained image LLMs to video understanding tasks without any additional training, addressing the complexity and computational demands of video LLMs. By decoupling the spatial and temporal dimensions and performing temporal frame sampling and spatial RoI (region-of-interest) cropping based on task-specific prompts, the method reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks. The authors report results competitive with state-of-the-art video LLMs while using significantly fewer tokens. |
| Low | GrooveSquid.com (original content) | The paper presents a way to use computer models for understanding videos without needing to train them from scratch. This is helpful because training these models is time-consuming and requires a lot of computing power. Instead, the authors propose a new approach that takes advantage of pre-trained image models and adapts them for video tasks, making video understanding more efficient. |
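The token-reduction idea in the medium-difficulty summary can be sketched in a few lines. The code below is not from the paper; it is a minimal, hypothetical illustration of decoupled temporal frame sampling plus spatial RoI cropping, where the RoI is passed in directly (in Free Video-LLM it would be derived from the task prompt) and tokens are counted ViT-style as 14x14 patches.

```python
import numpy as np

def temporal_sample(frames: np.ndarray, num_keep: int) -> np.ndarray:
    """Uniformly sample `num_keep` frames from a (T, H, W, C) clip."""
    total = frames.shape[0]
    idx = np.linspace(0, total - 1, num_keep).astype(int)
    return frames[idx]

def spatial_roi_crop(frames: np.ndarray, roi: tuple) -> np.ndarray:
    """Crop every frame to a region of interest given as (top, left, h, w)."""
    top, left, h, w = roi
    return frames[:, top:top + h, left:left + w, :]

def reduce_visual_tokens(frames: np.ndarray, num_keep: int,
                         roi: tuple, patch: int = 14):
    """Apply temporal sampling then spatial cropping; return the kept
    frames and a rough visual-token count assuming ViT-style patches."""
    kept = spatial_roi_crop(temporal_sample(frames, num_keep), roi)
    t, h, w, _ = kept.shape
    return kept, t * (h // patch) * (w // patch)
```

For example, a 32-frame 224x224 clip yields 32 x 16 x 16 = 8192 patch tokens; keeping 4 frames and a 112x112 central crop leaves 4 x 8 x 8 = 256 tokens, a 32x reduction of the kind the summary describes.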
Keywords
» Artificial intelligence » Prompt » Question answering