Summary of HawkEye: Training Video-Text LLMs for Grounding Text in Videos, by Yueqian Wang et al.
HawkEye: Training Video-Text LLMs for Grounding Text in Videos
by Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao
First submitted to arXiv on: 15 Mar 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary
---|---|---
High | Paper authors | Read the original abstract here
Medium | GrooveSquid.com (original content) | This paper proposes a novel approach to video-text large language models (LLMs) that can perform temporal video grounding in a fully text-to-text manner. The proposed model, HawkEye, is designed to understand and reason about temporal information in long and complicated videos, the most fundamental difference between videos and images. To train HawkEye, the authors construct a large-scale video-text corpus, InternVid-G, with segment-level captions and negative spans, and introduce two new time-aware training objectives for video-text LLMs. The authors also propose a coarse-grained method of representing segments in videos that is more robust and easier for LLMs to learn and follow than the alternatives. Experimental results show that HawkEye outperforms existing models on temporal video grounding tasks and performs comparably on other video-text tasks.
Low | GrooveSquid.com (original content) | This paper helps us understand how computers can better watch and understand long videos with lots of information. Right now, computers are really good at understanding short videos and images, but they struggle to follow what's happening in longer videos. The researchers developed a new computer model called HawkEye that can do a better job of watching and understanding long videos by breaking them down into smaller parts. They also created a big collection of video and text data that the computer can use to learn from. This new approach helps computers understand more about what's happening in longer videos, which is important for many applications like surveillance or film analysis.
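To give a feel for what a "coarse-grained" segment representation might look like, here is a minimal sketch that quantizes a time span into a small number of bins and maps it back to an approximate span. This is an illustrative assumption, not the paper's actual scheme; the function names, bin count, and quantization rule are all hypothetical.

```python
# Hypothetical sketch of a coarse-grained temporal segment representation:
# a [start, end] span in seconds is reduced to a pair of bin indices,
# which an LLM could emit as plain text (e.g. "bins 0-2").
# The bin count and rounding rule here are illustrative choices only.

def segment_to_coarse(start, end, duration, n_bins=4):
    """Map a [start, end] span (seconds) to coarse bin indices."""
    bin_len = duration / n_bins
    first = min(int(start // bin_len), n_bins - 1)
    last = min(int(end // bin_len), n_bins - 1)
    return first, last

def coarse_to_span(first, last, duration, n_bins=4):
    """Recover an approximate time span from coarse bin indices."""
    bin_len = duration / n_bins
    return first * bin_len, (last + 1) * bin_len

# In a 60-second video, a segment from 14 s to 31 s covers bins 0 through 2,
# which decodes back to the approximate span [0 s, 45 s].
bins = segment_to_coarse(14, 31, 60)   # (0, 2)
span = coarse_to_span(*bins, 60)       # (0.0, 45.0)
```

The intuition, as the summary puts it, is that such coarse targets are easier for a text-to-text model to learn and emit robustly than exact timestamps.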
Keywords
- Artificial intelligence
- Grounding