Summary of LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding, by Haoning Wu et al.
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by Haoning Wu, Dongxu Li, Bei Chen, Junnan Li
First submitted to arXiv on: 22 Jul 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
High Difficulty Summary (written by the paper authors)
Read the original abstract here.
Medium Difficulty Summary (original content by GrooveSquid.com)
Large multimodal models (LMMs) are increasingly handling longer and richer inputs. Despite this progress, few public benchmarks exist to measure such advancements. To address this gap, the authors introduce LongVideoBench, a question-answering benchmark featuring video-language interleaved inputs up to an hour long. The benchmark comprises 3,763 web-collected videos of varying length, with subtitles, across diverse themes, designed to comprehensively evaluate LMMs on long-term multimodal understanding. The primary challenge is formulated as accurately retrieving and reasoning over detailed multimodal information from long inputs, and the paper introduces a novel video question-answering task termed referring reasoning: questions reference related video contexts, and models must reason over the relevant details. The benchmark contains 6,678 human-annotated multiple-choice questions in 17 fine-grained categories, making it one of the most comprehensive benchmarks for long-form video understanding. Evaluations suggest that even advanced proprietary models struggle with LongVideoBench, while open-source counterparts show an even larger performance gap. Model performance improves only when models can process more frames, positioning LongVideoBench as a valuable benchmark for evaluating future generations of long-context LMMs.
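To make the evaluation setup concrete, here is a minimal sketch of how accuracy on a multiple-choice video QA benchmark like LongVideoBench could be computed. The item fields and option layout below are illustrative assumptions, not the benchmark's actual schema; only the broad setup (long videos with subtitles, human-annotated multiple-choice questions in fine-grained categories) comes from the paper.

```python
import random

# Illustrative item for a LongVideoBench-style multiple-choice question.
# Field names and values are hypothetical, not the benchmark's real schema.
example_item = {
    "video_id": "abc123",
    "duration_seconds": 3450,        # long-form input: up to an hour
    "category": "example-category",  # one of 17 fine-grained categories
    "question": "What is the presenter holding when the subtitle "
                "'Let us begin' appears?",
    "options": {"A": "a microphone", "B": "a book",
                "C": "a laptop", "D": "a cup"},
    "answer": "A",
}

def accuracy(items, predict):
    """Multiple-choice accuracy, given predict(item) -> an option letter."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)

# Random-guess baseline: with 4 options per question, expected accuracy
# is about 25%, a useful floor when reading reported model scores.
items = [example_item] * 100
print(f"random baseline: {accuracy(items, lambda it: random.choice('ABCD')):.2%}")
```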
Low Difficulty Summary (original content by GrooveSquid.com)
Imagine having super smart machines that can understand videos really well. Right now, we don’t have many ways to test how good they are at understanding longer videos. To fix this problem, scientists created a new test called LongVideoBench. This test has over 3,700 videos with subtitles, and it’s designed to see if the super smart machines can understand what’s going on in long videos. The test is special because it asks the machines to find specific parts of the video that are important for understanding what’s happening. Scientists think this test will help them create even better machines that can understand longer videos really well.
Keywords
» Artificial intelligence » Question answering