Summary of LITA: Language Instructed Temporal-Localization Assistant, by De-An Huang et al.
LITA: Language Instructed Temporal-Localization Assistant
by De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
First submitted to arXiv on: 27 Mar 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This research paper proposes a novel approach to multimodal Large Language Models (LLMs) that enables accurate temporal localization, crucial for answering “When?” questions. The authors identify three key limitations in existing LLMs: time representation, architecture, and data. To address these shortcomings, they introduce the Language Instructed Temporal-Localization Assistant (LITA), featuring time tokens, SlowFast tokens, and a new dataset, ActivityNet-RTL. LITA demonstrates strong performance on the challenging Reasoning Temporal Localization task, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. Additionally, the authors show that emphasizing temporal localization improves video-based text generation by 36% relative to existing Video LLMs. |
| Low | GrooveSquid.com (original content) | This paper is about helping computers understand videos better. Right now, these computers can answer some questions about what’s happening in a video, but they struggle to tell us when something happens. The authors of this paper think that’s because the way they represent time and store information about videos isn’t very good. They propose a new approach called LITA that does things differently. It uses special tokens to understand timestamps and has a unique architecture that captures temporal information. They also created a new dataset with challenges that require computers to reason about when events happen in a video. The results show that LITA is much better than existing methods at answering these “When?” questions. |
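
The summaries note that LITA represents time with discrete “time tokens” rather than raw timestamps. As a rough illustration of that idea only, and not the paper’s exact implementation, the sketch below shows one way to map an absolute timestamp to a relative time token by dividing a video into a fixed number of chunks. The function names, the chunk count of 100, and the encode/decode scheme are assumptions made for this example.

```python
def timestamp_to_time_token(t_seconds: float, video_length: float,
                            num_time_tokens: int = 100) -> str:
    """Map an absolute timestamp to a relative time token <1>..<T>.

    Hypothetical helper illustrating the time-token idea mentioned in the
    summary; the actual tokenization details are defined in the paper.
    """
    # Normalize to [0, 1], then bucket into one of T equal-length chunks.
    frac = min(max(t_seconds / video_length, 0.0), 1.0)
    index = min(int(frac * num_time_tokens) + 1, num_time_tokens)
    return f"<{index}>"


def time_token_to_timestamp(token: str, video_length: float,
                            num_time_tokens: int = 100) -> float:
    """Decode a time token back to the approximate center of its chunk."""
    index = int(token.strip("<>"))
    return (index - 0.5) / num_time_tokens * video_length


# Example: in a 120-second video, 45.0 s falls into chunk <38> of 100,
# which decodes back to roughly 45.0 s.
token = timestamp_to_time_token(45.0, video_length=120.0)
print(token, round(time_token_to_timestamp(token, video_length=120.0), 1))
```

The appeal of a scheme like this is that the model only has to predict a small, fixed vocabulary of relative time tokens instead of arbitrary numbers, which is what makes “When?” answers easier to localize and to score with metrics such as temporal mIoU.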
Keywords
» Artificial intelligence » Text generation