Artemis: Towards Referential Understanding in Complex Videos
by Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian
First submitted to arXiv on: 1 Jun 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | The paper presents Artemis, a multimodal large language model (MLLM) that excels in referential understanding scenarios such as video-based referring. Given a video and a natural-language question with a bounding box, Artemis describes the referred target throughout the entire video. The key is extracting compact, target-specific video features, achieved by tracking the target and selecting spatiotemporal features from the video. The model is trained on the VideoRef45K dataset with a three-stage training procedure to improve performance. Results are promising both quantitatively and qualitatively, and Artemis can be integrated with other tools for more complex scenarios. |
| Low | GrooveSquid.com (original content) | Artemis is a new kind of AI that can understand what's happening in videos and answer questions about them. Right now, most AI models aren't very good at this kind of task, but Artemis is different. It takes a video and a question, and then tries to find the answer within the video. The key to making this work is figuring out how to get useful information from the video itself. This paper shows that Artemis can do this really well, and it's even better when combined with other tools like video grounding and text summarization. |
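The medium summary's core idea, pooling features from a tracked bounding box in each frame and keeping only a compact, representative subset, can be sketched in a few lines. This is a toy illustration only, not the paper's actual architecture: the function names, the average-pooling scheme, and the greedy farthest-point selection heuristic are all assumptions for demonstration.

```python
import numpy as np

def roi_pool(feature_map, box):
    """Average-pool the features inside a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return feature_map[y1:y2, x1:x2, :].mean(axis=(0, 1))

def select_target_features(frame_features, tracked_boxes, k=3):
    """Pool one vector per frame from the tracked box, then keep the k
    most mutually dissimilar vectors as a compact target representation."""
    pooled = np.stack([roi_pool(f, b)
                       for f, b in zip(frame_features, tracked_boxes)])
    # Greedy selection: start from frame 0, then repeatedly add the frame
    # whose pooled feature is farthest (L2) from all frames chosen so far.
    chosen = [0]
    while len(chosen) < min(k, len(pooled)):
        dists = np.min(
            np.linalg.norm(pooled[:, None] - pooled[chosen][None], axis=-1),
            axis=1,
        )
        chosen.append(int(np.argmax(dists)))
    return pooled[sorted(chosen)]

# Toy example: 8 frames of 32x32 feature maps with 16 channels each.
rng = np.random.default_rng(0)
frames = [rng.standard_normal((32, 32, 16)) for _ in range(8)]
boxes = [(4, 4, 20, 20)] * 8  # the tracked box in each frame
compact = select_target_features(frames, boxes, k=3)
print(compact.shape)  # (3, 16)
```

The compact feature set would then be fed, alongside the question, to the language model; the selection step is what keeps the video representation target-specific rather than frame-exhaustive.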
Keywords
» Artificial intelligence » Bounding box » Grounding » Large language model » Spatiotemporal » Summarization » Tracking