
Artemis: Towards Referential Understanding in Complex Videos

by Jihao Qiu, Yuan Zhang, Xi Tang, Lingxi Xie, Tianren Ma, Pengyu Yan, David Doermann, Qixiang Ye, Yunjie Tian

First submitted to arXiv on: 1 Jun 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
The paper’s original abstract; read it via the link on arXiv.

Medium Difficulty Summary (GrooveSquid.com, original content)
The paper presents Artemis, a multimodal large language model (MLLM) designed for referential understanding tasks such as video-based referring. Given a video and a natural-language question that refers to a bounding box, Artemis describes the referred target across the entire video. The key idea is extracting compact, target-specific video features, achieved by tracking the target and selecting spatiotemporal features along its trajectory. The model is trained on the VideoRef45K dataset with a three-stage training procedure. Results are promising both quantitatively and qualitatively, and Artemis can be integrated with other tools, such as video grounding and text summarization, to handle more complex scenarios.
Low Difficulty Summary (GrooveSquid.com, original content)
Artemis is a new kind of AI that can understand what’s happening in videos and answer questions about them. Right now, most AI models aren’t very good at this kind of thing, but Artemis is different. It takes a video and a question, and then tries to find the answer within the video. The key to making this work is figuring out how to get useful information from the video itself. This paper shows that Artemis can do this really well, and it’s even better when you use it with other tools like video grounding and text summarization.
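The idea of tracking a referred target and keeping only a compact set of its features can be sketched in plain Python. This is a toy illustration, not the paper's actual method: `track_roi_features` stands in for a real tracker plus visual encoder (here it just mean-pools the box region of each frame), and `select_compact_features` uses simple greedy farthest-point selection as a proxy for choosing diverse spatiotemporal features; all function names and shapes are assumptions for illustration.

```python
def track_roi_features(frames, box):
    """Hypothetical stand-in for a tracker + visual encoder: pool one
    scalar feature per frame from the referred bounding-box region."""
    x0, y0, x1, y1 = box
    feats = []
    for frame in frames:
        region = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
        feats.append(sum(region) / len(region))
    return feats  # one feature per frame

def select_compact_features(feats, k):
    """Greedy farthest-point selection: keep k frame features that are
    maximally spread out, a simple proxy for compact, target-specific
    feature selection instead of feeding every frame to the LLM."""
    chosen = [0]
    while len(chosen) < k:
        # Pick the frame whose feature is farthest from all chosen ones.
        best = max(
            (i for i in range(len(feats)) if i not in chosen),
            key=lambda i: min(abs(feats[i] - feats[c]) for c in chosen),
        )
        chosen.append(best)
    return [feats[i] for i in sorted(chosen)]

# Toy "video": 8 frames of 8x8 grayscale values in [0, 1],
# with the target box given on the first frame.
video = [[[((f * 7 + y * 3 + x) % 11) / 10 for x in range(8)]
          for y in range(8)] for f in range(8)]
per_frame = track_roi_features(video, box=(2, 2, 6, 6))
compact = select_compact_features(per_frame, k=3)
print(len(per_frame), len(compact))  # 8 3
```

In the actual system, the selected features would be projected into the language model's embedding space and interleaved with the question tokens, so the model can describe the referred target rather than the whole scene.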

Keywords

» Artificial intelligence  » Bounding box  » Grounding  » Large language model  » Spatiotemporal  » Summarization  » Tracking