Summary of xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs, by Michael S. Ryoo et al.
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
by Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
First submitted to arXiv on: 21 Oct 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | xGen-MM-Vid (BLIP-3-Video) is a multimodal language model for videos, introduced to efficiently capture temporal information across multiple frames. In addition to the conventional visual tokenizer, it uses a "temporal encoder", which reduces the video to far fewer visual tokens than competing models require (just 32, as the title states). The paper explores different types of temporal encoders and experimentally confirms that BLIP-3-Video achieves video question-answering accuracy comparable to much larger state-of-the-art models while being smaller and more efficient. A minimal sketch of this token-compression idea appears below the table. |
| Low | GrooveSquid.com (original content) | A new model for understanding videos is developed. This model, called xGen-MM-Vid (BLIP-3-Video), can analyze multiple frames of a video at once. It does this by using a "temporal encoder" alongside a "visual tokenizer", which lets it work with much less information than other models need. The researchers tried out different kinds of temporal encoders and found that their model works just as well as bigger, more complicated models. |
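
To make the idea above concrete, here is a minimal PyTorch-style sketch of a temporal encoder that compresses per-frame visual tokens into a fixed budget of 32 video tokens before they reach the language model. The class name, shapes, and the attention-pooling design are illustrative assumptions, not the authors' implementation (the paper itself compares several temporal-encoder variants).

```python
# Minimal sketch of the core idea: a temporal encoder that compresses
# per-frame visual tokens into a small, fixed number of "video tokens"
# (e.g. 32) before they are handed to the language model.
# All names and shapes below are illustrative assumptions, not the
# authors' actual implementation.
import torch
import torch.nn as nn


class AttentionTokenPooler(nn.Module):
    """Pools a variable number of frame tokens into `num_video_tokens`
    learnable-query tokens via cross-attention (one of several possible
    temporal-encoder designs)."""

    def __init__(self, dim: int = 768, num_video_tokens: int = 32, num_heads: int = 8):
        super().__init__()
        # Learnable query tokens; their count fixes the output token budget.
        self.queries = nn.Parameter(torch.randn(num_video_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames * tokens_per_frame, dim)
        batch = frame_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Cross-attention: queries attend over all frame tokens and
        # summarize them into a fixed number of video tokens.
        video_tokens, _ = self.attn(q, frame_tokens, frame_tokens)
        return video_tokens  # (batch, num_video_tokens, dim)


# Example: 8 frames with 196 visual tokens each are pooled down to
# 32 video tokens, which would then be concatenated with text tokens
# and passed to the language model.
frames = torch.randn(2, 8 * 196, 768)
pooler = AttentionTokenPooler()
print(pooler(frames).shape)  # torch.Size([2, 32, 768])
```

Running the snippet prints `torch.Size([2, 32, 768])`: no matter how many frames go in, the language model only ever sees 32 video tokens, which is the source of the efficiency gain described in the summaries above.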
Keywords
» Artificial intelligence » Encoder » Language model » Question answering » Tokenizer