Summary of CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion, by Shoubin Yu et al.
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
by Shoubin Yu, Jaehong Yoon, Mohit Bansal
First submitted to arXiv on: 8 Feb 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper presents CREMA, a multimodal reasoning framework that can incorporate any new modality to enhance video reasoning. The authors address the limitations of existing models by proposing a generalizable, efficient, and modular architecture that processes multiple modalities without requiring significant parameter updates. The framework first gathers various modalities from a given video, such as optical flow, 3D point clouds, audio, thermal heatmaps, and touch maps, using sensors or pre-trained models. A query transformer with parameter-efficient modules then projects each modality's features into the token embedding space of a large language model (LLM); a minimal illustrative sketch of this pipeline appears after the table. The authors also propose a progressive multimodal fusion design that compresses information across modalities while keeping computation in the LLM efficient. CREMA is validated on 7 video-language reasoning tasks, matching or exceeding strong multimodal LLMs while reducing trainable parameters by over 90%. |
Low | GrooveSquid.com (original content) | This paper makes it easier for machines to understand videos by combining different types of information from the video. The authors created a new way to process multiple sources of data, such as sound, movement, and temperature, without needing to rewrite the whole system. They used this approach to improve how well machines can answer questions about videos. This is important because it could help us use machines to do tasks that are hard for humans, like analyzing large amounts of video footage. |
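
To make the medium summary's pipeline concrete, below is a minimal PyTorch-style sketch of the idea it describes: each modality's features are adapted by a small parameter-efficient module, a set of learnable queries attends to them, and the resulting tokens are projected into an LLM's token embedding space. All class names, shapes, and the simple concatenation used for fusion here are illustrative assumptions, not the authors' released implementation of CREMA or of its progressive fusion design.

```python
import torch
import torch.nn as nn


class ModalityAdapter(nn.Module):
    """Parameter-efficient adapter for one extra modality (e.g. optical flow, audio)."""

    def __init__(self, feat_dim: int, hidden_dim: int, bottleneck: int = 64):
        super().__init__()
        self.proj_in = nn.Linear(feat_dim, hidden_dim)
        self.adapter = nn.Sequential(
            nn.Linear(hidden_dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.proj_in(x)
        return h + self.adapter(h)  # residual bottleneck update


class MultimodalQueryFusion(nn.Module):
    """Learnable queries attend to each modality, then map into the LLM token space."""

    def __init__(self, feat_dims: dict, hidden_dim: int, llm_dim: int, n_query: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_query, hidden_dim) * 0.02)
        self.adapters = nn.ModuleDict(
            {name: ModalityAdapter(dim, hidden_dim) for name, dim in feat_dims.items()}
        )
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(hidden_dim, llm_dim)  # projection to the LLM embedding width

    def forward(self, feats: dict) -> torch.Tensor:
        fused = []
        for name, x in feats.items():  # x: (batch, n_tokens, feat_dim)
            kv = self.adapters[name](x)
            q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
            out, _ = self.cross_attn(q, kv, kv)  # queries read from modality tokens
            fused.append(out)
        # Placeholder for the paper's progressive fusion: plain concatenation over modalities.
        tokens = torch.cat(fused, dim=1)
        return self.to_llm(tokens)  # ready to prepend to the LLM's text embeddings


# Toy usage with made-up feature widths for RGB frames, optical flow, and audio.
model = MultimodalQueryFusion({"rgb": 768, "flow": 512, "audio": 128},
                              hidden_dim=768, llm_dim=4096)
feats = {
    "rgb": torch.randn(2, 256, 768),
    "flow": torch.randn(2, 196, 512),
    "audio": torch.randn(2, 64, 128),
}
llm_tokens = model(feats)  # shape: (2, 3 * 32, 4096)
```

In the paper, the progressive fusion step compresses modality tokens so the LLM sees a small, fixed budget of inputs; the concatenation above is only a stand-in for that step.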
Keywords
» Artificial intelligence » Embedding space » Large language model » Optical flow » Parameter efficient » Temperature » Token » Transformer