
Summary of CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion, by Shoubin Yu et al.


CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

by Shoubin Yu, Jaehong Yoon, Mohit Bansal

First submitted to arXiv on: 8 Feb 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper but is written at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here

Medium Difficulty Summary (original content by GrooveSquid.com)
This paper presents CREMA, a novel multimodal reasoning framework that can incorporate any new modality to enhance video reasoning. The authors address the limitations of existing models with a generalizable, efficient, and modular architecture that processes multiple modalities without requiring significant parameter updates. The framework first gathers various modalities from a given video, such as optical flow, 3D point clouds, audio, thermal heatmaps, and touch maps, using sensors or pre-trained models. It then uses a query transformer with parameter-efficient modules to project each modality's features into the token embedding space of a large language model (LLM). The authors also propose a progressive multimodal fusion design that compresses information across the modalities while keeping computation in the LLM efficient (a toy sketch of this projection-and-fusion idea follows the summaries below). CREMA is validated on 7 video-language reasoning tasks, achieving better or equivalent performance compared to strong multimodal LLMs while reducing trainable parameters by over 90%.

Low Difficulty Summary (original content by GrooveSquid.com)
This paper makes it easier for machines to understand videos by combining different types of information from the video. The authors created a new way to process multiple sources of data, such as sound, movement, and temperature, without needing to rewrite the whole system. They used this approach to improve how well machines can answer questions about videos. This is important because it could help us use machines for tasks that are hard for humans, like analyzing large amounts of video footage.

Keywords

» Artificial intelligence  » Embedding space  » Large language model  » Optical flow  » Parameter efficient  » Temperature  » Token  » Transformer