Summary of COEF-VQ: Cost-Efficient Video Quality Understanding Through a Cascaded Multimodal LLM Framework, by Xin Dong et al.
COEF-VQ: Cost-Efficient Video Quality Understanding through a Cascaded Multimodal LLM Framework
by Xin Dong, Sen Jia, Hongyu Xiong
First submitted to arXiv on: 11 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
| Summary difficulty | Written by | Summary |
|---|---|---|
| High | Paper authors | Read the original abstract here |
| Medium | GrooveSquid.com (original content) | This paper proposes COEF-VQ, a novel cascaded Multimodal Large Language Model (MLLM) framework for better video quality understanding on TikTok. The framework fuses visual, textual, and audio signals, with a lightweight model serving as the pre-filtering stage and the MLLM as the fine-consideration stage (see the sketch after this table). This design significantly reduces the need for GPU resources while retaining the performance demonstrated by the MLLM alone. The authors deploy the framework on TikTok's video management platform (VMP) and run experiments on two in-house video quality understanding tasks, showing that COEF-VQ delivers substantial performance gains with limited resource consumption. |
| Low | GrooveSquid.com (original content) | This paper creates a new way for computers to understand videos better. It uses a special type of artificial intelligence called a Multimodal Large Language Model (MLLM) to analyze videos on TikTok. The MLLM looks at a video's visuals, text, and audio together to make decisions, which makes it more efficient and effective than methods that only look at one part of the video. The authors tested the new method on TikTok and showed that it can make better decisions while using less computing power. |
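The key idea in the medium-difficulty summary is the cascade: a cheap model screens every video, and only the uncertain cases are escalated to the GPU-heavy MLLM. The Python sketch below illustrates that control flow only; the function names, thresholds, and stand-in scorers are hypothetical and are not taken from the paper.

```python
"""Minimal sketch of a two-stage cascade in the spirit of COEF-VQ.

All names here (lightweight_score, mllm_score, LOW, HIGH, ...) are
hypothetical placeholders, not the paper's actual API. The point is only
that a cheap pre-filter handles most traffic, and the expensive MLLM is
invoked for the small fraction of uncertain videos.
"""

from dataclasses import dataclass
import random


@dataclass
class Video:
    video_id: str
    # In a real system these would be frames, title/OCR text, and audio.
    features: dict


def lightweight_score(video: Video) -> float:
    """Stage 1: cheap pre-filtering model (stand-in: random score in [0, 1])."""
    return random.random()


def mllm_score(video: Video) -> float:
    """Stage 2: expensive multimodal LLM (stand-in: random score in [0, 1])."""
    return random.random()


# Hypothetical thresholds: scores below LOW are confidently "good quality",
# scores above HIGH are confidently "low quality"; only the uncertain band
# in between is escalated to the MLLM.
LOW, HIGH = 0.2, 0.8


def classify(video: Video) -> tuple[str, str]:
    """Return (decision, stage_used) for one video."""
    s = lightweight_score(video)
    if s < LOW:
        return "good_quality", "pre_filter"
    if s > HIGH:
        return "low_quality", "pre_filter"
    # Uncertain case: pay the GPU cost of the MLLM only here.
    decision = "low_quality" if mllm_score(video) > 0.5 else "good_quality"
    return decision, "mllm"


if __name__ == "__main__":
    videos = [Video(video_id=f"v{i}", features={}) for i in range(1000)]
    results = [classify(v) for v in videos]
    escalated = sum(stage == "mllm" for _, stage in results)
    print(f"{escalated} of {len(videos)} videos escalated to the MLLM; "
          f"the pre-filter resolved the remaining {len(videos) - escalated}.")
```

The resource savings come from running the MLLM only on the escalated fraction; where the thresholds sit trades recall on low-quality videos against GPU cost.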
Keywords
» Artificial intelligence » Fine tuning » Large language model