Summary of ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models, by Vipula Rawte et al.
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
by Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
First submitted to arXiv on: 16 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper introduces ViBe, a large-scale dataset of hallucinated videos generated by open-source Text-to-Video (T2V) models. The authors identify five major types of hallucination: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. To evaluate the reliability of T2V models, they generate 3,782 videos from diverse MS COCO captions using ten different models and manually annotate them. The benchmark comprises this dataset together with a classification framework built on video embeddings. As a baseline, the TimeSFormer + CNN ensemble achieves the best performance (0.345 accuracy, 0.342 F1 score). These low scores highlight the difficulty of automated hallucination detection and underscore the need for improved methods to drive the development of more robust T2V models and to evaluate their outputs based on user preferences. |
Low | GrooveSquid.com (original content) | This paper builds a big dataset called ViBe that collects mistakes in videos generated from text prompts. It identifies five main kinds of mistakes, such as subjects disappearing, wrong numbers of objects, and strange-looking visuals. The authors use ten different video-generating tools to make 3,782 videos and check each one by hand. They also build a program that tries to sort the mistakes into categories automatically. This helps scientists develop video-generation tools that can be trusted. |
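The classification framework the paper describes — video embeddings mapped to one of the five hallucination categories — can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the random Gaussian "embeddings" are placeholders standing in for features from a video encoder such as TimeSFormer, and the nearest-centroid classifier is a deliberately simple stand-in for the paper's CNN ensemble.

```python
import numpy as np

# The five hallucination categories identified in ViBe.
CATEGORIES = [
    "Vanishing Subject",
    "Omission Error",
    "Numeric Variability",
    "Subject Dysmorphia",
    "Visual Incongruity",
]

rng = np.random.default_rng(0)
DIM = 64  # placeholder embedding dimension (hypothetical)

def make_split(n_per_class):
    """Synthetic stand-in for real video embeddings: one Gaussian blob
    per hallucination category."""
    X = np.concatenate([
        rng.normal(loc=c, scale=2.0, size=(n_per_class, DIM))
        for c in range(len(CATEGORIES))
    ])
    y = np.repeat(np.arange(len(CATEGORIES)), n_per_class)
    return X, y

X_train, y_train = make_split(40)
X_test, y_test = make_split(10)

# Nearest-centroid classifier over the embeddings: each test video is
# assigned the category whose mean training embedding is closest.
centroids = np.stack([
    X_train[y_train == c].mean(axis=0) for c in range(len(CATEGORIES))
])
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == y_test).mean()
print(f"accuracy: {accuracy:.3f}")
```

With real embeddings, the centroid step would be replaced by a trained classifier (the paper reports a TimeSFormer + CNN ensemble as its strongest baseline); the point here is only the overall shape of the pipeline — embed each video, then classify the embedding into one of the five categories.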
Keywords
» Artificial intelligence » Classification » CNN » F1 score » Hallucination