Summary of ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models, by Vipula Rawte et al.
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
by Vipula Rawte, Sarthak Jain, Aarush Sinha, Garv Kaushik, Aman Bansal, Prathiksha Rumale Vishwanath, Samyak Rajesh Jain, Aishwarya Naresh Reganti, Vinija Jain, Aman Chadha, Amit P. Sheth, Amitava Das
First submitted to arXiv on: 16 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | This paper introduces ViBe, a large-scale dataset of hallucinated videos generated by open-source Text-to-Video (T2V) models. The authors identify five major types of hallucination: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. To evaluate the reliability of T2V models, they generate 3,782 videos from diverse MS COCO captions using ten different models and manually annotate them. The benchmark comprises this dataset together with a classification framework built on video embeddings. As a baseline, the TimeSFormer + CNN ensemble achieves the best performance (0.345 accuracy, 0.342 F1 score). These low scores highlight the difficulty of automated hallucination detection and underscore the need for improved methods to drive the development of more robust T2V models and to evaluate their outputs based on user preferences. |
Low | GrooveSquid.com (original content) | This paper builds a big dataset called ViBe that collects mistakes in videos generated from text prompts. It identifies five main kinds of mistakes, such as subjects disappearing, wrong numbers of objects, and strange-looking visuals. The authors use ten different video-generating tools to make 3,782 videos and check each one by hand. They also build a program that tries to sort the mistakes into categories automatically. This helps scientists develop video-generation tools that can be trusted. |
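The classification framework the paper describes — video embeddings mapped to one of the five hallucination categories — can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the random Gaussian "embeddings" are placeholders standing in for features from a video encoder such as TimeSFormer, and the nearest-centroid classifier is a deliberately simple stand-in for the paper's CNN ensemble.

```python
import numpy as np

# The five hallucination categories identified in ViBe.
CATEGORIES = [
    "Vanishing Subject",
    "Omission Error",
    "Numeric Variability",
    "Subject Dysmorphia",
    "Visual Incongruity",
]

rng = np.random.default_rng(0)
DIM = 64  # placeholder embedding dimension (hypothetical)

def make_split(n_per_class):
    """Synthetic stand-in for real video embeddings: one Gaussian blob
    per hallucination category."""
    X = np.concatenate([
        rng.normal(loc=c, scale=2.0, size=(n_per_class, DIM))
        for c in range(len(CATEGORIES))
    ])
    y = np.repeat(np.arange(len(CATEGORIES)), n_per_class)
    return X, y

X_train, y_train = make_split(40)
X_test, y_test = make_split(10)

# Nearest-centroid classifier over the embeddings: each test video is
# assigned the category whose mean training embedding is closest.
centroids = np.stack([
    X_train[y_train == c].mean(axis=0) for c in range(len(CATEGORIES))
])
dists = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)

accuracy = (pred == y_test).mean()
print(f"accuracy: {accuracy:.3f}")
```

With real embeddings, the centroid step would be replaced by a trained classifier (the paper reports a TimeSFormer + CNN ensemble as its strongest baseline); the point here is only the overall shape of the pipeline — embed each video, then classify the embedding into one of the five categories.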
Keywords
» Artificial intelligence » Classification » CNN » F1 score » Hallucination