Summary of Not (yet) the Whole Story: Evaluating Visual Storytelling Requires More Than Measuring Coherence, Grounding, and Repetition, by Aditya K Surikuchi et al.
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
by Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
First submitted to arXiv on: 5 Jul 2024
Categories
- Main: Computation and Language (cs.CL)
- Secondary: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | High Difficulty Summary Read the original abstract here |
Medium | GrooveSquid.com (original content) | Medium Difficulty Summary This paper introduces a novel method for evaluating stories that models generate from temporally ordered image sequences. The method measures how human-like a story is along three key aspects: visual grounding, coherence, and repetitiveness. The authors apply it to several models, including LLaVA and TAPM, a much smaller visual storytelling model. Surprisingly, TAPM performs competitively with LLaVA despite having significantly fewer parameters. Upgrading TAPM’s visual and language components yields a model that achieves competitive results while keeping a relatively low number of parameters. Finally, based on a human evaluation, the study concludes that a ‘good’ story may require more than just human-like levels of visual grounding, coherence, and repetition. |
Low | GrooveSquid.com (original content) | Low Difficulty Summary This paper helps us understand how to judge stories that computers write about a series of pictures. Right now, it’s hard to decide what makes such a story good because there isn’t one agreed-upon way to measure it. The authors come up with a new method that looks at three important things: whether the words match the images, whether the story makes sense, and how much the story repeats itself. For each one, they check how close the computer’s story is to a story written by a person. They use this method to test several computer models that can create stories from pictures. They find that one model, called TAPM, does surprisingly well even though it’s much smaller than another popular model, LLaVA. By upgrading parts of TAPM, they get similar results while keeping the model small. Finally, the authors ask people to rate the stories and discover that there might be more to a good story than these three things alone. |
Keywords
» Artificial intelligence » Grounding