
Summary of User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance, by Mrinal Verghese et al.


User-in-the-loop Evaluation of Multimodal LLMs for Activity Assistance

by Mrinal Verghese, Brian Chen, Hamid Eghbalzadeh, Tushar Nagarajan, Ruta Desai

First submitted to arXiv on: 4 Aug 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
Our research explores the capabilities of modern multimodal reasoning models powered by Large Language Models (LLMs) to facilitate vision-powered assistants for multi-step daily activities. These assistants must be able to encode relevant visual history from sensors, forecast future actions, and replan based on user feedback. We evaluate these capabilities using two prominent LLM approaches – Socratic Models and Vision Conditioned Language Models (VCLMs) – on video-based action anticipation tasks using offline datasets. However, offline evaluation does not allow us to close the loop with the user, which is essential for evaluating replanning capabilities and measuring successful activity completion in assistive scenarios. To address this, we conduct a first-of-its-kind user study with 18 participants performing 3 different multi-step cooking activities while wearing an egocentric observation device called Aria and following assistance from multimodal LLMs. Our results show that the Socratic approach outperforms VCLMs in both offline and online settings. We also highlight the challenges of grounding long visual history, which is common in activity assistance, especially for VCLMs.
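
To make the contrast between the two approaches concrete, the short Python sketch below illustrates a Socratic-style pipeline in the spirit described above (it is not the authors' implementation): each egocentric frame is turned into a text caption by a vision model, and a text-only LLM reasons over that textual history to forecast the next step. A VCLM would instead feed visual embeddings directly into the language model rather than captions. The function names, prompt wording, and stub models here are illustrative assumptions.

```python
from typing import Callable, List

def socratic_next_step(
    frames: List[object],                    # egocentric video frames (e.g. from an Aria device)
    task: str,                               # the multi-step activity, e.g. a recipe name
    caption_frame: Callable[[object], str],  # any image-captioning model (hypothetical interface)
    llm: Callable[[str], str],               # any text-only LLM (hypothetical interface)
) -> str:
    """Forecast the user's next action from a captioned visual history."""
    # 1) Encode the visual history as text: one caption per observed frame.
    history = [caption_frame(f) for f in frames]
    # 2) Ask the LLM to reason over the textual history and predict the next step.
    prompt = (
        f"The user is performing the activity: {task}.\n"
        "Observed steps so far:\n"
        + "\n".join(f"- {c}" for c in history)
        + "\nWhat should the user do next? Answer with one short instruction."
    )
    return llm(prompt)

# Tiny stub demo so the sketch runs end to end without real models.
if __name__ == "__main__":
    fake_captioner = lambda frame: f"user {frame}"
    fake_llm = lambda prompt: "Crack the eggs into the bowl."
    frames = ["picks up a bowl", "places the bowl on the counter"]
    print(socratic_next_step(frames, "making an omelette", fake_captioner, fake_llm))
```

In an interactive assistant, the same call would be repeated after each new observation, and user feedback could be appended to the prompt to support replanning.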

Low Difficulty Summary (original content by GrooveSquid.com)
This research investigates how well AI models can help with daily activities by using a camera to capture what’s happening and predicting what you’ll do next. The goal is to create an assistant that can guide you through tasks like cooking or doing chores. To test this idea, we used two types of AI models – Socratic Models and Vision Conditioned Language Models (VCLMs) – on videos of people performing different actions. We found that one type of model performed better than the other in both simulated and real-world scenarios. This study also shows how hard it is to use AI models to understand long sequences of events, which is important for helping with complex tasks.

Keywords

* Artificial intelligence
* Grounding