
RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

by Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev

First submitted to arXiv on: 11 Dec 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI); Robotics (cs.RO)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)

Read the original abstract here.

Medium Difficulty Summary (written by GrooveSquid.com; original content)

Vision-and-Language Navigation (VLN) has relied on the manual curation of existing simulators, limiting the diversity and scale of its training data. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. We perform 3D reconstruction to obtain 3D trajectories of the walking paths, augmented with information on room types, object locations, and surrounding scenes. The dataset includes ~100K description-enriched trajectories with ~200K instructions and 17K action-enriched trajectories, drawn from 1,847 room tour environments. RoomTour3D enables significant improvements across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE. Moreover, it facilitates the development of trainable zero-shot VLN agents, showcasing both the potential and the challenges of advancing towards open-world navigation.
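
To make the dataset's structure concrete, below is a minimal, hypothetical sketch of how one RoomTour3D trajectory record might be represented in Python. The class and field names (Waypoint, Trajectory, room_type, visible_objects, and so on) are illustrative assumptions based on the description above, not the dataset's actual schema.

    from dataclasses import dataclass, field

    @dataclass
    class Waypoint:
        """One point along a reconstructed walking path (assumed structure)."""
        position: tuple[float, float, float]  # 3D location from reconstruction
        room_type: str                        # e.g. "kitchen", "living room"
        visible_objects: list[str]            # objects detected near this point

    @dataclass
    class Trajectory:
        """One description-enriched trajectory (assumed structure)."""
        video_id: str                # source room tour video
        waypoints: list[Waypoint]    # ordered human walking demonstration
        instructions: list[str]      # open-world navigation instructions
        actions: list[str] = field(default_factory=list)  # action labels, if any

    # Fabricated example values, for illustration only:
    wp = Waypoint(position=(1.2, 0.0, 3.4), room_type="kitchen",
                  visible_objects=["refrigerator", "countertop"])
    traj = Trajectory(video_id="tour_0001", waypoints=[wp],
                      instructions=["Walk past the refrigerator into the kitchen."])

A record of this kind pairs an ordered 3D path with the room, object, and scene annotations that, per the summary, augment each trajectory.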

Low Difficulty Summary (written by GrooveSquid.com; original content)

Imagine navigating through a room or building by following verbal instructions. This task is called Vision-and-Language Navigation (VLN). The problem is that there isn’t enough data to train computers to do it well. To fix this, researchers created a new dataset called RoomTour3D, built from real-world videos of people walking through rooms and buildings. With this dataset, computers can learn to follow verbal directions through real indoor spaces in a much more realistic way.

Keywords

» Artificial intelligence  » Zero shot