Summary of Doe-1: Closed-Loop Autonomous Driving with Large World Model, by Wenzhao Zheng et al.
Doe-1: Closed-Loop Autonomous Driving with Large World Model
by Wenzhao Zheng, Zetian Xia, Yuanhui Huang, Sicheng Zuo, Jie Zhou, Jiwen Lu
First submitted to arXiv on: 12 Dec 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The proposed Doe-1 framework tackles the limitations of existing open-loop autonomous driving methods by introducing a closed-loop approach that leverages large amounts of data. The framework formulates autonomous driving as a next-token generation problem, using multi-modal tokens to unify the perception, prediction, and planning tasks. Specifically, it employs free-form text for scene descriptions, generates future predictions in RGB space with image tokens, and encodes actions into discrete tokens using a position-aware tokenizer. A multi-modal transformer is trained end-to-end to generate these tokens autoregressively within a single unified model (a minimal sketch of this next-token loop appears below the table). The approach demonstrates effectiveness on the nuScenes dataset across tasks such as visual question answering, action-conditioned video generation, and motion planning. |
Low | GrooveSquid.com (original content) | Imagine a self-driving car that can understand its surroundings, predict what might happen next, and plan its actions accordingly. This is the goal of a new approach called Doe-1 for autonomous driving. Current methods are limited because they don't learn from data very well or make good decisions. Doe-1 changes this by using a special kind of computer model that can understand many different types of information at once. It can describe what it sees, predict what will happen next, and plan its actions. The approach is tested on a large dataset called nuScenes and shows great results on tasks like answering questions about what's happening in a scene or generating new videos conditioned on actions. |
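
To make the next-token formulation above concrete, here is a minimal sketch (not the authors' released code) of a decoder-only transformer over a shared text/image/action token vocabulary. The vocabulary sizes, model dimensions, and token layout are illustrative assumptions; Doe-1's actual tokenizers (including its position-aware action tokenizer) are more involved.

```python
# Illustrative sketch of the next-token generation loop described in the
# summary. All sizes and the token layout are hypothetical, not from the paper.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB, ACTION_VOCAB = 1000, 8192, 256
VOCAB = TEXT_VOCAB + IMAGE_VOCAB + ACTION_VOCAB  # one shared multi-modal token space


class TinyDrivingWorldModel(nn.Module):
    """Decoder-only transformer over interleaved text/image/action tokens."""

    def __init__(self, d_model=256, n_head=4, n_layer=2, max_len=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_head, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layer)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) of ids from the shared multi-modal vocabulary
        seq = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq).to(tokens.device)
        x = self.blocks(x, mask=mask)  # causal self-attention
        return self.head(x)            # next-token logits at every position


@torch.no_grad()
def rollout(model, prompt, n_new):
    """Greedy autoregressive generation: perception, prediction, and planning
    tokens all come out of the same loop."""
    tokens = prompt
    for _ in range(n_new):
        logits = model(tokens)[:, -1]          # logits for the next position
        nxt = logits.argmax(-1, keepdim=True)  # greedy pick; sampling also works
        tokens = torch.cat([tokens, nxt], dim=1)
    return tokens


model = TinyDrivingWorldModel()
# Hypothetical prompt: a few text tokens (scene description) followed by
# image tokens for the current camera frame.
prompt = torch.randint(0, VOCAB, (1, 16))
out = rollout(model, prompt, n_new=8)  # e.g. decode the tail as action tokens
print(out.shape)                       # torch.Size([1, 24])
```

In this framing, perception (text tokens), prediction (image tokens), and planning (action tokens) are all produced by the same autoregressive loop; closing the loop amounts to feeding the generated observation and action tokens back in as context for the next step.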
Keywords
» Artificial intelligence » Multi-modal » Question answering » Token » Transformer