Summary of LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement, by Siwen Jiao et al.
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement
by Siwen Jiao, Yangyi Fang, Baoyun Peng, Wangqun Chen, Bharadwaj Veeravalli
First submitted to arXiv on: 20 Nov 2024
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: Artificial Intelligence (cs.AI)
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same paper at a different level of difficulty. The medium- and low-difficulty versions are original summaries written by GrooveSquid.com, while the high-difficulty version is the paper’s original abstract. Feel free to read the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | The paper introduces LaVida Drive, a novel vision-language model (VLM) for visual question answering (VQA) in autonomous driving. The model is designed to handle dynamic driving environments by integrating temporal data while maintaining high-resolution inputs for detailed visual perception. It consists of two modules: the Query-aware Token Selection module and the Spatial-Temporal Token Recovery and Enhancement module. The former reduces the token count from the high-resolution spatial input based on semantic alignment with the input query, while the latter ensures smooth interaction between spatial and temporal information (a minimal sketch of these two steps follows the table). Experimental results show that LaVida Drive improves overall performance, enhances efficiency, and significantly reduces the number of visual tokens. |
Low | GrooveSquid.com (original content) | LaVida Drive is a new way for self-driving cars to understand questions from humans. It’s hard for computers to make sense of images or videos when they’re moving around in real time. To solve this problem, the LaVida Drive model combines information from different time points and keeps high-quality details. The model has two parts: one picks the most important visual bits based on what the question is asking, and the other makes sure all the information flows smoothly together. This helps self-driving cars understand questions better and make decisions more effectively. |
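To make the two-module pipeline concrete, here is a minimal PyTorch sketch of one way it could work: rank visual tokens by cosine similarity to the query embedding, keep the top fraction, and later scatter the kept tokens back by index so a downstream step can enhance a full-size token map. All names (`select_tokens_by_query`, `recover_tokens`), the similarity-based scoring, and the zero-fill recovery are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def select_tokens_by_query(visual_tokens: torch.Tensor,
                           query_embedding: torch.Tensor,
                           keep_ratio: float = 0.25):
    """Keep only the visual tokens most similar to the text query.

    visual_tokens:   (N, D) patch tokens from a high-resolution frame
    query_embedding: (D,)   pooled embedding of the question
    Returns the kept tokens and their original indices.
    """
    # Score each visual token by cosine similarity to the query.
    v = F.normalize(visual_tokens, dim=-1)
    q = F.normalize(query_embedding, dim=0)
    scores = v @ q                                   # (N,)

    # Retain the top fraction of query-relevant tokens.
    k = max(1, int(keep_ratio * visual_tokens.shape[0]))
    _, top_idx = scores.topk(k)
    return visual_tokens[top_idx], top_idx

def recover_tokens(kept_tokens: torch.Tensor,
                   top_idx: torch.Tensor,
                   num_tokens: int) -> torch.Tensor:
    """Scatter kept tokens back to their original positions,
    zero-filling dropped slots so a later enhancement step (e.g.
    fusion with temporal features) can operate on the full grid."""
    full = kept_tokens.new_zeros(num_tokens, kept_tokens.shape[-1])
    full[top_idx] = kept_tokens
    return full

# Toy usage: a 32x32 patch grid (1024 tokens) reduced to 256 tokens.
tokens = torch.randn(1024, 768)
query = torch.randn(768)
kept, idx = select_tokens_by_query(tokens, query, keep_ratio=0.25)
restored = recover_tokens(kept, idx, num_tokens=1024)
print(kept.shape, restored.shape)  # (256, 768) and (1024, 768)
```

In this sketch, recording indices at selection time makes recovery a single scatter operation; the paper's actual recovery and enhancement module, which also mediates spatial-temporal interaction, is more involved than this zero-fill placeholder.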
Keywords
» Artificial intelligence » Alignment » Language model » Question answering » Token