
BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving

by Tao Tang, Dafeng Wei, Zhengyu Jia, Tian Gao, Changwei Cai, Chengkai Hou, Peng Jia, Kun Zhan, Haiyang Sun, Jingchen Fan, Yixing Zhao, Fu Liu, Xiaodan Liang, Xianpeng Lang, Yang Wang

First submitted to arXiv on: 2 Jan 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. The summaries below all cover the same AI paper, each written at a different level of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (written by the paper authors)
Read the original abstract here.

Medium Difficulty Summary (original content by GrooveSquid.com)
The proposed BEV-TSR framework leverages descriptive text to retrieve the corresponding scenes in Bird’s Eye View (BEV) space, addressing challenges such as the lack of a global feature representation and the inadequate text retrieval ability of existing methods for complex driving scenes. The framework employs a large language model (LLM) to extract semantic features from the text inputs and incorporates knowledge graph embeddings to enrich the language embeddings. A Shared Cross-modal Embedding bridges the gap between BEV features and language embeddings, and caption generation tasks further enhance the alignment; a minimal sketch of the retrieval step follows below. Experimental results on nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance: 85.78% top-1 accuracy for scene-to-text retrieval and 87.66% for text-to-scene retrieval. The proposed framework has the potential to improve data optimization for autonomous driving.
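
The paper’s full pipeline is not reproduced here, but the retrieval step described above (project global BEV features and LLM text embeddings into one shared space, then rank by similarity) can be sketched as follows. All dimensions, the projection heads, and the random stand-in features are illustrative assumptions, not the authors’ implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes; the summary does not specify any of these values.
BEV_DIM, TEXT_DIM, SHARED_DIM = 256, 768, 512

# Hypothetical projection heads mapping each modality into the shared
# cross-modal embedding space the summary describes.
bev_proj = torch.nn.Linear(BEV_DIM, SHARED_DIM)
text_proj = torch.nn.Linear(TEXT_DIM, SHARED_DIM)

# Stand-ins for real encoder outputs: global BEV features for 100 scenes
# and LLM text embeddings (optionally enriched with knowledge-graph
# embeddings) for 5 queries.
bev_features = torch.randn(100, BEV_DIM)
text_features = torch.randn(5, TEXT_DIM)

# Project both modalities into the shared space and L2-normalize, so a
# dot product equals cosine similarity.
scene_emb = F.normalize(bev_proj(bev_features), dim=-1)
query_emb = F.normalize(text_proj(text_features), dim=-1)

# Text-to-scene retrieval: rank every scene for each text query and
# take the best match (scene-to-text retrieval transposes the matrix).
similarity = query_emb @ scene_emb.T    # shape (5, 100)
top1_scene = similarity.argmax(dim=-1)  # best-matching scene per query
print(top1_scene)
```

In practice, such a shared space is usually trained with a contrastive objective over matched scene-caption pairs; the caption generation task mentioned above would then serve as an auxiliary alignment signal.
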
Low Difficulty Summary (original content by GrooveSquid.com)
Imagine you’re trying to find specific scenes in a big dataset of autonomous driving footage. This is hard because most current methods aren’t good at understanding complex descriptions of what happens in those scenes. The researchers propose a new way to do this, using text descriptions as input and combining them with computer vision techniques. They also create a new dataset to test their method. The results show that the approach works really well (85.78% and 87.66% top-1 accuracy), which matters for making autonomous vehicles better.

Keywords

» Artificial intelligence  » Alignment  » Embedding  » Knowledge graph  » Large language model  » Optimization