Summary of VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection, by Zhe Wang et al.
VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection
by Zhe Wang, Siqi Fan, Xiaoliang Huo, Tongda Xu, Yan Wang, Jingjing Liu, Yilun Chen, Ya-Qin Zhang
First submitted to arXiv on: 20 Mar 2023
Categories
- Main: Computer Vision and Pattern Recognition (cs.CV)
- Secondary: None
GrooveSquid.com Paper Summaries
GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!
Summary difficulty | Written by | Summary |
---|---|---|
High | Paper authors | Read the original abstract here |
Medium | GrooveSquid.com (original content) | In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) combines multi-view camera images from vehicles and roadside infrastructure to give a more complete view of road conditions. To address the calibration noise and information loss that arise in VIC3D, the authors propose Vehicle-Infrastructure Multi-view Intermediate Fusion (VIMI), a 3D object detection framework. VIMI has three modules: Multi-scale Cross Attention (MCA) fuses vehicle and infrastructure features at selected scales to correct calibration noise, Camera-aware Channel Masking (CCM) uses camera parameters as priors to reweight the fused features, and Feature Compression (FC) reduces the size of the transmitted features for better efficiency. On the DAIR-V2X-C dataset, VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV, outperforming state-of-the-art early fusion and late fusion methods at comparable transmission cost. |
Low | GrooveSquid.com (original content) | Imagine cars talking to roadside cameras to make roads safer! This paper helps cars understand what’s happening on the road by combining camera views from vehicles and traffic infrastructure. This task is called “Vehicle-Infrastructure Cooperative 3D Object Detection”, or VIC3D for short. The big challenge is making sure all these cameras work together correctly, so the authors came up with a new method called VIMI. It uses attention to combine the camera views and compression to shrink the data that has to be sent between them. They tested the method on the DAIR-V2X-C dataset and showed that it works better than other methods with similar transmission costs. |
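
The fusion idea described in the medium difficulty summary can be illustrated with a toy example. The sketch below is not the authors' implementation: it is a minimal, single-scale, single-head cross attention in plain Python, followed by a sigmoid channel gate. The function names (`cross_attention_fuse`, `channel_mask`), the omission of learned projection matrices, and the hand-picked `gate_logits` (standing in for values a small network might derive from camera parameters) are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(veh, inf):
    """Fuse infrastructure features into vehicle features.

    veh: list of Nv vehicle feature vectors (queries), each of length C.
    inf: list of Ni infrastructure feature vectors (keys and values), each of length C.
    Learned query/key/value projections are omitted (assumption, for brevity).
    Returns (fused, attn): fused is Nv x C, attn is Nv x Ni.
    """
    scale = math.sqrt(len(veh[0]))
    # Scaled dot-product score between every vehicle and infrastructure vector.
    scores = [[sum(q * k for q, k in zip(qv, kv)) / scale for kv in inf] for qv in veh]
    attn = [softmax(row) for row in scores]
    # Attention-weighted sum of infrastructure vectors, added residually.
    fused = []
    for qv, weights in zip(veh, attn):
        attended = [sum(w * vv[c] for w, vv in zip(weights, inf)) for c in range(len(qv))]
        fused.append([q + a for q, a in zip(qv, attended)])
    return fused, attn


def channel_mask(feat, gate_logits):
    """Scale each feature channel by a sigmoid gate in (0, 1).

    gate_logits: one hypothetical logit per channel (assumption).
    """
    gates = [1.0 / (1.0 + math.exp(-g)) for g in gate_logits]
    return [[x * g for x, g in zip(row, gates)] for row in feat]


# Toy inputs: 2 vehicle vectors and 3 infrastructure vectors with 2 channels each.
veh = [[1.0, 0.0], [0.0, 1.0]]
inf = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused, attn = cross_attention_fuse(veh, inf)
masked = channel_mask(fused, gate_logits=[0.0, 0.0])  # zero logits give gates of 0.5

print(fused)
print(masked)
```

Each row of `attn` sums to 1, so every fused vehicle vector is the original vector plus a weighted average of the infrastructure vectors. A multi-scale version would repeat this fusion on feature maps of several resolutions, which this sketch does not attempt.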
Keywords
» Artificial intelligence » Attention » Cross attention » Object detection