

Zero-shot detection of buildings in mobile LiDAR using Language Vision Model

by June Moh Goo, Zichao Zeng, Jan Boehm

First submitted to arXiv on: 15 Apr 2024

Categories

  • Main: Computer Vision and Pattern Recognition (cs.CV)
  • Secondary: Artificial Intelligence (cs.AI)



GrooveSquid.com Paper Summaries

GrooveSquid.com’s goal is to make artificial intelligence research accessible by summarizing AI papers in simpler terms. Each summary below covers the same AI paper, written at different levels of difficulty. The medium difficulty and low difficulty versions are original summaries written by GrooveSquid.com, while the high difficulty version is the paper’s original abstract. Feel free to learn from the version that suits you best!

High Difficulty Summary (paper authors)
Read the original abstract here.

Medium Difficulty Summary (GrooveSquid.com, original content)
Recent advancements have shown that Language Vision Models (LVMs) outperform the current State-of-the-Art (SOTA) in two-dimensional (2D) computer vision tasks, leading researchers to explore applying LVMs to three-dimensional (3D) data. However, addressing 3D point clouds with LVMs is challenging due to issues such as feature extraction, large data sizes, and labelling costs, all of which limit dataset availability. To overcome these hurdles, our research transfers the problem from 3D to 2D: point clouds are converted to images via spherical projection, Grounded SAM is applied to the projected images, and the method’s effectiveness is evaluated on synthetic data. Our approach demonstrated high performance, achieving an accuracy of 0.96, IoU of 0.85, precision of 0.92, recall of 0.91, and F1 score of 0.92. The results confirm the potential of this method, but challenges persist, such as occlusion problems and pixel-level overlaps during spherical image generation.
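The key step in this pipeline is the 3D-to-2D transfer: each LiDAR point is mapped to image coordinates from its azimuth and elevation angles, producing a panoramic image that a 2D model such as Grounded SAM can segment. The sketch below shows one common way to do this in Python; the image resolution, the use of range as the pixel value, and the nearest-point-wins tie-breaking are illustrative assumptions, not the paper’s exact settings.

```python
import numpy as np

def spherical_projection(points, width=1024, height=256):
    """Project an (N, 3) LiDAR point cloud onto a 2D range image
    using a spherical (azimuth/elevation) mapping.

    Image size and pixel values are illustrative assumptions; the
    paper's exact projection parameters may differ.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)                  # range to each point
    azimuth = np.arctan2(y, x)                       # horizontal angle, [-pi, pi]
    elevation = np.arcsin(z / np.maximum(r, 1e-8))   # vertical angle

    # Normalise angles to pixel coordinates.
    u = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    v_min, v_max = elevation.min(), elevation.max()
    v = ((v_max - elevation) / (v_max - v_min + 1e-8) * (height - 1)).astype(int)

    # Where several points land on the same pixel, keep the nearest one
    # (drawing far points first, so nearer points overwrite them).
    image = np.full((height, width), np.inf)
    order = np.argsort(-r)
    image[v[order], u[order]] = r[order]
    image[np.isinf(image)] = 0.0                     # empty pixels -> 0
    return image

# Example: project a random synthetic cloud.
cloud = np.random.randn(10000, 3) * 10
range_image = spherical_projection(cloud)
```

Note how several points can collapse onto the same pixel during this mapping; that is exactly the pixel-level overlap issue the summary flags as a remaining challenge.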
Low Difficulty Summary (GrooveSquid.com, original content)
Researchers have been trying to use Language Vision Models (LVMs) for tasks that involve recognizing objects in three-dimensional space. However, it’s hard to make LVMs work well with 3D data because of some big challenges. For example, it’s tough to extract important features from 3D data, and collecting and labeling the data can be very expensive. This makes it difficult to find good datasets for testing models. To solve this problem, we developed a new way to turn 3D data into the 2D images that LVMs are already good at handling. Our approach worked really well, getting 96% of things right and correctly identifying most of the buildings. While this is a great start, there’s still more work to be done to make our method perfect.
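The scores quoted above are standard pixel-wise metrics for binary segmentation. As a minimal sketch (assuming boolean building masks of equal shape and non-degenerate predictions; this is not the authors’ released evaluation code), they can be computed as follows:

```python
import numpy as np

def binary_metrics(pred, gt):
    """Pixel-wise accuracy, IoU, precision, recall and F1 for binary
    building masks. `pred` and `gt` are boolean arrays of equal shape;
    assumes at least one positive pixel in each, so no division by zero."""
    tp = np.logical_and(pred, gt).sum()    # building predicted as building
    fp = np.logical_and(pred, ~gt).sum()   # background predicted as building
    fn = np.logical_and(~pred, gt).sum()   # building missed
    tn = np.logical_and(~pred, ~gt).sum()  # background correctly rejected

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, iou, precision, recall, f1
```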

Keywords

» Artificial intelligence  » F1 score  » Feature extraction  » Image generation  » Precision  » Recall  » SAM  » Synthetic data