SLAMANTIC - Leveraging Semantics to Improve VSLAM in Dynamic Environments - Naver Labs Europe

The accuracy of estimated camera poses in a visual simultaneous localization and mapping (VSLAM) algorithm relies on valid geometric representations of the observed environment.

During operation, a VSLAM algorithm extends its map (3D scene model) by adding new 3D measurements, which are generated by estimating the depth of image points captured from different viewpoints at different points in time.
If an object has moved between these points in time, triangulating its image points does not yield the correct distance from the camera, which in turn corrupts the estimated camera pose. To get reliable results, the environment therefore has to be either very distinctive or static. This shortcoming is the main challenge for VSLAM algorithms in real-world scenarios.
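The effect can be reproduced with a minimal two-view triangulation in 2D (an illustrative sketch, not code from the paper): rays from two known camera centres intersect at the true point only if the point stayed still between the two observations.

```python
import math

def bearing(c, p):
    """Unit direction from camera centre c toward point p."""
    dx, dy = p[0] - c[0], p[1] - c[1]
    n = math.hypot(dx, dy)
    return (dx / n, dy / n)

def triangulate(c1, d1, c2, d2):
    """Intersect rays c1 + t1*d1 and c2 + t2*d2 (2x2 solve via Cramer's rule)."""
    det = d1[0] * (-d2[1]) - (-d2[0]) * d1[1]
    bx, by = c2[0] - c1[0], c2[1] - c1[1]
    t1 = (bx * (-d2[1]) - (-d2[0]) * by) / det
    return (c1[0] + t1 * d1[0], c1[1] + t1 * d1[1])

c1, c2 = (0.0, 0.0), (1.0, 0.0)  # camera centres at two points in time
p = (3.0, 2.0)                    # true point position at time 1

# Static point: both observations agree, triangulation recovers p exactly.
static = triangulate(c1, bearing(c1, p), c2, bearing(c2, p))

# Moving point: the object shifted to (3.5, 2.0) before the second view;
# triangulating as if it were static yields a clearly wrong position.
moved = triangulate(c1, bearing(c1, p), c2, bearing(c2, (3.5, 2.0)))
```

Because pose estimation later minimizes reprojection error against such wrongly triangulated points, the resulting camera pose is biased as well, e.g. the backward-motion failure shown in Figure 1.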

To illustrate this, in the figure below a car is approaching a truck that has stopped at an intersection. When the truck starts moving again, the car falsely interprets this as backward motion.

Figure 1: A challenging scenario for VSLAM caused by dynamics: The car with the observing camera is approaching a truck which has stopped at a crossing (left column). When the truck gradually starts moving, the baseline VSLAM (middle) fails because it wrongly estimates a backward motion. Using semantics, our proposed approach (right) is able to cope with such situations.


To address this issue, we propose SLAMANTIC, which incorporates semantic information obtained by deep learning methods into the traditional geometric VSLAM pipeline. (SLAMANTIC – Leveraging Semantics to Improve VSLAM in Dynamic Environments, Matthias Schörghuber, Daniel Steininger (both AIT Austrian Institute of Technology), Margrit Gelautz (Vienna University of Technology), Workshop on Deep Learning for Visual SLAM at ICCV 2019, Seoul, South Korea, 27 October – 2 November 2019)

Specifically, we compute a confidence measure for each map point as a function of its semantic class (car, person, building, etc.) and its detection consistency over time. The confidence is then applied to guide the usage of each point in the mapping and localization stage. Points with high confidence are used to verify points with low confidence to select the final set of points for pose computation and mapping.
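A minimal sketch of this scheme is shown below; the class priors, the multiplicative weighting, and the threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Prior reliability per semantic class (hypothetical values): static structures
# score high, potentially dynamic objects score low.
CLASS_PRIOR = {"building": 0.9, "road": 0.9, "car": 0.3, "person": 0.1}

def confidence(semantic_class, observations):
    """Combine the class prior with how consistently the point was re-detected.

    `observations` is a list of booleans: True if the map point was
    successfully matched in a frame where it should have been visible.
    """
    prior = CLASS_PRIOR.get(semantic_class, 0.5)
    consistency = sum(observations) / len(observations) if observations else 0.0
    return prior * consistency

def split_by_confidence(points, threshold=0.5):
    """Partition map points: high-confidence points drive the pose estimate,
    low-confidence points must be verified against that pose before use."""
    high = [p for p in points if p["conf"] >= threshold]
    low = [p for p in points if p["conf"] < threshold]
    return high, low
```

In such a scheme, the high-confidence set yields a preliminary pose, and low-confidence points are re-admitted only if they remain consistent with that pose, matching the verification step described above.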

Evaluating our method on public datasets, we show that it can successfully solve challenging situations in dynamic environments which cause state-of-the-art baseline VSLAM algorithms to fail and that it maintains performance on static scenes.

Code on