
Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos

Published 15 Nov 2018 in cs.CV | arXiv:1811.06152v1

Abstract: Learning to predict scene depth from RGB inputs is a challenging task both for indoor and outdoor robot navigation. In this work we address unsupervised learning of scene depth and robot ego-motion where supervision is provided by monocular videos, as cameras are the cheapest, least restrictive and most ubiquitous sensor for robotics. Previous work in unsupervised image-to-depth learning has established strong baselines in the domain. We propose a novel approach which produces higher quality results, is able to model moving objects and is shown to transfer across data domains, e.g. from outdoors to indoor scenes. The main idea is to introduce geometric structure in the learning process, by modeling the scene and the individual objects; camera ego-motion and object motions are learned from monocular videos as input. Furthermore an online refinement method is introduced to adapt learning on the fly to unknown domains. The proposed approach outperforms all state-of-the-art approaches, including those that handle motion e.g. through learned flow. Our results are comparable in quality to the ones which used stereo as supervision and significantly improve depth prediction on scenes and datasets which contain a lot of object motion. The approach is of practical relevance, as it allows transfer across environments, by transferring models trained on data collected for robot navigation in urban scenes to indoor navigation settings. The code associated with this paper can be found at https://sites.google.com/view/struct2depth.

Citations (451)

Summary

  • The paper introduces a novel geometric structure-based model to estimate scene depth and camera motion from monocular videos without using depth sensors.
  • It integrates object motion modeling and an online refinement procedure to adapt predictions in dynamic and diverse environments.
  • Experimental results demonstrate reduced depth prediction errors and enhanced ego-motion accuracy on challenging datasets like KITTI and Cityscapes.

The paper "Depth Prediction Without the Sensors: Leveraging Structure for Unsupervised Learning from Monocular Videos" by Vincent Casser et al. presents an unsupervised approach to predicting scene depth and robot ego-motion from monocular video alone. The work matters because depth sensors are often impractical in many applications due to their cost and operating constraints. The authors introduce a technique that models both the geometry of the scene and the motion of individual objects, overcoming limitations of prior unsupervised methods.

Core Methodology

The approach leverages geometric structure to enhance depth prediction and ego-motion estimation. Unlike previous models that tend to falter in dynamic scenes due to motion handling issues, this method integrates an explicit 3D motion model. The model learns from monocular video sequences, predicting object motion as well as camera movements.
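The geometric core of such photometric self-supervision is differentiable warping: with a predicted depth and a predicted camera motion, each pixel of one frame can be reprojected into the next, and the photometric difference drives learning. The sketch below shows the reprojection step for a single pixel with a made-up intrinsic matrix; it is an illustration of the general pinhole-camera geometry, not the authors' implementation.

```python
import numpy as np

# Hypothetical camera intrinsics; the paper learns depth and ego-motion
# jointly and uses them to warp one frame into another.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

def reproject(u, v, depth, R, t):
    """Map pixel (u, v) with known depth from the source camera into the
    target camera, given relative rotation R and translation t."""
    p = np.array([u, v, 1.0])
    point_3d = depth * (K_inv @ p)   # back-project the pixel to 3D
    point_3d = R @ point_3d + t      # apply the camera's ego-motion
    proj = K @ point_3d              # project into the target camera
    return proj[:2] / proj[2]        # normalize homogeneous coordinates

# Sanity check: with identity motion, a pixel maps back onto itself.
u2, v2 = reproject(30.0, 40.0, depth=5.0, R=np.eye(3), t=np.zeros(3))
print(u2, v2)  # 30.0 40.0
```

In training, this mapping is applied densely to synthesize the target frame from the source frame, and the photometric reconstruction error supervises both the depth and the motion networks.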

Another critical innovation is the online refinement procedure. This process adapts the model to new environments or datasets on the fly, a capability not previously applied in this context. The adaptability allows the model to provide accurate predictions across varying environments, adding significant practical value to robot navigation tasks.
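The key idea of online refinement is that the self-supervised objective needs no labels, so it can keep running at test time: the model takes a few gradient steps on the loss computed from incoming frames before producing its prediction. The toy sketch below illustrates this adaptation loop on a one-parameter least-squares model; the model, data, and hyperparameters are invented for illustration and do not reflect the authors' code.

```python
# Toy sketch of online refinement: a model fit on a "source" domain takes
# a few gradient steps on the self-supervised loss over new-domain data
# before predicting. Values here are illustrative only.
def loss(w, xs, ys):
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def refine_online(w, xs, ys, steps=20, lr=0.05):
    """Run a few gradient steps of d(loss)/dw on the new-domain batch."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

w = 1.0                                 # slope learned on the source domain
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]  # target domain has slope 2.0
before = loss(w, xs, ys)
w = refine_online(w, xs, ys)
after = loss(w, xs, ys)
print(before, after)                    # error shrinks after a few steps
```

In the paper's setting, the analogous step fine-tunes the depth and motion networks on a short window of test frames, which is what enables the outdoor-to-indoor transfer results discussed below.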

Experimental Outcomes

The proposed method outperforms state-of-the-art techniques and approaches the quality of methods that use stereo supervision. Key quantitative results include:

  • A decrease in depth prediction errors compared to baseline models.
  • Improved ego-motion estimation accuracy on the KITTI dataset, surpassing algorithms that use longer-term temporal information.

The model's ability to predict depth in highly dynamic scenes, like those in the Cityscapes dataset, illustrates the efficacy of integrating object motion modeling. Notably, the results hold up in challenging cross-domain evaluations, such as transferring from outdoor urban scenes to indoor navigation scenarios.
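The object motion modeling credited for these dynamic-scene results warps the static background with the camera's ego-motion while each segmented object is warped with its own predicted motion, and the pieces are blended with instance masks. The sketch below illustrates that mask-based composition on toy arrays; the shapes and values are invented and this is not the authors' implementation.

```python
import numpy as np

# Illustrative composition of per-object and ego-motion warps using an
# instance mask (all values here are made up for the sketch).
ego_warped = np.full((4, 4), 0.5)   # scene warped by camera motion only
obj_warped = np.full((4, 4), 0.9)   # scene warped by one object's motion
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0                # instance mask of the moving object

# Inside the mask, use the object-motion warp; elsewhere, the ego-motion
# warp. Summing one such term per object handles multiple moving objects.
combined = mask * obj_warped + (1.0 - mask) * ego_warped
print(combined[2, 2], combined[0, 0])  # 0.9 0.5
```

Because moving objects no longer violate the static-scene assumption behind the photometric loss, their depth is supervised correctly rather than being pushed toward degenerate values.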

Implications and Future Directions

The implications of this research span both theoretical and practical realms. Theoretically, it challenges the reliance on direct sensor data for depth estimation, proposing a viable alternative that relies on available visual inputs. Practically, this can significantly reduce costs and increase the accessibility of autonomous systems in various applications.

Future developments could extend the online refinement mechanism to longer sequences, improving temporal coherence and consistency in depth predictions. Extensions toward full 3D reconstruction could also build on the robust depth and motion predictions, pushing the boundaries of autonomous navigation in unknown environments.

In conclusion, the study offers a significant advancement in unsupervised monocular depth prediction and robot navigation. By addressing motion within dynamic scenes and facilitating domain adaptation through online refinement, it sets a benchmark for future exploration in unsupervised learning methods.
