Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

Ravi Garg

Last update: Nov 28, 2022

Related tags

Deep Learning Unsupervised_Depth_Estimation

Overview

Realtime Unsupervised Depth Estimation from an Image

This is the caffe implementation of our paper "Unsupervised CNN for single view depth estimation: Geometry to the rescue" published in ECCV 2016 with minor modifications. In this variant, we train the network end-to-end instead of in coarse to fine manner with deeper network (Resnet 50) and TVL1 loss instead of HS loss.

With the implementation we share the sample Resnet50by2 model trained on KITTI training set:

https://github.com/Ravi-Garg/Unsupervised_Depth_Estimation/blob/master/model/train_iter_40000.caffemodel

Shared model is a small variant of the 50 layer residual network from scratch on KITTI. Our model is <25 MB and predicts depths on 160x608 resolution images at over 30Hz on Nvidia Geforce GTX980 (50Hz on TITAN X). It can be used with caffe without any modification and we provide a simple matlab wrapper for testing.

Click on the image to watch preview of the results on youtube:

If you use our model or the code for your research please cite:

@inproceedings{garg2016unsupervised,
  title={Unsupervised CNN for single view depth estimation: Geometry to the rescue},
  author={Garg, Ravi and Kumar, BG Vijay and Carneiro, Gustavo and Reid, Ian},
  booktitle={European Conference on Computer Vision},
  pages={740--756},
  year={2016},
  organization={Springer}
}

Training Procedure

This model was trained on 23200 raw stereo pairs of KITTI taken from city, residential and road sequences. Images from other sequences of KITTI were left untouched. A subset of 697 images from 28 sequences froms the testset, leaving the remaining 33 sequences from these categories which can be used for training.

To use the same training data use the splits spacified in the file 'train_test_split.mat'.

Our model is trained end-to-end from scratch with adam solver (momentum1 = 0.9 , momentom2 = 0.999, learning rate =10e-3 ) for 40,000 iterations on 4 gpus with batchsize 14 per GPU. This model is a pre-release further tuning of hyperparameters should improve results. Only left-right flips as described in the paper were used to train the provided network. Other agumentations described in the paper and runtime shuffle were not used but should also lead to performance imrovement.

Here is the training loss recorded per 20 iterations:

Note: We have resized the KITTI images to 160x608 for training - which changes the aspect ratio of the images. Thus for proper evaluation on KITTI the images needs to be resized to this resolution and predicted disparities should be scaled by a factor of 608/width_of_input_image before computing depth. For ease in citing the results for further publications, we share the performance measures.

Our model gives following results on KITTI test-set without any post processing:

RMSE(linear): 4.400866

RMSE(log) : 0.233548

RMSE(log10) : 0.101441

Abs rel diff: 0.137796

Sq rel diff : 0.824861

accuracy THr 1.25 : 0.809765

accuracy THr 1.25 sq: 0.935108

accuracy THr 1.25 cube: 0.974739

The test-set consists of 697 images which was used in https://www.cs.nyu.edu/~deigen/depth/kitti_depth_predictions.mat Depth Predictions were first clipped to depth values between 0 and 50 meters and evaluated only in the region spacified in the given mask.

#Network Architecture

Architecture of our networks closely follow Residual networks scheme. We start from resnet 50 by 2 architecture and have replaced strided convolutions with 2x2 MAX pooling layers like VGG. The first 7x7 convolution with stride 2 is replaced with the 7x7 convolution with no stride and the max-pooled output at ½ resolution is passed through an extra 3x3 convolutional (128 features)->relu->2x2 pooling block. Rest of the network followes resnet50 with half the parameters every layer.

For dense prediction we have followed the skip-connections as specified in FCN and our ECCV paper. We have introduced a learnable scale layer with weight decay 0.01 before every 1x1 convolution of FCN skip-connections which allows us to merge mid-level features more efficiently by:

Adaptively selecting the mid-level features which are more correlated to depth of the scene.
Making 1x1 convolutions for projections more stable for end to end training.

Further analysis and visualizations of learned features will be released shortly on the arxiv: https://arxiv.org/pdf/1603.04992v2.pdf

Using the code

To train and finetune networks on your own data, you need to compile caffe with additional:

“AbsLoss” layer for L1 loss minimization,
“Warping” layer for image warpping given flow
and modified "filler.hpp" to compute image gradient with convolutions which we share here.

License

For academic usage, the code is released under the permissive BSD license. For any commercial purpose, please contact the authors.

Contact

Please report any known issues on this thread of to [email protected]

Comments

Unknown layer type: Warping

I got the latest version caffe and set WITH_PYTHON_LAYER := 1 in makefile.config. However, I still get the following error:

===

F0325 16:49:50.756420 6840 layer_factory.hpp:81] Check failed: registry.count(type) == 1 (0 vs. 1) Unknown layer type: Warping (known types: AbsVal, Accuracy, ArgMax, BNLL, BatchNorm, BatchReindex, Bias, Clip, Concat, ContrastiveLoss, Convolution, Crop, Data, Deconvolution, Dropout, DummyData, ELU, Eltwise, Embed, EuclideanLoss, Exp, Filter, Flatten, HDF5Data, HDF5Output, HingeLoss, Im2col, ImageData, InfogainLoss, InnerProduct, Input, LRN, LSTM, LSTMUnit, Log, MVN, MemoryData, MultinomialLogisticLoss, PReLU, Parameter, Pooling, Power, Python, RNN, ReLU, Reduction, Reshape, SPP, Scale, Sigmoid, SigmoidCrossEntropyLoss, Silence, Slice, Softmax, SoftmaxWithLoss, Split, Swish, TanH, Threshold, Tile, WindowData)

=== Is there anything I missed?

opened by kspeng 2
where do you get the camera`s focal?

i see the 'KITTI_vis_test_seq.m' that you offered in the program, you seem to set (focal f * distance B) equals to 389.6304，i know the B is 0.54 meter in the kitti dataset paper, but how/where do you get the camera`s focal f?

opened by lxp349899201 1
Prediction result recovery

Hi,

I downloaded your prediction result mat, but I can not recover it correctly. I got a 142x697 image from predictions in mat. I plot it our but it's quite weird. All the images look like this example:

Is there anything I have to do before I save the data to png?

opened by kspeng 1
generate_depth

Hi,Thanks for your great sharing! I want to generate the depth of kitti ,I have no idea about it even if i have read the development kit provided by the website(http://www.cvlibs.net/datasets/kitti/raw_data.php). could you please share your code about generating the depth iamges? Thank you！

opened by liu666666 1
Load matlab network

Hello, I'm trying to load a traiend network and use it to depth estimation, however the file train_iter_40000.caffemodel requires a .prototxt' to be imported in matlab using importCaffeNetwork. I tried with the .prototxt files such as resnet50.prototxt, but it says Check if your network contains layers which are not defined in caffe.proto.. Thanjs for your help.

opened by jdanielhoyos 0

Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

Related tags

Overview

Realtime Unsupervised Depth Estimation from an Image

Training Procedure

Our model gives following results on KITTI test-set without any post processing:

Using the code

License

Contact

Comments

Unknown layer type: Warping

where do you get the camera`s focal?

Prediction result recovery

generate_depth

Load matlab network

Owner

Ravi Garg

Light-weight network, depth estimation, knowledge distillation, real-time depth estimation, auxiliary data.

Aerial Single-View Depth Completion with Image-Guided Uncertainty Estimation (RA-L/ICRA 2020)

PN-Net a neural field-based framework for depth estimation from single-view RGB images.

The implemention of Video Depth Estimation by Fusing Flow-to-Depth Proposals

Monocular Depth Estimation Using Laplacian Pyramid-Based Depth Residuals

ESTDepth: Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks (CVPR 2021)

Blender add-on: Add to Cameras menu: View → Camera, View → Add Camera, Camera → View, Previous Camera, Next Camera

[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

Toward Realistic Single-View 3D Object Reconstruction with Unsupervised Learning from Multiple Images (ICCV 2021)

Code for "Single-view robot pose and joint angle estimation via render & compare", CVPR 2021 (Oral).

NFT-Price-Prediction-CNN - Using visual feature extraction, prices of NFTs are predicted via CNN (Alexnet and Resnet) architectures.

Geometry-Free View Synthesis: Transformers and no 3D Priors

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency[ECCV 2020]

Implementation of ICCV19 Paper "Learning Two-View Correspondences and Geometry Using Order-Aware Network"

Beyond Image to Depth: Improving Depth Prediction using Echoes (CVPR 2021)

Python script for performing depth completion from sparse depth and rgb images using the msg_chn_wacv20. model in ONNX

Python script for performing depth completion from sparse depth and rgb images using the msg_chn_wacv20. model in Tensorflow Lite.

Data-depth-inference - Data depth inference with python

Re-implementation of the Noise Contrastive Estimation algorithm for pyTorch, following "Noise-contrastive estimation: A new estimation principle for unnormalized statistical models." (Gutmann and Hyvarinen, AISTATS 2010)