An unsupervised learning framework for depth and ego-motion estimation from monocular videos

Overview

SfMLearner

This codebase implements the system described in the paper:

Unsupervised Learning of Depth and Ego-Motion from Video

Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe

In CVPR 2017 (Oral).

See the project webpage for more details. Please contact Tinghui Zhou ([email protected]) if you have any questions.

Prerequisites

This codebase was developed and tested with TensorFlow 1.0, CUDA 8.0, and Ubuntu 16.04.

Running the single-view depth demo

We provide the demo code for running our single-view depth prediction model. First, download the pre-trained model from this Google Drive and put the model files under models/. Then you can use the provided IPython notebook demo.ipynb to run the demo.
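
If you prefer a plain script over the notebook, the inference roughly follows the sketch below. This is only an outline of what demo.ipynb does; the SfMLearner method names, checkpoint filename, image path, and output keys are assumptions based on the notebook and may differ in your checkout.

    import scipy.misc
    import tensorflow as tf
    from SfMLearner import SfMLearner

    img_height, img_width = 128, 416
    ckpt_file = 'models/model-190532'          # assumed checkpoint name; use whatever you downloaded
    image_path = 'path/to/your/image.png'      # placeholder input image

    sfm = SfMLearner()
    sfm.setup_inference(img_height, img_width, mode='depth')

    saver = tf.train.Saver([var for var in tf.model_variables()])
    with tf.Session() as sess:
        saver.restore(sess, ckpt_file)
        img = scipy.misc.imresize(scipy.misc.imread(image_path), (img_height, img_width))
        pred = sfm.inference(img[None, :, :, :], sess, mode='depth')
        depth = pred['depth'][0, :, :, 0]      # assumed output key/shape: H x W depth map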

Preparing training data

In order to train the model using the provided code, the data needs to be formatted in a certain manner.

For KITTI, first download the dataset using this script provided on the official website, and then run the following command

python data/prepare_train_data.py --dataset_dir=/path/to/raw/kitti/dataset/ --dataset_name='kitti_raw_eigen' --dump_root=/path/to/resulting/formatted/data/ --seq_length=3 --img_width=416 --img_height=128 --num_threads=4

For the pose experiments, we used the KITTI odometry split, which can be downloaded here. Then set the --dataset_name option to kitti_odom when preparing the data, as in the example below.
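
For example (paths are placeholders; --seq_length=5 matches the 5-frame pose setup reported below, so adjust it if you want 3-frame snippets):

python data/prepare_train_data.py --dataset_dir=/path/to/kitti/odometry/dataset/ --dataset_name='kitti_odom' --dump_root=/path/to/resulting/formatted/data/ --seq_length=5 --img_width=416 --img_height=128 --num_threads=4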

For Cityscapes, download the following packages: 1) leftImg8bit_sequence_trainvaltest.zip, 2) camera_trainvaltest.zip. Then run the following command

python data/prepare_train_data.py --dataset_dir=/path/to/cityscapes/dataset/ --dataset_name='cityscapes' --dump_root=/path/to/resulting/formatted/data/ --seq_length=3 --img_width=416 --img_height=171 --num_threads=4

Notice that for Cityscapes the img_height is set to 171 because we crop out the bottom part of the image that contains the car logo, and the resulting image will have height 128.
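
Conceptually the crop works as sketched below (an assumption about the loader's behaviour, not code copied from the repo): each frame is first resized to 416x171, then the bottom 43 rows containing the hood/logo are discarded, leaving the 416x128 input used for training.

    def crop_cityscapes_bottom(img):
        # img: H x W x 3 array already resized to height 171
        assert img.shape[0] == 171
        return img[:128]  # keep the top 128 rows, drop the bottom 43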

Training

Once the data are formatted following the above instructions, you should be able to train the model by running the following command

python train.py --dataset_dir=/path/to/the/formatted/data/ --checkpoint_dir=/where/to/store/checkpoints/ --img_width=416 --img_height=128 --batch_size=4

You can then start a TensorBoard session with

tensorboard --logdir=/path/to/tensorflow/log/files --port=8888

and visualize the training progress by opening http://localhost:8888 in your browser. If everything is set up properly, you should start seeing reasonable depth predictions after ~100K iterations when training on KITTI.

Notes

After adding data augmentation and removing batch normalization (along with some other minor tweaks), we have been able to train depth models that perform better than originally reported in the paper, even without using additional Cityscapes data or the explainability regularization. The provided pre-trained model was trained on KITTI only, with the smoothness weight set to 0.5, and achieved the following performance on the Eigen test split (Table 1 of the paper):

Abs Rel Sq Rel RMSE RMSE(log) Acc.1 Acc.2 Acc.3
0.183 1.595 6.709 0.270 0.734 0.902 0.959
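
These columns are the standard single-view depth metrics from Eigen et al.; a generic sketch of how they are computed is shown below (this mirrors common implementations and is not guaranteed to match kitti_eval/eval_depth.py line for line).

    import numpy as np

    def compute_depth_errors(gt, pred):
        # gt, pred: 1-D arrays of valid ground-truth and (scaled) predicted depths
        thresh = np.maximum(gt / pred, pred / gt)
        acc1 = (thresh < 1.25).mean()
        acc2 = (thresh < 1.25 ** 2).mean()
        acc3 = (thresh < 1.25 ** 3).mean()
        abs_rel = np.mean(np.abs(gt - pred) / gt)
        sq_rel = np.mean(((gt - pred) ** 2) / gt)
        rmse = np.sqrt(np.mean((gt - pred) ** 2))
        rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
        return abs_rel, sq_rel, rmse, rmse_log, acc1, acc2, acc3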

When trained on 5-frame snippets, the pose model obtains the following performance on the KITTI odometry split (Table 3 of the paper):

Seq. 09 Seq. 10
0.016 (std. 0.009) 0.013 (std. 0.009)

Evaluation on KITTI

Depth

We provide evaluation code for the single-view depth experiment on KITTI. First, download our predictions (~140MB) from this Google Drive and put them into kitti_eval/.

Then run

python kitti_eval/eval_depth.py --kitti_dir=/path/to/raw/kitti/dataset/ --pred_file=kitti_eval/kitti_eigen_depth_predictions.npy

If everything runs properly, you should get the numbers for Ours(CS+K) in Table 1 of the paper. To get the numbers for Ours cap 50m (CS+K), set an additional flag --max_depth=50 when executing the above command.
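
For example:

python kitti_eval/eval_depth.py --kitti_dir=/path/to/raw/kitti/dataset/ --pred_file=kitti_eval/kitti_eigen_depth_predictions.npy --max_depth=50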

Pose

We provide evaluation code for the pose estimation experiment on KITTI. First, download the predictions and ground-truth pose data from this Google Drive.

Notice that all the predictions and ground truth are 5-frame snippets in the format timestamp tx ty tz qx qy qz qw, consistent with the TUM evaluation toolkit. Then you can run

python kitti_eval/eval_pose.py --gtruth_dir=/directory/of/groundtruth/trajectory/files/ --pred_dir=/directory/of/predicted/trajectory/files/

to obtain the results reported in Table 3 of the paper. For instance, to get the results of Ours for Seq. 10 you could run

python kitti_eval/eval_pose.py --gtruth_dir=kitti_eval/pose_data/ground_truth/10/ --pred_dir=kitti_eval/pose_data/ours_results/10/
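
The per-snippet metric is the Absolute Trajectory Error (ATE) after aligning the first frame and fitting a single scale factor to the predicted trajectory. A rough sketch, assuming the TUM-style files described above (this may differ in detail from kitti_eval/eval_pose.py):

    import numpy as np

    def load_xyz(traj_file):
        # each line: timestamp tx ty tz qx qy qz qw
        data = np.loadtxt(traj_file)
        return data[:, 1:4]  # keep the translations only

    def snippet_ate(gt_file, pred_file):
        gt = load_xyz(gt_file)
        pred = load_xyz(pred_file)
        gt = gt - gt[0]      # align the first frame
        pred = pred - pred[0]
        scale = np.sum(gt * pred) / np.sum(pred ** 2)  # least-squares scale fit
        return np.sqrt(np.mean(np.sum((gt - scale * pred) ** 2, axis=1)))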

KITTI Testing code

Depth

Once you have a trained model, you can obtain the single-view depth predictions on the KITTI Eigen test split, formatted properly for evaluation, by running

python test_kitti_depth.py --dataset_dir /path/to/raw/kitti/dataset/ --output_dir /path/to/output/directory --ckpt_file /path/to/pre-trained/model/file/

Pose

We also provide sample testing code for obtaining pose predictions on the KITTI dataset with a pre-trained model. You can obtain the predictions formatted as above for pose evaluation by running

python test_kitti_pose.py --test_seq [sequence_id] --dataset_dir /path/to/KITTI/odometry/set/ --output_dir /path/to/output/directory/ --ckpt_file /path/to/pre-trained/model/file/

A sample model trained on 5-frame snippets can be downloaded at this Google Drive.

Then you can obtain predictions on, say, Seq. 9 by running

python test_kitti_pose.py --test_seq 9 --dataset_dir /path/to/KITTI/odometry/set/ --output_dir /path/to/output/directory/ --ckpt_file models/model-100280

Other implementations

PyTorch (by Clément Pinard)

Disclaimer

This is the authors' implementation of the system described in the paper and not an official Google product.

Comments
  • TF 2.0 code

    Does anybody (@tinghuiz, @ClementPinard, @Huang-Jin) have TF 2.0 code for SfMLearner? I tried to rewrite everything from scratch; however, I have a few doubts:

    The model initially predicts depth only in the middle region of the image, and the prediction slowly spreads to other areas. This behaviour is not present in the original TF 1.0 code.

    If anybody is willing to review the code, I am happy to share it.

    opened by ezorfa 14
  • error on the 'Namespace' object has no attribute 'dump_root'

    Thanks for the code, but when I run prepare_train_data.py as python data/prepare_train_data.py --dataset_dir=/home/lli/tensorflow/SfMLearner/data/kitti/ --dataset_name='kitti_raw_eigen' --dump_root=/home/lli/tensorflow/SfMLearner/data/kitti/resulting/formatted/data/ --seq_length=3 --img_width=416 --img_height=128 --num_threads=4

    it fails with the error 'Namespace' object has no attribute 'dump_root'.

    opened by lixiangyu-1008 11
  • Precondition Error when is_training is set to false

    I noticed that when the depth test graph is being built, the is_training argument for disp_net is not set to False. Won't this negatively affect the test performance, as the batch normalization won't be configured properly?

    When setting this argument to False, an exception related to batch norm is raised:

    FailedPreconditionError: Attempting to use uninitialized value depth_net/upcnv3/BatchNorm/moving_mean
    	 [[Node: depth_net/upcnv3/BatchNorm/moving_mean/read = Identity[T=DT_FLOAT, _class=["loc:@depth_net/upcnv3/BatchNorm/moving_mean"], _device="/job:localhost/replica:0/task:0/gpu:0"](depth_net/upcnv3/BatchNorm/moving_mean)]]
    	 [[Node: depth_prediction/truediv/_131 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/cpu:0", send_device="/job:localhost/replica:0/task:0/gpu:0", send_device_incarnation=1, tensor_name="edge_459_depth_prediction/truediv", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/cpu:0"]()]]
    

    I get this when using the model that was provided in the "download_model.sh" script

    opened by TomRoussel 9
  • How to evaluate?

    Hi @tinghuiz,

    Thank you for sharing the training code with us. How should I evaluate based on the output? I observed that the depth output is rescaled to [0, 1]. How can I restore the absolute depth value for evaluation? (A median-scaling sketch follows this comment.)

    Thank you

    opened by thuyang 8
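
    A common way to recover absolute scale for evaluation is per-image median scaling, as done in the paper's evaluation. A minimal sketch, assuming NumPy arrays of predicted and ground-truth depth with the same shape; the clamping values follow common practice and are an assumption here:

        import numpy as np

        def median_scale(pred_depth, gt_depth, min_depth=1e-3, max_depth=80.0):
            # match the median of the prediction to the median of the ground truth
            mask = (gt_depth > min_depth) & (gt_depth < max_depth)
            scale = np.median(gt_depth[mask]) / np.median(pred_depth[mask])
            return np.clip(pred_depth * scale, min_depth, max_depth)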
  • how to predict pose with my data?

    I want to predict depth and pose from my own video. I have already converted the video to images and obtained depth successfully (using demo.ipynb), but I don't know how to predict pose from my data (I am just a beginner).

    opened by goeatsmall 7
  • White blank depth on my data

    Hi, thanks a lot for the amazing work and well-written code. I created my own driving dataset similar to KITTI, arranged it the same way as your prepared KITTI folder, and trained from scratch (matching the same image dimensions and scaling the intrinsic matrix accordingly). It seems to be training well and the pose estimation seems okay (the projection errors look good). However, my disparities are all white, i.e. they are very small values during training and they never converge, while the same code works on KITTI.

    More specifically, see the attached screenshots of the training visualizations.

    Clearly the depth values are still all close to zero, even though the network seems to be reducing the total loss and training well. One thing I also noticed is that my disparity values are not smoothly distributed during training like yours on KITTI.

    Could you please tell me what parameters I can change to help the network train? I don't understand why it doesn't train on a similar dataset. The only ground truth is the camera intrinsics, and if my intrinsics were wrong then my projection errors would also be wrong, right? Please help me find out what I am doing wrong. Thanks a lot for your help.

    opened by athus1990 7
  • How to interpolate ground-truth from sparse measurements? (about Figure 6 in the paper)

    Hi,

    I am wondering how to interpolate ground truth from sparse measurements, as mentioned in the paper. There is no explanation of that step. Could anyone teach me how? Thank you :) (One possible approach is sketched after this comment.)

    opened by keishatsai 7
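
    One plausible way to densify the sparse LiDAR ground truth for visualization (not necessarily the exact procedure used for Figure 6) is simple 2-D interpolation over the valid pixels:

        import numpy as np
        from scipy.interpolate import griddata

        def densify_sparse_depth(sparse_depth):
            # sparse_depth: H x W array with zeros at pixels that have no measurement
            h, w = sparse_depth.shape
            ys, xs = np.nonzero(sparse_depth > 0)
            values = sparse_depth[ys, xs]
            grid_y, grid_x = np.mgrid[0:h, 0:w]
            dense = griddata((ys, xs), values, (grid_y, grid_x), method='linear')
            # fill pixels outside the convex hull with nearest-neighbour values
            nearest = griddata((ys, xs), values, (grid_y, grid_x), method='nearest')
            dense[np.isnan(dense)] = nearest[np.isnan(dense)]
            return dense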
  • networks weight decay

    Hello, I have been trying to replicate your results in my own PyTorch implementation, but had some trouble converging with your hyperparameters.

    Especially, the weight decay you use seems very large to me : https://github.com/tinghuiz/SfMLearner/blob/master/nets.py#L27

    Weight decay is usually around 5e-5 to 5e-4, and here it is 0.05! When using it, my two networks just go to zero very quickly.

    As I am not very familiar with tf.slim, I have done some research, and I am not sure you actually apply weight regularization, since apparently you have to call slim.losses.get_total_loss() (see the sketch after this comment).

    This also corroborates the fact that trying to set the l2 regularization to extreme values (like 50.0) doesn't change anything.

    The good news is that, if weight decay is indeed not applied to your network, you might have something interesting to work on if you want to improve your results even further!

    Clément

    opened by ClementPinard 7
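
    To illustrate the point about tf.slim: the L2 penalties created by weights_regularizer only take effect if they are added to the loss that is actually minimized, e.g. via the REGULARIZATION_LOSSES collection or slim.losses.get_total_loss(). A minimal TF 1.x sketch with illustrative layer names, not the repo's code:

        import tensorflow as tf
        slim = tf.contrib.slim

        inputs = tf.placeholder(tf.float32, [None, 128, 416, 3])
        with slim.arg_scope([slim.conv2d],
                            weights_regularizer=slim.l2_regularizer(0.05)):
            feat = slim.conv2d(inputs, 16, [7, 7], stride=2, scope='cnv1')

        task_loss = tf.reduce_mean(tf.abs(feat))  # stand-in for the photometric loss
        # the regularizer terms live in this collection and are NOT applied
        # unless they are summed into the loss passed to the optimizer:
        reg_losses = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
        total_loss = task_loss + tf.add_n(reg_losses)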
  • Pretrained model and evaluation

    Hi @tinghuiz, thanks for releasing the code. It seems that you did not provide the full pipeline (training + testing + evaluation); for now you have just released the test results as .npy files, but not the testing code.

    I wonder if the provided pretrained model is the one you used in the paper; I want to use it to test the images in the Eigen test split and evaluate it. Also, can you provide the pretrained model for pose estimation? I want to see if the numbers are consistent with the paper.

    Thank you so much!

    opened by Yuliang-Zou 7
  • Trained your network locally, but my eval. result is not as good as yours

    Hi,

    Thanks for sharing your wonderful work. I followed your readme and everything seems to go well, except that I get a different evaluation result compared with yours (Ours (K) in Table 1 of your paper) when evaluating my locally trained weights. During training, I used the same parameters as suggested on this webpage.

    The metrics I got are:

    abs_rel sq_rel rms log_rms d1_all a1 a2 a3
    0.2621 3.6171 8.2036 0.3806 0.0000 0.6577 0.8520 0.9258

    I checked and reviewed the procedure, but I cannot find any explanation for the discrepancy. Let me know what you think.

    Regards, CJ

    opened by jeongc 7
  • problem when testing posenet?

    Hi @tinghuiz, I want to train the posenet on KITTI data_odometry_color, and I got the trained net. But when testing, I get the following problem: InvalidArgumentError: Assign requires shapes of both tensors to match. lhs shape=[7,7,15,16], rhs shape=[7,7,9,16].

    But this doesn't happen with your posenet. Did you train or save the posenet independently?

    opened by LinRui9531 6
  • question about dump_xyz in evaluate_pose.py

    The movement from this frame to the next frame should be applied by multiplying the new relative transformation into the current pose matrix, so theoretically, should the code be written as cam_to_world = np.dot(source_to_target_transformation, cam_to_world) instead of cam_to_world = np.dot(cam_to_world, source_to_target_transformation)?

    opened by Scarlett213 0
  • Pre-processing NYUv2 dataset

    Hi,

    I would like to ask whether we can prepare/pre-process the NYU v2 dataset using this codebase. I tried to pre-process it using Cityscapes as a baseline with mostly default values, added classes and functionality for the NYU dataset, and successfully ran the code to generate a pre-processed dataset. However, the results did not seem to be correct (the pre-processed and original example images were attached as screenshots).

    How can I make sure whether the dataset is pre-processed correctly? I tried training and the results were terrible, so I think I am missing something in this step.

    Thanks

    opened by mshaheryarmalik 0
  • Question about paper (projected coordinates formula)

    Hi,

    Thank you for this incredible work. I am trying to understand the projected-coordinates formula from your article (reproduced below). May I ask for a proof or references that could help me understand how this formula is derived? Thank you in advance for your reply.

    opened by anthonygofin 2
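
    For reference, the projected-coordinates relation in question is (up to notation) Eq. 2 of the paper, which warps a homogeneous pixel p_t in the target view into the source view using the predicted depth and relative camera transform:

        p_s \sim K \, \hat{T}_{t \to s} \, \hat{D}_t(p_t) \, K^{-1} \, p_t

    Intuitively, K^{-1} p_t back-projects the pixel into a viewing ray, scaling by \hat{D}_t(p_t) places it at the predicted depth in the target camera frame, \hat{T}_{t \to s} moves that 3-D point into the source camera frame, and K projects it back onto the source image plane.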
  • Is the dispnet output a depth map?

    The dispnet takes a target image as input and outputs pre_disp, and then the depth map is obtained as 1/pre_disp. So does that mean the dispnet generates an inverse-depth image (a transformed version of depth) rather than a true disparity map?

    opened by tasizhousong 0
  • How to get ATE values as in Table 3 in your paper

    Hi, I use eval_pose to evaluate the predicted pose, but the results are the ATE mean and std. How do I get the same values as in Table 3 of your paper? Do the values in this table mean the translation and rotation error, e.g. does 0.014 ± 0.008 mean that the translation error is 0.014 and the rotation error is 0.008? How do I get the same error format using your evaluation code?

    opened by RokiaAbdeen 0