RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching
This repository contains the source code for our paper:
RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching
Lahav Lipson, Zachary Teed and Jia Deng
@article{lipson2021raft,
  title={{RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching}},
  author={Lipson, Lahav and Teed, Zachary and Deng, Jia},
  journal={arXiv preprint arXiv:2109.07547},
  year={2021}
}
Requirements
The code has been tested with PyTorch 1.7 and CUDA 10.2.
conda env create -f environment.yaml
conda activate raftstereo
Required Data
To evaluate/train RAFT-Stereo, you will need to download the required datasets.
- Sceneflow (includes FlyingThings3D, Driving & Monkaa)
- Middlebury
- ETH3D
- KITTI
To download the ETH3D and Middlebury test datasets for the demos, run
chmod ug+x download_datasets.sh && ./download_datasets.sh
By default, stereo_datasets.py will search for the datasets in the locations listed below. You can create symbolic links from the datasets folder to wherever the datasets were actually downloaded (a sketch for creating the links follows the layout).
├── datasets
    ├── FlyingThings3D
        ├── frames_cleanpass
        ├── frames_finalpass
        ├── disparity
    ├── Monkaa
        ├── frames_cleanpass
        ├── frames_finalpass
        ├── disparity
    ├── Driving
        ├── frames_cleanpass
        ├── frames_finalpass
        ├── disparity
    ├── KITTI
        ├── testing
        ├── training
        ├── devkit
    ├── Middlebury
        ├── MiddEval3
    ├── ETH3D
        ├── lakeside_1l
        ├── ...
        ├── tunnel_3s
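If the datasets live elsewhere, the links can be created with a short script. A minimal sketch, assuming everything was downloaded under a single /path/to/downloads directory (both the root path and this layout are assumptions; adjust them to your setup):

import os

# Assumed download location -- replace with wherever the datasets actually live.
DOWNLOAD_ROOT = "/path/to/downloads"

os.makedirs("datasets", exist_ok=True)
for name in ["FlyingThings3D", "Monkaa", "Driving", "KITTI", "Middlebury", "ETH3D"]:
    src = os.path.join(DOWNLOAD_ROOT, name)
    dst = os.path.join("datasets", name)
    if os.path.isdir(src) and not os.path.exists(dst):
        os.symlink(src, dst)  # e.g. datasets/KITTI -> /path/to/downloads/KITTI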
Demos
Pretrained models can be downloaded by running
chmod ug+x download_models.sh && ./download_models.sh
or downloaded from Google Drive.
You can demo a trained model on pairs of images. To predict stereo for Middlebury, run
python demo.py --restore_ckpt models/raftstereo-sceneflow.pth
Or for ETH3D:
python demo.py --restore_ckpt models/raftstereo-eth3d.pth -l=datasets/ETH3D/*/im0.png -r=datasets/ETH3D/*/im1.png
Using our fastest model:
python demo.py --restore_ckpt models/raftstereo-realtime.pth --shared_backbone --n_downsample 3 --n_gru_layers 2 --slow_fast_gru
To save the disparity values as .npy files, run any of the demos with the --save_numpy flag.
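A minimal sketch for inspecting a saved disparity map (the filename below is a placeholder; the demo derives output names from the input images):

import numpy as np
import matplotlib.pyplot as plt

disp = np.load("im0.npy")  # placeholder filename; an H x W array of disparities in pixels
print("shape:", disp.shape, "min/max:", disp.min(), disp.max())

plt.imshow(np.abs(disp), cmap="jet")  # abs() in case the disparity is stored signed
plt.colorbar(label="disparity (pixels)")
plt.show()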
Converting Disparity to Depth
If the camera focal length and camera baseline are known, disparity predictions can be converted to depth values using

depth = focal_length * baseline / disparity

Note that the units of the focal length are pixels, not millimeters.
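A minimal sketch of this conversion, with placeholder calibration values (the focal length in pixels and the baseline in meters must come from your own camera setup):

import numpy as np

fx = 1000.0      # focal length in pixels (placeholder -- use your calibration)
baseline = 0.1   # camera baseline in meters (placeholder)

disp = np.abs(np.load("im0.npy"))  # placeholder filename; disparity in pixels

# depth = fx * baseline / disparity, guarding against zero disparity
depth = fx * baseline / np.maximum(disp, 1e-6)
print("depth range (m):", depth.min(), "to", depth.max())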
Evaluation
To evaluate a trained model on a validation set (e.g. Middlebury), run
python evaluate_stereo.py --restore_ckpt models/raftstereo-middlebury.pth --dataset middlebury_H
Training
Our model is trained on two RTX-6000 GPUs using the following command. Training logs will be written to runs/, which can be visualized using TensorBoard.
python train_stereo.py --batch_size 8 --train_iters 22 --valid_iters 32 --spatial_scale -0.2 0.4 --saturation_range 0 1.4 --n_downsample 2 --num_steps 200000 --mixed_precision
To train using significantly less memory, change --n_downsample 2 to --n_downsample 3. This will slightly reduce accuracy.
(Optional) Faster Implementation
We provide a faster CUDA implementation of the correlation volume which works with mixed precision feature maps.
cd sampler && python setup.py install && cd ..
Running demo.py, train_stereo.py, or evaluate_stereo.py with --corr_implementation reg_cuda together with --mixed_precision will speed up the model without impacting performance.
To significantly decrease memory consumption on high-resolution images, use --corr_implementation alt. This implementation is slower than the default, however.