MVS2D: Efficient Multi-view Stereo via Attention-Driven 2D Convolutions


Project Page | Paper


If you find our work useful for your research, please consider citing our paper:

  author    = {Zhenpei Yang and
               Zhile Ren and
               Qi Shan and
               Qixing Huang},
  title     = {{MVS2D:} Efficient Multi-view Stereo via Attention-Driven 2D Convolutions},
  journal   = {CoRR},
  volume    = {abs/2104.13325},
  year      = {2021},
  url       = {},
  eprinttype = {arXiv},
  eprint    = {2104.13325},
  timestamp = {Tue, 04 May 2021 15:12:43 +0200},
  biburl    = {},
  bibsource = {dblp computer science bibliography,}

✏️ Changelog

Nov 27 2021

  • Initial release. Note that our released code achieves better results than those reported in the initial arXiv pre-print. In addition, we include an evaluation on the DTU dataset. We will update our paper soon.

⚙️ Installation


The code is tested with CUDA 10.1. Please use the following commands to install the dependencies:

conda create --name mvs2d python=3.7
conda activate mvs2d

pip install -r requirements.txt

The folder structure should look like the following if you have downloaded all data and pretrained models. Download links are inside each dataset tab at the end of this README.

├── configs
├── datasets
├── demo
├── networks
├── scripts
├── pretrained_model
│   ├── demon
│   ├── dtu
│   └── scannet
├── data
│   ├── DeMoN
│   ├── DTU_hr
│   ├── SampleSet
│   ├── ScanNet
│   └── ScanNet_3_frame_jitter_pose.npy
├── splits
│   ├── DeMoN_samples_test_2_frame.npy
│   ├── DeMoN_samples_train_2_frame.npy
│   ├── ScanNet_3_frame_test.npy
│   ├── ScanNet_3_frame_train.npy
│   └── ScanNet_3_frame_val.npy
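A quick way to confirm the layout before training is to check each expected entry on disk. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the path list is simply copied from the tree above.

```python
from pathlib import Path

# Paths copied from the folder tree above; adjust if your layout differs.
EXPECTED = [
    "configs", "datasets", "demo", "networks", "scripts",
    "pretrained_model/demon", "pretrained_model/dtu", "pretrained_model/scannet",
    "data/DeMoN", "data/DTU_hr", "data/SampleSet", "data/ScanNet",
    "data/ScanNet_3_frame_jitter_pose.npy",
    "splits/DeMoN_samples_test_2_frame.npy",
    "splits/DeMoN_samples_train_2_frame.npy",
    "splits/ScanNet_3_frame_test.npy",
    "splits/ScanNet_3_frame_train.npy",
    "splits/ScanNet_3_frame_val.npy",
]


def missing_paths(root="."):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).exists()]


if __name__ == "__main__":
    # Run from the repository root; prints nothing when everything is in place.
    for p in missing_paths():
        print("missing:", p)
```

If you only use one dataset, the entries for the other datasets can be dropped from `EXPECTED`.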

🎬 Demo


After downloading the pretrained models for ScanNet, run the following command to make a prediction on sample data.

python --cfg configs/scannet/release.conf

The results are saved as demo.png

Training & Testing

We use 4 NVIDIA V100 GPUs for training. You may need to modify 'CUDA_VISIBLE_DEVICES' and the batch size to accommodate your GPU resources.
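If you launch training from your own Python entry point rather than the provided shell scripts, the same restriction can be applied from inside the process. This is a small sketch under that assumption; `restrict_gpus` is a hypothetical helper, not something the repository provides.

```python
import os


def restrict_gpus(gpu_ids):
    """Expose only the listed GPU indices to CUDA for this process.

    This must run before any CUDA library initializes (for example, before
    the first CUDA call made through PyTorch); afterwards it has no effect.
    """
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Example: use two GPUs instead of four.
restrict_gpus([0, 1])
```

With fewer GPUs you will usually also need a smaller batch size; the right value depends on your GPU memory.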


ScanNet


data 🔗 split 🔗 pretrained models 🔗 noisy pose 🔗


First download and extract the ScanNet training data and split. Then run the following command to train our model.

bash scripts/scannet/

To train the multi-scale attention model, add --robust 1 to the training command in scripts/scannet/

To train our model with noisy input pose, add --perturb_pose 1 to the training command in scripts/scannet/


First download and extract data, split and pretrained models.

Then run:

bash scripts/scannet/

You should get results similar to the following:

abs_rel  sq_rel  log10  rmse   rmse_log  a1     a2     a3     abs_diff  abs_diff_median  thre1  thre3  thre5
0.059    0.016   0.026  0.157  0.084     0.964  0.995  0.999  0.108     0.079            0.856  0.974  0.996
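For readers unfamiliar with the column names, the sketch below computes the widely used definitions of `abs_rel`, `sq_rel`, `log10`, `rmse`, `rmse_log` and `a1`/`a2`/`a3`. It assumes the repository follows these standard definitions, which this README does not state. The function name `depth_metrics` is hypothetical, and `abs_diff`, `abs_diff_median` and `thre1`/`thre3`/`thre5` are left out because their exact definitions are not given here.

```python
import numpy as np


def depth_metrics(pred, gt):
    """Standard depth-error metrics over pixels with valid ground truth.

    Assumes `pred` is strictly positive wherever `gt` > 0.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    diff = pred - gt
    # Symmetric ratio used by the threshold accuracies a1, a2, a3.
    ratio = np.maximum(pred / gt, gt / pred)

    return {
        "abs_rel": np.mean(np.abs(diff) / gt),
        "sq_rel": np.mean(diff ** 2 / gt),
        "log10": np.mean(np.abs(np.log10(pred) - np.log10(gt))),
        "rmse": np.sqrt(np.mean(diff ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2)),
        "a1": np.mean(ratio < 1.25),
        "a2": np.mean(ratio < 1.25 ** 2),
        "a3": np.mean(ratio < 1.25 ** 3),
    }
```

Lower is better for the error columns (`abs_rel` through `rmse_log`), and higher is better for the accuracy columns (`a1` to `a3`).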


DeMoN


data 🔗 split 🔗 pretrained models 🔗


First download and extract the DeMoN training data and split. Then run the following command to train our model.

bash scripts/demon/


First download and extract data, split and pretrained models.

Then run:

bash scripts/demon/

You should get results similar to the following:

dataset rgbd: 160

abs_rel  sq_rel  log10  rmse   rmse_log  a1     a2     a3     abs_diff  abs_diff_median  thre1  thre3  thre5
0.082    0.165   0.047  0.440  0.147     0.921  0.939  0.948  0.325     0.284            0.753  0.894  0.933

dataset scenes11: 256

abs_rel  sq_rel  log10  rmse   rmse_log  a1     a2     a3     abs_diff  abs_diff_median  thre1  thre3  thre5
0.046    0.080   0.018  0.439  0.107     0.976  0.989  0.993  0.155     0.058            0.822  0.945  0.979

dataset sun3d: 160

abs_rel  sq_rel  log10  rmse   rmse_log  a1     a2     a3     abs_diff  abs_diff_median  thre1  thre3  thre5
0.099    0.055   0.044  0.304  0.137     0.893  0.970  0.993  0.224     0.171            0.649  0.890  0.969

-> Done!

abs_rel  sq_rel  log10  rmse   rmse_log  a1     a2     a3     abs_diff  abs_diff_median  thre1  thre3  thre5
0.071    0.096   0.033  0.402  0.127     0.938  0.970  0.981  0.222     0.152            0.755  0.915  0.963
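The last row is not labeled, but its values are consistent with an average over the three DeMoN subsets weighted by their sample counts (160, 256, 160). This is an inference from the numbers rather than something the README states. The sketch below reproduces three of the columns from the per-dataset rows; `weighted_mean` is a hypothetical helper.

```python
# Per-dataset sample counts and metrics, copied from the output above.
results = {
    "rgbd":     (160, {"abs_rel": 0.082, "rmse": 0.440, "a1": 0.921}),
    "scenes11": (256, {"abs_rel": 0.046, "rmse": 0.439, "a1": 0.976}),
    "sun3d":    (160, {"abs_rel": 0.099, "rmse": 0.304, "a1": 0.893}),
}


def weighted_mean(results, key):
    """Average one metric over datasets, weighted by sample count."""
    total = sum(n for n, _ in results.values())
    return sum(n * metrics[key] for n, metrics in results.values()) / total


for key in ("abs_rel", "rmse", "a1"):
    print(key, round(weighted_mean(results, key), 3))
# abs_rel 0.071
# rmse 0.402
# a1 0.938
```

Because the inputs are already rounded to three decimals, other columns may differ from the reported row in the last digit.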


DTU


data 🔗 eval data 🔗 pretrained models 🔗


First download and extract the DTU training data. Then run the following command to train our model.

bash scripts/dtu/


First download and extract DTU eval data and pretrained models.

The following command performs three steps in sequence: (1) generate depth predictions on the DTU test set, (2) fuse the depth predictions into a final point cloud, and (3) evaluate the predicted point cloud. Note that we re-implemented the original MATLAB evaluation for the DTU dataset in Python.

bash scripts/dtu/

You should get results similar to the following:

Acc 0.4051747996189477
Comp 0.2776021161518006
F-score 0.34138845788537414
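One observation about these numbers: the value printed as F-score matches the arithmetic mean of Acc and Comp (the quantity DTU evaluations often report as "overall"), not a harmonic mean. This is inferred from the output above, not from the evaluation code, and can be checked directly:

```python
acc = 0.4051747996189477   # accuracy, copied from the output above
comp = 0.2776021161518006  # completeness, copied from the output above

# Arithmetic mean of the two distances.
overall = (acc + comp) / 2
print(overall)
```

The printed value agrees with the reported F-score up to floating-point rounding.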


The fusion code for the DTU dataset is heavily based on PatchMatchNet.

  • The code is inconsistent with the paper


    Thank you for your excellent work. I have one question, though. According to the paper, depth hypotheses are made for the source image, and the similarity score is obtained by projecting them into the reference image. In the code, however, the depth hypotheses are set in the reference image's coordinate system to obtain 3D points, which are then transformed into the source image's coordinate system and used to sample the source image. Finally, the similarity score is obtained by a dot product with the reference image. This helps estimate the depth of the reference image rather than the source image. I don't understand why you make the depth hypotheses in the reference image's coordinate system and then project them.

    opened by cjd24-coder 7
  • Question about generalization


    Thanks for sharing the code.

    I wonder why you did not test your method on 7Scenes or Tanks and Temples.

    Also, I tried to test your method on 7Scenes, but the results are quite poor. Can you give me some suggestions?

    My code is here: at test_7scenes_long, as I use the split from Long.

    My results are here:

    7scenes:{'a1': 0.39494346839485406,
     'a2': 0.6459305167578225,
     'a3': 0.8022140821127047,
     'abs_diff': 0.5733749950019752,
     'abs_diff_median': 0.5094702287105953,
     'abs_rel': 0.32380217413691914,
     'log10': 0.17217029032900052,
     'rmse': 0.6894540686467114,
     'rmse_log': 0.47906266186605484,
     'sq_rel': 0.27828435772756005,
     'thre1': 0.23810647116173855,
     'thre3': 0.5398887457282459,
     'thre5': 0.8326691103244529}
    opened by Yannnnnnnnnnnn 2
  • Convergence speed


    Hi, I have some questions about the convergence speed of MVS2D.

    With cost-volume-based methods (MVSNet, etc.), the outlines of objects are visible after 1 epoch or even after a few iterations. [image: depth_estgt (9)]

    MVS2D, however, seems hard to converge, and the depth maps are very blurry at the beginning. [image: depth_estgt (35)]

    Is this due to the network design? Could you explain this phenomenon?

    opened by AddASecond 2
  • val set in not consistent with those in


    There is a point in your repo that is confusing for newbies(for me), that the val set in is not consistent with those in

    in val it is

    data_set = [
        3, 5, 17, 21, 28, 35, 37, 38, 40, 43, 56, 59, 66, 67, 82, 86,
        106, 117
    ]

    however in / it is

    scans = [
        1, 4, 9, 10, 11, 12, 13, 15, 23, 24, 29, 32, 33, 34, 48, 49, 62, 75,
        77, 110, 114, 118
    ]

    hope you could fix it, or at least this post may be helpful for users who want to reproduce your great work!

    opened by AddASecond 1
  • Question about Bts*


    Additionally, we use an asterisk sign ‘∗’ to denote an oracle version Bts∗, where we use the ground truth depth map to factor out the global scale.

    Does this mean that the monocular depth network's prediction may not have the same scale as the ground-truth depth map, and that Bts* estimates some parameters (e.g. the mean and std in (depth - mean) / std) to adjust its scale according to the ground-truth depth map?

    opened by 07hyx06 1
  • RuntimeError: CUDA error: an illegal memory access was encountered


    Hi, thanks for your great work. Could you please let me know what the following error is about? I searched but couldn't find a solution. I would appreciate any help. Thanks.

    Traceback (most recent call last):
      File "", line 72, in <module>
        outputs = model(imgs[0], imgs[1:], proj_mats[0], proj_mats[1:], inv_K_pool)
      File "/home/akarami/anaconda3/envs/mvs2d/lib/python3.7/site-packages/torch/nn/modules/", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/akarami/mvs2d/networks/", line 311, in forward
        src_imgs,
      File "/home/akarami/mvs2d/networks/", line 230, in epipolar_fusion
        k, proj_mask, grid = homo_warping(k, src_proj, ref_proj,depth_values)
      File "/home/akarami/mvs2d/networks/", line 72, in homo_warping
        proj = torch.matmul(src_proj, torch.inverse(ref_proj))
    RuntimeError: CUDA error: an illegal memory access was encountered
    CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

    opened by AliKaramiFBK 1
  • Saving color images


    In the '' code, the input color images are resized and transposed by (1,2,0).

    Why did you transpose the original input images when saving the "data" (which includes the original images) if opt.mode == 'test' and opt.save_prediction == True?

    opened by nicesonnday 1
  • What is the frame selection(sampling) criterion of ScanNet dataset?


    As mentioned in the MVS2D paper, the ScanNet dataset was subsampled from the original set into 86324 image triplets for training and 666 image triplets for testing.

    I would like to know the frame sampling criterion.

    opened by nicesonnday 1
  • 'tuple' object is not callable


    Hi, there is a bug in your code that causes the message 'tuple' object is not callable. The training process runs fine, but it keeps printing 'tuple' object is not callable.

    the log is:

    Training
    'tuple' object is not callable
    'tuple' object is not callable
    'tuple' object is not callable
    /mypath/anaconda3/lib/python3.8/site-packages/torch/optim/ UserWarning: Detected call of lr_scheduler.step() before optimizer.step(). In PyTorch 1.1.0 and later, you should call them in the opposite order: optimizer.step() before lr_scheduler.step(). Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at
      warnings.warn("Detected call of lr_scheduler.step() before optimizer.step(). "
    /mypath/anaconda3/lib/python3.8/site-packages/torch/optim/ UserWarning: The epoch parameter in scheduler.step() was not necessary and is being deprecated where possible. Please use scheduler.step() to step the scheduler. During the deprecation, if epoch is different from None, the closed form is used instead of the new chainable form, where available. Please open an issue if you are unable to replicate your use case:
      warnings.warn(EPOCH_DEPRECATION_WARNING, UserWarning)
    'tuple' object is not callable
    /mypath/anaconda3/lib/python3.8/site-packages/torch/nn/ UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /pytorch/c10/core/TensorImpl.h:1156.)
      return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
    'tuple' object is not callable
    'tuple' object is not callable
    'tuple' object is not callable

    opened by AddASecond 3
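For the first issue in the list above ("The code is inconsistent with the paper"), the scheme the poster attributes to the code can be sketched for a single pixel as follows: depth hypotheses are placed along the reference ray, the resulting 3D points are moved into the source camera and projected, source features are sampled there, and each sample is scored by a dot product with the reference feature. This is an illustration of that description only, not the repository's implementation. The function name and arguments are hypothetical, both views are assumed to share one intrinsics matrix, and nearest-neighbour sampling stands in for whatever interpolation the real code uses.

```python
import numpy as np


def score_depth_hypotheses(ref_feat, src_feat, K, R, t, pixel, depths):
    """Score depth hypotheses for one reference pixel.

    ref_feat, src_feat: (H, W, C) feature maps of the two views.
    K: (3, 3) intrinsics, assumed shared by both views.
    R, t: rotation (3, 3) and translation (3,) mapping reference-camera
          coordinates to source-camera coordinates.
    pixel: (u, v) integer location in the reference image.
    depths: (D,) depth hypotheses along the reference ray.
    Returns (D,) dot-product scores; -inf where the sample is invalid.
    """
    depths = np.asarray(depths, dtype=np.float64)
    H, W, _ = src_feat.shape
    u, v = pixel

    # Back-project the reference pixel to one 3D point per hypothesis.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    pts_ref = depths[:, None] * ray[None, :]          # (D, 3)

    # Move the points into the source camera and project them.
    pts_src = pts_ref @ R.T + t                       # (D, 3)
    proj = pts_src @ K.T                              # (D, 3)
    z = proj[:, 2]

    # Nearest-neighbour sampling; keep only points in front of the
    # source camera that land inside the source image.
    front = z > 1e-6
    us = np.full(len(depths), -1)
    vs = np.full(len(depths), -1)
    us[front] = np.rint(proj[front, 0] / z[front]).astype(int)
    vs[front] = np.rint(proj[front, 1] / z[front]).astype(int)
    inside = front & (us >= 0) & (us < W) & (vs >= 0) & (vs < H)

    # Similarity between the sampled source features and the reference feature.
    scores = np.full(len(depths), -np.inf)
    scores[inside] = src_feat[vs[inside], us[inside]] @ ref_feat[v, u]
    return scores
```

Under this scheme the highest-scoring hypothesis is a depth for the reference pixel, which is the point the poster raises: the output depth map is aligned with the reference view.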
Multi-Scale Geometric Consistency Guided Multi-View Stereo

ACMM [News] The code for ACMH is released!!! [News] The code for ACMP is released!!! About ACMM is a multi-scale geometric consistency guided multi-vi

Qingshan Xu 118 Jan 4, 2023
Code for "Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo"

Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo This repository includes the source code for our CVPR 2021 paper on multi-view mult

Jiahao Lin 66 Jan 4, 2023
Python scripts for performing stereo depth estimation using the high res stereo model in PyTorch.

PyTorch-High-Res-Stereo-Depth-Estimation Python scripts form performing stereo depth estimation using the high res stereo model in PyTorch. Stereo dep

Ibai Gorordo 26 Nov 24, 2022
RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching

RAFT-Stereo: Multilevel Recurrent Field Transforms for Stereo Matching This repository contains the source code for our paper: RAFT-Stereo: Multilevel

Princeton Vision & Learning Lab 328 Jan 9, 2023
Planar Prior Assisted PatchMatch Multi-View Stereo

ACMP [News] The code for ACMH is released!!! [News] The code for ACMM is released!!! About This repository contains the code for the paper Planar Prio

Qingshan Xu 127 Dec 31, 2022
Code release of paper "Deep Multi-View Stereo gone wild"

Deep MVS gone wild Pytorch implementation of "Deep MVS gone wild" (Paper | website) This repository provides the code to reproduce the experiments of

François Darmon 53 Dec 24, 2022
[ICCV 2021 Oral] NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo

NerfingMVS Project Page | Paper | Video | Data NerfingMVS: Guided Optimization of Neural Radiance Fields for Indoor Multi-view Stereo Yi Wei, Shaohui

Yi Wei 369 Dec 24, 2022
COLMAP - Structure-from-Motion and Multi-View Stereo

COLMAP About COLMAP is a general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline with a graphical and command-line interface.

null 4.7k Jan 7, 2023
Blender add-on: Add to Cameras menu: View → Camera, View → Add Camera, Camera → View, Previous Camera, Next Camera

Blender add-on: Camera additions In 3D view, it adds these actions to the View|Cameras menu: View → Camera : set the current camera to the 3D view Vie

German Bauer 11 Feb 8, 2022
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

Phil Wang 180 Jan 5, 2023
Attention-driven Robot Manipulation (ARM) which includes Q-attention

Attention-driven Robotic Manipulation (ARM) This codebase is home to: Q-attention: Enabling Efficient Learning for Vision-based Robotic Manipulation I

Stephen James 84 Dec 29, 2022
Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes

Stereo Radiance Fields (SRF): Learning View Synthesis for Sparse Views of Novel Scenes

null 111 Dec 29, 2022
[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Y

null 118 Dec 26, 2022
Official code for "Stereo Waterdrop Removal with Row-wise Dilated Attention (IROS2021)"

Stereo-Waterdrop-Removal-with-Row-wise-Dilated-Attention This repository includes official codes for "Stereo Waterdrop Removal with Row-wise Dilated A

null 29 Oct 1, 2022
Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

Phil Wang 189 Nov 22, 2022
[AAAI 2021] MVFNet: Multi-View Fusion Network for Efficient Video Recognition

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

Wenhao Wu 114 Nov 27, 2022
MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021)

MVFNet: Multi-View Fusion Network for Efficient Video Recognition (AAAI 2021) Overview We release the code of the MVFNet (Multi-View Fusion Network).

null 2 Jan 29, 2022
This repository contains a pytorch implementation of "StereoPIFu: Depth Aware Clothed Human Digitization via Stereo Vision".

StereoPIFu: Depth Aware Clothed Human Digitization via Stereo Vision | Project Page | Paper | This repository contains a pytorch implementation of "St

null 87 Dec 9, 2022
Code release for our paper, "SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo"

SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo Thomas Kollar, Michael Laskey, Kevin Stone, Brijen Thananjeyan

null 68 Dec 14, 2022