PyTorch implementation of paper "Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes", CVPR 2021


Neural Scene Flow Fields

[Project Website] [Paper] [Video]


The code is tested with Python 3, PyTorch >= 1.6, and CUDA >= 10.2. The dependencies include:

  • configargparse
  • matplotlib
  • opencv
  • scikit-image
  • scipy
  • cupy
  • imageio
  • tqdm

Video preprocessing

  1. Download from link an example input video with SfM camera poses and intrinsics estimated by COLMAP. (Note: you need to run the COLMAP "colmap image_undistorter" command to undistort the input images and produce the "dense" folder shown in the example; this dense folder should contain the "images" and "sparse" folders.)

  2. Download the single-view depth prediction model "" from link, and put it in the folder "nsff_scripts".

  3. Run the following commands to generate required inputs for training/inference:

    # Usage
    cd nsff_scripts
    # Create the camera intrinsics/extrinsics format for NSFF, the same as the original NeRF, which uses a script from the LLFF code:
    python --data_path "/home/xxx/Neural-Scene-Flow-Fields/kid-running/dense/"
    # Resize input images and run single view model
    python --data_path "/home/xxx/Neural-Scene-Flow-Fields/kid-running/dense/" --input_w 640 --input_h 360 --resize_height 288
    # Run the optical flow model (for easy setup and PyTorch version consistency, we use RAFT as the backbone optical flow model, but it should be easy to switch to other models such as PWC-Net or FlowNet2.0)
    python --model models/raft-things.pth --data_path /home/xxx/Neural-Scene-Flow-Fields/kid-running/dense/ --epi_threhold 1.0 --input_flow_w 768 --input_semantic_w 1024 --input_semantic_h 576
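
Before running the preprocessing commands, it can help to confirm that the undistorted "dense" folder has the layout described in step 1. The following is a minimal sketch, not part of the repository; the helper name check_dense_folder is hypothetical, and it only checks that the "images" and "sparse" subfolders exist.

```python
from pathlib import Path


def check_dense_folder(dense_dir):
    """Return the expected subfolders that are missing from a COLMAP 'dense' folder.

    Hypothetical helper: it only tests for the 'images' and 'sparse'
    subfolders named in step 1 and does not inspect their contents.
    """
    dense = Path(dense_dir)
    required = ("images", "sparse")
    return [name for name in required if not (dense / name).is_dir()]


if __name__ == "__main__":
    # Example path taken from the commands above; adjust it to your own setup.
    missing = check_dense_folder("/home/xxx/Neural-Scene-Flow-Fields/kid-running/dense/")
    if missing:
        print("Missing subfolders:", ", ".join(missing))
    else:
        print("Folder layout looks as expected.")
```

An empty result only means the two folders are present; it does not verify that COLMAP produced valid poses.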

Rendering from an example pretrained model

  1. Download the pretrained model "" from link. Unzip it and place the checkpoint at "nsff_exp/logs/kid-running_ndc_5f_sv_of_sm_unify3_F00-30/360000.tar".

Set datadir in configs/config_kid-running.txt to the root directory of the input video. Then go to the directory "nsff_exp":

   cd nsff_exp
  1. Rendering of fixed time, viewpoint interpolation
   python --config configs/config_kid-running.txt --render_bt --target_idx 10

By running the example command, you should get the following result: [image]

  2. Rendering of fixed viewpoint, time interpolation
   python --config configs/config_kid-running.txt --render_lockcam_slowmo --target_idx 8

By running the example command, you should get the following result: [image]

  3. Rendering of space-time interpolation
   python --config configs/config_kid-running.txt --render_slowmo_bt  --target_idx 10

By running the example command, you should get the following result: [image]


  1. In configs/config_kid-running.txt, change expname to any name you like (different from the original one), then run the following command to train the model:
    python --config configs/config_kid-running.txt

Per-scene training takes ~2 days on 2 NVIDIA V100 GPUs.

  2. Several parameters in the config files that you might need to know to train a good model:
  • N_samples: to render images at a higher resolution, you have to increase the number of sampled points.
  • start_frame, end_frame: the training frame range. The default model usually works for videos of 1-2 s. Training on longer sequences can cause over-smoothed renderings. To mitigate this, you can increase the capacity of the network by increasing netwidth (but this can drastically increase training time and memory usage).
  • decay_iteration: the number of iterations in the initialization stage. Data-driven losses decay every 1000*decay_iteration steps. It is usually good to set decay_iteration to the number of training frames.
  • no_ndc: our current implementation only supports reconstruction in NDC space, meaning it only works for forward-facing scenes, like the original NeRF. It should not be hard to adapt it to Euclidean space, though.
  • use_motion_mask, num_extra_sample: whether to use the estimated coarse motion segmentation mask for hard-mining sampling during the initialization stage, and how many extra samples to draw during that stage.
  • w_depth, w_optical_flow: weights of the losses for the single-view depth and geometry consistency priors described in the paper.
  • w_cycle: weight of the scene flow cycle consistency loss.
  • w_sm: weight of the scene flow smoothness loss.
  • w_prob_reg: weight of the disocclusion weight regularization.
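
The relation between the frame range and decay_iteration can be written down as a small calculation. This is only an illustrative sketch: the function name decay_schedule is hypothetical, and it assumes that end_frame is exclusive, so that the number of training frames is end_frame - start_frame (consistent with the "F00-30" experiment name, but not confirmed here).

```python
def decay_schedule(start_frame, end_frame):
    """Suggest decay_iteration and the resulting decay interval in steps.

    Assumption: end_frame is exclusive, so the clip contains
    end_frame - start_frame training frames.
    """
    num_frames = end_frame - start_frame
    # Rule of thumb from the list above: match decay_iteration to the frame count.
    decay_iteration = num_frames
    # Data-driven losses decay every 1000 * decay_iteration steps.
    decay_interval_steps = 1000 * decay_iteration
    return decay_iteration, decay_interval_steps


if __name__ == "__main__":
    decay_iteration, interval = decay_schedule(start_frame=0, end_frame=30)
    print(f"decay_iteration = {decay_iteration}, losses decay every {interval} steps")
```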

Evaluation on the Dynamic Scene Dataset

  1. Download the Dynamic Scene dataset "" from link.

  2. Download the pretrained models "" from link, unzip them, and put them in the folder "nsff_exp/logs/".

  3. Run the following command for each scene to get quantitative results reported in the paper:

   # Usage: configs/config_xxx.txt indicates each scene name such as config_balloon1-2.txt in nsff/configs
   python --config configs/config_xxx.txt
  • Note: you have to use the modified LPIPS implementation included in this branch to measure the LPIPS error for the dynamic region only, as described in the paper.


The code is based on implementations from several prior works:


This repository is released under the MIT license.


If you find our code/models useful, please consider citing our paper:

  title={Neural Scene Flow Fields for Space-Time View Synthesis of Dynamic Scenes},
  author={Li, Zhengqi and Niklaus, Simon and Snavely, Noah and Wang, Oliver},
  journal={arXiv preprint arXiv:2011.13084},
    Question about image warping and blending weight

    Hi, the idea of warping and using a blending weight to predict how to merge static and dynamic results is a good one; however, I have been confused by some operations in the implementation. Initially I was not sure whether these confusing parts lead to bad results, but now that I see several issues concerning the performance on custom datasets, I decided to post this issue, hoping to provide some insights that could potentially solve some of them.

    1. Predicting the blending weight with the static network seems really counterintuitive, since dynamic objects move inside the scene independently of the static network. A more reasonable way is to predict this weight with the dynamic network, as suggested by this paper. An even simpler way is to remove the blending weight entirely and use the addition strategy as in NeRF-W. I ran a short experiment on the difference between these two blending strategies (blending weight vs. addition) and found that addition produces better reconstruction and novel view results. Although this finding might not hold for all data, I think predicting the blending weight with the static network is not the ideal way to go.
    2. For image warping, you do blending at the current timestamp but only render the dynamic parts of t-1 and t+1. This is really confusing: basically it means there are two rendering pipelines, and the network has to learn to maximize the performance of both. It is hard to tell exactly how this causes problems, but I suspect that it reduces the final performance of the pipeline that uses blending. In my opinion, if having the static network gives better results, you should use blending both in the current-time rendering and in image warping. Again, I don't know whether changing this yields better results, but this part is confusing.

    I would like to know @zhengqili's opinion on these points, and maybe suggest that users try these modifications to see if they solve some problems.

    opened by kwea123 10
    Correct approach to separate static and dynamic regions

    Hi, I have originally mentioned this issue in #1, but it seems to deviate from the original question, so I decided to open a new issue.

    As discussed in #1, I tried setting the raw_blend_w to either 0 or 1 to create "static only" and "dynamic only" images that theoretically would look like the Fig. 5 in the paper and in the video. However, this approach seems to be wrong because from the result, the static part looks ok-ish but the dynamic part is almost everything, which is not good at all (we want only the moving part, e.g. only the running kid).

    I have been testing this for a week while waiting for a response, but still to no avail. @zhengqili @sniklaus @snavely @owang Sorry for bothering you, but could any of the authors kindly clarify what's wrong with my approach of separating static/dynamic by setting the blending weight to either 0 or 1? I also tried blending the sigmas (opacity in the code) instead of the alphas as in the paper, or directly using rgb_map_ref_dy as the output image, but neither helped.

    I have applied the above approach to other pretrained scenes, but none of them produces good results.

    Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).

    Left: static (raw_blend_w=0). Right: dynamic (raw_blend_w=1).

    I believe there's something wrong with my approach, but I cannot figure it out. I would really appreciate it if the authors could kindly point out the correct approach. Thank you very much.

    opened by kwea123 8
    Other data and the motion mask accuracy

    Hi, thanks for the code! Do you plan to publish the full data (running kid, and other data you used in the paper other than the NVIDIA ones) as well?

    In fact, the thing I'd like to check the most is the accuracy of your motion masks. I'd like to know if it's really possible to let the network learn to separate the background and the foreground by only providing the "coarse mask" that you mentioned in the supplementary.

    For example for the bubble scene on the project page, how accurate should the mask be to clearly separate the bubbles from the background like you showed? Have you also experimented on the influence of the mask quality, i.e. if masks are more coarse (larger), then how well can the model separate bg/fg?

    opened by kwea123 7
    error when run the with trained model


    Hello, I really appreciate your awesome work!

    However, I get an error when I try to run the file with our trained model.

    I think the declaration of --chain_sf is missing in the file.

    I used the given configuration file for the kid-running scene.

    Here is the script that I ran:

    python --datadir /data1/dogyoon/neural_sceneflow_data/nerf_data/kid-running/dense/ --expname Default_Test --config configs/config_kid-running.txt

    Should I add the argument to the file, or is there another way to solve this problem?


    opened by dogyoonlee 3
    Coordinate System Operations

    Hi team, amazing paper!

    I am trying to adapt your model to regularise the volume-rendered scene flow against a monocular scene flow estimator from a different paper. The third-party scene flow estimator produces results in world coordinates (not normalised) so for me to compare against this model's scene flow, which is in NDC space, I need to transform one coordinate system to the other. This has me confused by some of the code you use to do coordinate system conversions.

    1. Firstly, in the supplementary document, and in the code, you reference "euclidean space". I couldn't find anything online about whether this is world space or camera space. Could you please clarify?
    2. The supplementary document references the NDC ray space derivation from the NeRF paper. That derivation outlines how to convert points from camera space (o) to NDC space (o'): [image: screenshot of the derivation]. Following this, I found this function, which appears to do the inverse operation of eq. 25 above. That is, it converts from NDC to what I can only conclude must be camera space. However, when I look at its invocation, the variable name suggests that this function converts from NDC to world coordinates.
    3. Following this, the pipeline to project from 3d NDC to a 2d image has me quite confused: a) I assume se3_transform_points converts from world space to camera space, is that correct? b) Why do you perform the perspective projection from camera space? Everything I have read online seems to perform perspective projection from either world coordinates / ndc.

    Generally, it would be very helpful to me if you could point me to where you obtained the operations for perspective_projection, se3_transform_points and NDC2Euclidean.

    My graphics knowledge is limited so apologies if these questions are trivial. Your help is greatly appreciated :)
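
For readers following the question above, the NDC mapping from the NeRF paper's appendix and its inverse can be sketched in a few lines. This is an illustration under stated assumptions, not the repository's code: it assumes a camera-space point with the camera looking down the negative z axis and a near plane at 1, and the function names euclidean_to_ndc and ndc_to_euclidean are hypothetical. The inverse uses the same algebra that NDC2Euclidean appears to implement, without the epsilon term.

```python
import math


def euclidean_to_ndc(x, y, z, H, W, f, near=1.0):
    """Map a camera-space point (z < 0 in front of the camera) to NDC."""
    x_ndc = -(2.0 * f / W) * (x / z)
    y_ndc = -(2.0 * f / H) * (y / z)
    z_ndc = 1.0 + 2.0 * near / z
    return x_ndc, y_ndc, z_ndc


def ndc_to_euclidean(x_ndc, y_ndc, z_ndc, H, W, f):
    """Inverse of euclidean_to_ndc for near = 1 (no epsilon guard)."""
    z = 2.0 / (z_ndc - 1.0)
    x = -x_ndc * z * W / (2.0 * f)
    y = -y_ndc * z * H / (2.0 * f)
    return x, y, z


if __name__ == "__main__":
    H, W, f = 288, 512, 400.0  # hypothetical image size and focal length
    point = (0.3, -0.2, -4.0)  # hypothetical camera-space point
    ndc = euclidean_to_ndc(*point, H, W, f)
    recovered = ndc_to_euclidean(*ndc, H, W, f)
    print("NDC:", ndc)
    print("Recovered:", recovered)
    print("Round trip OK:", all(math.isclose(a, b) for a, b in zip(point, recovered)))
```

Under these assumptions the round trip recovers the original point, which suggests the inverse returns camera-space coordinates; whether the repository then treats them as world coordinates depends on the pose applied afterwards.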

    opened by rohaldb 3
    Running on my dataset

    Hi, I want to run NSFF on my dataset. When I run the preprocessing section on my dataset, the motion_masks are all white. Does that mean there are some issues with my dataset, or that I cannot run it on my dataset? How can I solve it? Thanks!

    opened by Carinazhao22 3
    Evaluation Set


    I found the actually use the training images. Could you please share how to get the exact numbers in Table 3 of your paper? More specifically, how do I know which are

    the remaining 11 held-out images per time instance for evaluation


    opened by fuqichen1998 2
    urlopen error [Errno 111]

    I get the following error while loading the pre-trained ResNet when I run from a remote server. However, it worked on my local machine (with Python 3.9.12). On the server I use Python 3.7.4, and I also tried Python 3.8.5 with the same result.

    Traceback (most recent call last):
      File "", line 267, in <module>
        args.resize_height)
      File "", line 158, in run
        model = MidasNet(model_path, non_negative=True)
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/", line 30, in __init__
        self.pretrained, self.scratch = _make_encoder(features, use_pretrained)
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/", line 6, in _make_encoder
        pretrained = _make_pretrained_resnext101_wsl(use_pretrained)
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_scripts/models/", line 26, in _make_pretrained_resnext101_wsl
        resnet = torch.hub.load("facebookresearch/WSL-Images", "resnext101_32x8d_wsl")
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/", line 403, in load
        repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose, skip_validation)
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/", line 170, in _get_cache_or_reload
        repo_owner, repo_name, branch = _parse_repo_info(github)
      File "/cluster/project/infk/courses/252-0579-00L/group34_nerf/CloudNeRF/other_papers/Neural-Scene-Flow-Fields/nsff_venv/lib64/python3.7/site-packages/torch/", line 124, in _parse_repo_info
        with urlopen(f"{repo_owner}/{repo_name}/tree/main/"):
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 222, in urlopen
        return, data, timeout)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 525, in open
        response = self._open(req, data)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 543, in _open
        '_open', req)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 503, in _call_chain
        result = func(*args)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 1360, in https_open
        context=self._context, check_hostname=self._check_hostname)
      File "/cluster/apps/nss/python/3.7.4/x86_64/lib64/python3.7/urllib/", line 1319, in do_open
        raise URLError(err)
    urllib.error.URLError: <urlopen error [Errno 111] Connection refused>

    opened by ChlaegerIO 1
    README Demo Not Working


    I believe there is an issue with the demo as outlined in the readme. Specifically, when trying to use the pre-trained model to render any of the 3 interpolation methods - time, viewpoint, or both - the resulting images (as found in nsff_exp/logs/kid-running_ndc_5f_sv_of_sm_unify3_testing_F00-30/<interpolation_dependent_name>/images) come out looking like this:

    [image] or [image], depending on which form of interpolation I run.

    I have a colab file set up that obtains the above result. It pulls from a forked repository which has only two changes: I add a requirements.txt file and update the data_dir in nsff_exp/configs/config_kid-running.txt.

    Have I missed something, or am I correct in saying the demo is broken? Thanks!

    opened by rohaldb 1
    Evaluation metrics

    Hi, I am wondering if there is a standard for SSIM and LPIPS.

    For SSIM: I see you use the scikit-image implementation. When I use kornia's implementation with window size = 11 (I don't know what size scikit-image uses if it's not set), it seems to yield a different result. Do you have any idea what other authors use?

    For LPIPS:

    1. Do the other authors also use alexnet?
    2. The network expects the RGB values to be scaled to [-1, 1]. If they are in [0, 1], it seems that you need to pass the argument normalize=True, so I'm afraid the evaluation you have is not exactly correct.
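
The rescaling mentioned in point 2 is a simple affine map from [0, 1] to [-1, 1]. The sketch below only illustrates that map on a single value; the helper name to_lpips_range is hypothetical, and to my understanding passing normalize=True to the lpips package applies the same 2x - 1 rescaling internally, though that should be checked against the version in use.

```python
def to_lpips_range(value):
    """Map a color value from [0, 1] to [-1, 1] via 2 * value - 1."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("expected a value in [0, 1]")
    return 2.0 * value - 1.0


if __name__ == "__main__":
    for v in (0.0, 0.25, 1.0):
        print(v, "->", to_lpips_range(v))
```

Feeding [0, 1] images without this rescaling changes the input distribution the network sees, so scores computed with and without it are not comparable.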

    It makes me think that if there's no common standard for these metrics that might differ from one implementation to the other, or that sometimes the authors make mistake in the evaluation process, then only the PSNR score is credible...

    opened by kwea123 1
    Multi-gpu training

    In readme you say it takes 2 days on 2 V100 gpus, but I don't see any option setting the number of gpus to use in Does it mean this code only supports single gpu training?

    opened by kwea123 1
    Question about Least Kinetic Motion Prior

    Hi, I was wondering why

    sf_sm_loss += args.w_sm * compute_sf_lke_loss(ret['raw_pts_ref'], 
                                                        H, W, focal)

    is called twice at the following two lines?:

    Should compute_sf_lke_loss compute the $L_{temp}$ term?

    Thank you!

    opened by rliu100 0
    Faster Training.

    Hello, Do you have any suggestions for making the training faster given GPUs with more memory? I'm working with 2 A6000s and would like to fully leverage the memory capacity.

    opened by jonathanhyunmoon 0
    Singularity in NDC2Euclidean

    NDC2Euclidean appears to be attempting to prevent a divide-by-zero error by the addition of an epsilon value:

    def NDC2Euclidean(xyz_ndc, H, W, f):
        z_e = 2./ (xyz_ndc[..., 2:3] - 1. + 1e-6)
        x_e = - xyz_ndc[..., 0:1] * z_e * W/ (2. * f)
        y_e = - xyz_ndc[..., 1:2] * z_e * H/ (2. * f)
        xyz_e = torch.cat([x_e, y_e, z_e], -1)
        return xyz_e

    However, since the coordinates have scene flow field vectors added to them, and the scene flow field output ranges over (-1.0, 1.0), xyz_ndc can end up significantly outside the normal range. This means that a divide-by-zero can still happen in the above code if the z value hits (1.0 - 1e-6), where the denominator z - 1 + 1e-6 vanishes, which it does in our training.

    We suggest clamping the NDC z values to the range (-1.0, 0.99), with 0.99 chosen to prevent the Euclidean far plane from getting too large. This clamping has significantly stabilized our training in early iterations:

    z_e = 2./ (torch.clamp(xyz_ndc[..., 2:3], -1.0, 0.99) - 1.0)
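
The effect of the two variants can be compared on scalar values without PyTorch. This is a minimal sketch with hypothetical function names, mirroring only the z_e line of each variant; unlike the tensor version, a zero denominator is mapped to negative infinity explicitly because plain Python raises an exception on division by zero.

```python
def euclidean_z_eps(z_ndc, eps=1e-6):
    """Epsilon-guarded variant, mirroring z_e = 2 / (z_ndc - 1 + eps)."""
    denom = z_ndc - 1.0 + eps
    # Plain Python raises ZeroDivisionError, so return -inf for a zero denominator.
    return 2.0 / denom if denom != 0.0 else float("-inf")


def euclidean_z_clamped(z_ndc, z_min=-1.0, z_max=0.99):
    """Clamped variant, mirroring z_e = 2 / (clamp(z_ndc, -1, 0.99) - 1)."""
    z = min(max(z_ndc, z_min), z_max)
    return 2.0 / (z - 1.0)


if __name__ == "__main__":
    # z values near or beyond the far plane, as can happen after adding scene flow.
    for z_ndc in (0.5, 1.0 - 1e-6, 1.5):
        print(z_ndc, euclidean_z_eps(z_ndc), euclidean_z_clamped(z_ndc))
```

With the clamp, the magnitude of z_e is bounded by roughly 200 (the value at z_ndc = 0.99), whereas the epsilon-guarded variant can return extremely large values or flip sign once z_ndc exceeds 1.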

    opened by geoffreymantel 0
    How would you recommend adapting NSFF to non-forward facing scenes?


    First of all, thank you for releasing the implementation of your amazing project. The question I wanted to ask is how one would adapt NSFF to support reconstruction in Euclidean space, thereby extending it to also work on non-forward-facing scenes.

    In other words, which parts of the codebase would I need to modify to enable the codebase to run on such scenes? I'm guessing just setting the "no_ndc" flag to "True" inside the config file wouldn't be enough.

    opened by andrewsonga 1
    RuntimeError: stack expects each tensor to be equal size & AttributeError: 'NoneType' object has no attribute 'shape'

    #local run
    colmap feature_extractor \
    --database_path ./database.db --image_path ./dense/images/
    colmap exhaustive_matcher \
    --database_path ./database.db
    colmap mapper \
    --database_path ./database.db \
    --image_path ./dense/images \
    --output_path ./dense/sparse
    colmap image_undistorter \
    --image_path ./dense/images \
    --input_path ./dense/sparse/0 \
    --output_path ./dense \
    --output_type COLMAP \
    --max_image_size 2000

    #colab run

    %cd /content/drive/MyDrive/neural-net
    !git clone
    %cd Neural-Scene-Flow-Fields
    !pip install configargparse
    !pip install matplotlib
    !pip install opencv
    !pip install scikit-image
    !pip install scipy
    !pip install cupy
    !pip install imageio.
    !pip install tqdm
    !pip install kornia

    My images are 288x512 pixels.

    %cd /content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nsff_scripts/
        # create camera intrinsics/extrinsic format for NSFF, same as original NeRF where it uses script from the LLFF code:
    !python --data_path "/content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense"
        # Resize input images and run single view model, 
        # argument resize_height: resized image height for model training, width will be resized based on original aspect ratio
    !python --data_path "/content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense"  --resize_height 512
    !bash ./
        # Run optical flow model
    !python --model models/raft-things.pth --data_path /content/drive/MyDrive/neural-net/Neural-Scene-Flow-Fields/nerf_data/bolli/dense


    Traceback (most recent call last):
      File "", line 448, in <module>
      File "", line 350, in run_optical_flows
        images = load_image_list(images)
      File "", line 257, in load_image_list
        images = torch.stack(images, dim=0)
    RuntimeError: stack expects each tensor to be equal size, but got [3, 512, 288] at entry 0 and [3, 512, 287] at entry 31

    So input_w is not consistent, even though all my images have dimensions 288x512.

    Even if I modify the script:

    def load_image(imfile):
        long_dim = 512
        img = np.array(
        # Portrait Orientation
        if img.shape[0] > img.shape[1]:
            input_h = long_dim
            input_w = 288

    The dimension error is gone, but another error appears:

    flow input w 288 h 512
    /usr/local/lib/python3.7/dist-packages/torch/ UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at  ../aten/src/ATen/native/TensorShape.cpp:2157.)
      return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
    Traceback (most recent call last):
      File "", line 448, in <module>
      File "", line 363, in run_optical_flows
        (img_train.shape[1], img_train.shape[0]), 
    AttributeError: 'NoneType' object has no attribute 'shape'
    opened by bartman081523 1
Zhengqi Li
CS Ph.D. student at Cornell University/Cornell Tech
