Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Overview

This is the repository for Space-Time Correspondence as a Contrastive Random Walk, published at NeurIPS 2020.

[Paper] [Project Page] [Slides] [Poster] [Talk]

@inproceedings{jabri2020walk,
    Author = {Allan Jabri and Andrew Owens and Alexei A. Efros},
    Title = {Space-Time Correspondence as a Contrastive Random Walk},
    Booktitle = {Advances in Neural Information Processing Systems},
    Year = {2020},
}

Consider citing our work or acknowledging this repository if you found this code to be helpful :)

Requirements

  • pytorch (>1.3)
  • torchvision (0.6.0)
  • cv2
  • matplotlib
  • skimage
  • imageio

For visualization (--visualize):

  • wandb
  • visdom
  • sklearn

Train

An example training command is:

python -W ignore train.py --data-path /path/to/kinetics/ \
--frame-aug grid --dropout 0.1 --clip-len 4 --temp 0.05 \
--model-type scratch --workers 16 --batch-size 20  \
--cache-dataset --data-parallel --visualize --lr 0.0001

This yields a model with performance on DAVIS as follows (see below for evaluation instructions), provided as pretrained.pth:

 J&F-Mean    J-Mean  J-Recall  J-Decay    F-Mean  F-Recall   F-Decay
  0.67606  0.645902  0.758043   0.2031  0.706219   0.83221  0.246789

Arguments of interest:

  • --dropout: The rate of edge dropout (default 0.1).
  • --clip-len: Length of video sequence.
  • --temp: Softmax temperature.
  • --model-type: Type of encoder. Use scratch or scratch_zeropad if training from scratch. Use imagenet18 to load an Imagenet-pretrained network. Use scratch with --resume if reloading a checkpoint.
  • --batch-size: I've managed to train models with batch sizes between 6 and 24. If you can afford a larger batch size, consider increasing the --lr from 0.0001 to 0.0003.
  • --frame-aug: grid samples a grid of patches to get nodes; none will just use a single image and use embeddings in the feature map as nodes (a conceptual sketch of the grid mode follows this list).
  • --visualize: Log diagnostics to wandb and data visualizations to visdom.
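
To illustrate the grid mode, here is a minimal, hypothetical sketch of how a frame can be cut into a grid of overlapping patches that become the nodes of the space-time graph. The patch size and stride below are assumptions for illustration, not the repository's exact augmentation code:

import torch
import torch.nn.functional as F

def sample_patch_grid(frame, patch_size=64, stride=32):
    # frame: (C, H, W). Extract a grid of overlapping patches; each patch is one node.
    c, h, w = frame.shape
    patches = F.unfold(frame.unsqueeze(0), kernel_size=patch_size, stride=stride)
    n = patches.shape[-1]                        # number of patches in the grid
    return patches.transpose(1, 2).reshape(n, c, patch_size, patch_size)

frame = torch.randn(3, 256, 256)
nodes = sample_patch_grid(frame)
print(nodes.shape)                               # torch.Size([49, 3, 64, 64]) -> a 7x7 grid of nodes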

Data

We use the official torchvision.datasets.Kinetics400 class for training. You can find directions for downloading Kinetics here. In particular, the code expects the path given for kinetics to contain a train_256 subdirectory.

You can also provide --data-path with a file containing a list of image directories, or a path to a directory of directories of images. In this case, clips are randomly subsampled from the directory.
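
For reference, here is a hypothetical sketch of the kind of layout these two modes are meant to handle; the paths and names below are illustrative assumptions, not a required convention. Either a plain-text file listing one frame directory per line,

/data/frames/video_0001
/data/frames/video_0002

or a directory whose subdirectories each contain the frames of one video,

/data/frames/
    video_0001/00000.jpg, 00001.jpg, ...
    video_0002/00000.jpg, 00001.jpg, ...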

Visualization

By default, the training script will log diagnostics to wandb and data visualizations to visdom.

Pretrained Model

You can find the model resulting from the training command above at pretrained.pth. We are still training updated ablation models and will post them when ready.


Evaluation: Label Propagation

The label propagation algorithm is described in test.py. The output of test.py (predicted label maps) must be post-processed for evaluation.
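
As a rough guide to what the propagation step does, here is a minimal, hedged sketch of attention-based label propagation. It is illustrative only: it omits details of test.py such as the spatial radius restriction and the context-frame queue, and the function and tensor names are made up:

import torch

def propagate_labels(feat_tgt, feats_ctx, lbls_ctx, topk=10, temperature=0.05):
    # feat_tgt:  (C, N) L2-normalized features of the target frame (N nodes)
    # feats_ctx: (C, M) features of the context frames, concatenated
    # lbls_ctx:  (M, K) soft label distributions of the context nodes (K classes)
    aff = feat_tgt.t() @ feats_ctx / temperature             # (N, M) affinities
    vals, idx = aff.topk(topk, dim=-1)                       # keep top-k context nodes per target node
    weights = torch.softmax(vals, dim=-1)                    # (N, topk) attention weights
    return (weights.unsqueeze(-1) * lbls_ctx[idx]).sum(1)    # (N, K) propagated labels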

DAVIS

To evaluate a trained model on the DAVIS task, clone the davis2017-evaluation repository, and prepare the data by downloading the 2017 dataset and modifying the paths provided in eval/davis_vallist.txt. Then, run:

Label Propagation:

python test.py --filelist /path/to/davis/vallist.txt \
--model-type scratch --resume ../pretrained.pth --save-path /save/path \
--topk 10 --videoLen 20 --radius 12  --temperature 0.05  --cropSize -1

Though test.py expects a model file created with train.py, it can easily be modified to be used with other networks. Note that we simply use the same temperature used at training time.

You can also run the ImageNet baseline with the command below.

python test.py --filelist /path/to/davis/vallist.txt \
--model-type imagenet18 --save-path /save/path \
--topk 10 --videoLen 20 --radius 12  --temperature 0.05  --cropSize -1

Post-Process:

# Convert
python eval/convert_davis.py --in_folder /save/path/ --out_folder /converted/path --dataset /davis/path/

# Compute metrics
python /path/to/davis2017-evaluation/evaluation_method.py \
--task semi-supervised   --results_path /converted/path --set val \
--davis_path /path/to/davis/

You can generate the above commands with the script below; removing --dryrun will actually run them in sequence.

python eval/run_test.py --model-path /path/to/model --L 20 --K 10  --T 0.05 --cropSize -1 --dryrun

Test-time Adaptation

To do.

Comments
  • Reproducing with pretrained.pth

    Hi @ajabri,

    Thanks for sharing the code and model.

    However, I am having trouble reproducing your results with the provided pretrained.pth. It only yields J&F-Mean 0.407953.

    Could you please have a check on that?

    Thx!

    opened by xvjiarui 12
  • patch_grid(...): effective stride is always 32?

    https://github.com/ajabri/videowalk/blob/c3e3d7c03001357b0969063d90505b95875b4c83/code/utils/augs.py#L56-L58

    @ajabri Do I understand correctly that after L58 the stride is always equal to [64, 64, 3] and the random number is not used, since the brackets in L57 evaluate to (0.5 - 0.5) == 0?

    opened by vadimkantorov 7
  • Low performance with pretrained.pth

    Hi,

    I recently ran your pre-trained model on DAVIS 2017 with the exact same command you listed in the README:

    python test.py --filelist /path/to/davis/vallist.txt \
    --model-type scratch --resume ../pretrained.pth --save-path /save/path \
    --topk 10 --videoLen 20 --radius 12 --temperature 0.05 --cropSize -1

    However, the final performance based on the official DAVIS evaluation script is not as good as the one claimed in the paper. What I got is around 61 J&F-Mean. The detailed performance is listed below:

    J&F-Mean   J-Mean  J-Recall  J-Decay   F-Mean  F-Recall  F-Decay
     0.614429 0.584634  0.686656 0.225137 0.644223  0.763603 0.256438
    
    ---------- Per sequence results for val ----------
                Sequence   J-Mean   F-Mean
          bike-packing_1 0.496049 0.711096
          bike-packing_2 0.685996 0.752332
             blackswan_1 0.934492 0.973339
             bmx-trees_1 0.301675 0.770057
             bmx-trees_2 0.644392 0.845591
            breakdance_1 0.666383 0.676260
                 camel_1 0.747073 0.855923
        car-roundabout_1 0.852337 0.714172
            car-shadow_1 0.807822 0.778809
                  cows_1 0.920527 0.956957
           dance-twirl_1 0.549648 0.593753
                   dog_1 0.851405 0.867017
             dogs-jump_1 0.302670 0.435166
             dogs-jump_2 0.536664 0.599638
             dogs-jump_3 0.788082 0.822245
         drift-chicane_1 0.729466 0.786235
        drift-straight_1 0.526541 0.528944
                  goat_1 0.800556 0.734920
             gold-fish_1 0.721810 0.717445
             gold-fish_2 0.659471 0.700005
             gold-fish_3 0.820182 0.845394
             gold-fish_4 0.848312 0.915238
             gold-fish_5 0.879084 0.878996
        horsejump-high_1 0.773536 0.888244
        horsejump-high_2 0.723407 0.944909
                 india_1 0.631993 0.592968
                 india_2 0.567645 0.560544
                 india_3 0.629983 0.627841
                  judo_1 0.760509 0.765048
                  judo_2 0.749010 0.756075
             kite-surf_1 0.270090 0.267305
             kite-surf_2 0.004306 0.062131
             kite-surf_3 0.093566 0.127047
              lab-coat_1 0.000000 0.000000
              lab-coat_2 0.000000 0.000300
              lab-coat_3 0.000000 0.000000
              lab-coat_4 0.000000 0.000000
              lab-coat_5 0.000000 0.000000
                 libby_1 0.803691 0.920149
               loading_1 0.900133 0.875399
               loading_2 0.383891 0.567959
               loading_3 0.682442 0.716217
           mbike-trick_1 0.571612 0.743456
           mbike-trick_2 0.639744 0.669962
        motocross-jump_1 0.340788 0.395740
        motocross-jump_2 0.519756 0.554731
    paragliding-launch_1 0.819913 0.923513
    paragliding-launch_2 0.645564 0.885479
    paragliding-launch_3 0.034370 0.137811
               parkour_1 0.805982 0.893970
                  pigs_1 0.812613 0.764461
                  pigs_2 0.617975 0.750136
                  pigs_3 0.906452 0.882834
         scooter-black_1 0.389385 0.669319
         scooter-black_2 0.722495 0.675855
              shooting_1 0.270579 0.454346
              shooting_2 0.747166 0.661882
              shooting_3 0.753406 0.872043
               soapbox_1 0.785921 0.778360
               soapbox_2 0.647941 0.710407
               soapbox_3 0.586195 0.741657
    

    I am wondering whether this is the expected performance without test-time adaptation? Or could you list a detailed step-by-step procedure so we can reproduce the results more easily?

    Thanks.

    opened by lorenmt 7
  • How many GPUs did you use for training?

    Hi, thank you for making the code public.

    I used the training and testing commands you provided. However, the final test result of the model from the last epoch is slightly lower than the number you provided: J&F-Mean 67.6 (yours) vs. 66.9 (ours).

    I'm guessing the problem might be that you didn't use sync_bn, so the batch-norm parameters are computed per GPU, and maybe I'm using a different number of GPUs than you.

    So how many GPUs did you use during training?

    opened by Steve-Tod 7
  • Best feature

    Hi Allan, great work! I see that in the test code, layer4 of the ResNet is removed by default. May I know if this is also the case during training? Or is it better to train with layer4 but test with layer3?

    opened by Zhongdao 6
  • Q. Get affinity matrix for random walk.

    Hello. Thanks for your work!

    I've referred to your code and have a question.

    Please see this line of your code: As = self.affinity(q[:, :, :-1], q[:, :, 1:]) (code/model.py, line 140).

    We can define the affinity matrix for the walk from frame 1 to frame 2 as torch.matmul(frame2, frame1), and then the walk from frame 1 to frame 3 would be matmul(matmul(frame3, frame2), matmul(frame2, frame1)).

    As a result, I think As = self.affinity(q[:, :, :-1], q[:, :, 1:]) should be changed to As = self.affinity(q[:, :, 1:], q[:, :, :-1]).

    But you got good performance in your experiments, so it seems I am missing something. Could you explain it?

    Thanks. :)
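
    For readers of this thread, a small self-contained illustration (not from the repository) of how the chaining direction follows from the chosen convention: if each transition matrix is row-stochastic with rows indexed by the earlier frame's nodes, multi-step walks are right-multiplications; the transposed convention flips this, so either argument order can be consistent as long as the subsequent matmuls agree.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    # Toy features: three frames, 4 nodes per frame, 8-dim L2-normalized embeddings
    f1, f2, f3 = (F.normalize(torch.randn(4, 8), dim=1) for _ in range(3))

    def transition(src, dst, temp=0.07):
        # Row i = distribution over dst nodes for src node i (row-stochastic)
        return torch.softmax(src @ dst.t() / temp, dim=-1)

    A12 = transition(f1, f2)   # walk frame1 -> frame2
    A23 = transition(f2, f3)   # walk frame2 -> frame3
    A13 = A12 @ A23            # walk frame1 -> frame3 under this convention
    print(A13.sum(-1))         # each row still sums to 1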

    opened by sunwoo76 5
  • Label propagation problem

    First of all, thanks for your great work!

    When I do label propagation, this error happens in test.py. (I followed the 'Evaluation: Label Propagation' section of the README.)

    ******* Vid 0 TOOK 63.87427091598511 *******
    ******* Vid 1 (70 frames) *******
    computed features 0.48213911056518555
    Killed

    Why is the process killed after processing only video 0? How can I solve this problem?

    opened by sunwoo76 5
  • Cross-entropy loss computation question

    @ajabri The paper specifies that the loss is the cross-entropy between the row-normalized cycle transition matrix and the identity matrix.

    However, the code seems to compute something slightly different: https://github.com/ajabri/videowalk/blob/0834ff9/code/model.py#L175-L176:

    # self.xent = nn.CrossEntropyLoss(reduction="none")
    logits = torch.log(A+EPS).flatten(0, -2)
    loss = self.xent(logits, target).mean()
    

    where matrix A is row-stochastic.

    The CrossEntropyLoss module expects unnormalized logits and applies log-softmax itself. This amounts to computing log_softmax(log(P[i]))[i], which is not the regular cross-entropy log(P[i])[i]. Should nn.NLLLoss have been used instead?

    The code seems to use log-probs in place of logits (by logits I mean raw unnormalized scores). Is this intentional? If not, it might be a bug. @ajabri, could you please comment on this?

    Thank you!
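
    For readers of this thread, a tiny self-contained check of the distinction being described (illustrative only, not the repository's code): CrossEntropyLoss applies log-softmax to its input, so feeding it log-probabilities is not the same as plain NLLLoss on those log-probabilities.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    EPS = 1e-20
    P = torch.softmax(torch.randn(2, 5), dim=-1)     # a row-stochastic matrix, like A
    target = torch.tensor([0, 3])
    log_probs = torch.log(P + EPS)

    xent = nn.CrossEntropyLoss()(log_probs, target)  # log_softmax is applied again internally
    nll = nn.NLLLoss()(log_probs, target)            # plain cross-entropy on the log-probs
    print(xent.item(), nll.item())                   # the two values differ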

    opened by vadimkantorov 3
  • Label propagation: predictions before context has burned in

    @ajabri Could you please explain how results are filled in for the first n_context = 20 frames? Are they copied from the ground truth? The paper suggests that the ground truth is only used for the 1st frame, but I can't find where predictions for the 2nd-20th frames are filled in. Are they filled in as background?

    From what I could see, predictions affect lbls only after n_context frames https://github.com/ajabri/videowalk/blob/0834ff9/code/test.py#L144-L148:

    if t > 0:
        lbls[t + n_context] = pred
    else:
        pred = lbls[0]
        lbls[t + n_context] = pred
    

    For DAVIS evaluation, the frames are saved at index t and not t + n_context https://github.com/ajabri/videowalk/blob/0834ff9/code/test.py#L168:

    outpath = os.path.join(args.save_path, str(vid_idx) + '_' + str(t))
    

    Are these 2nd-20th frames included in the error-metric evaluation, and what predictions are used for these frames?

    Thanks, @ajabri !

    opened by vadimkantorov 3
  • Using selfsim_fc layer for label propagation

    @ajabri By chance, have you tried using the layer from the selfsim_fc head for label propagation? In appendix G you mention that res4 features perform worse than res3. But what about selfsim_fc? It is located even closer to the loss function; does it perform even worse than res4?

    Thanks!

    opened by vadimkantorov 3
  • Different image normalization mean/std in different code paths

    @ajabri I noticed that different code paths use different image normalization parameters.

    Training Kinetics400 path: https://github.com/ajabri/videowalk/blob/0834ff9/code/utils/augs.py#L10-L11 :

    IMG_MEAN = (0.4914, 0.4822, 0.4465)
    IMG_STD  = (0.2023, 0.1994, 0.2010)
    

    Evaluation DAVIS2017 path: https://github.com/ajabri/videowalk/blob/0834ff9/code/data/vos.py#L173:

    mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
    

    Both seem to be in RGB format. Is that correct?

    Why are they different? Does this lead to better accuracy? Thanks!

    opened by vadimkantorov 3
  • Is it possible to use the ILSVRC-VID dataset to train?

    Hi, thanks for sharing your work! I was wondering if it is possible to train the model on the ILSVRC-VID or YTB-VOS datasets instead?

    I have tried creating ILSVRC-VID and YTB-VOS datasets that return a Tensor[F, H, W, C], where F is the number of frames, without any transformation. However, after passing through the train transform, a tuple is returned instead.

    This tuple in turn gives me an error in train.py under train_one_epoch, at video = video.to(device): 'list' object has no attribute 'to'. How can I rectify this issue? Thanks.

    opened by SimJJ96 0
  • Is it possible to use the trained model for fine tuning?

    Thanks for fantastic work, I have some questions:

    1. How can we fine-tune on a custom dataset?
    2. Is it possible to train the model on a small dataset?
    3. Can you please explain how to visualize the results on another video (demo)?
    opened by zobeirraisi 0
  • Problems with expansion

    Hi, first of all thanks for sharing your code, it worked like a charm! I am using your work to track the process of contraction and expansion for various processes. Tracking an object which contracts itself (e.g. a balloon losing its air) works perfectly! In contrast, however, tracking the expansion when filling it with air doesn't work as well: only half of the object is captured at maximum expansion. I have already tried increasing the radius, but the problem is that it somehow selects features next to the object as most similar. Do you have any idea how to circumvent this problem (e.g. training with smaller patches)? Thank you!

    opened by mrfh12 1
  • Handling total occlusions

    I'm trying to reproduce some of the results in the paper, and I'm interested in how the model deals with total occlusions.

    For example, I notice in the extra qualitative results you provide, there is a moment where the person being tracked is fully occluded as someone else on a bike passes by (specifically here: https://youtu.be/R_Zae5N_hKw), and the occluded nodes no longer have labels. I'm unsure how all of the labels disappeared. What happens to a node when it is entirely occluded and goes out of sight?

    In some initial results of running the model, it appears to predict that entirely occluded nodes (incorrectly) transition to neighbouring nodes or thereafter start tracking the occlusion, as opposed to not being predicted at all.

    Thanks for any help in advance!

    opened by annahadji 1
  • Test time training code

    Hi Allan,

    Many thanks for releasing the code! Could you tell us when you plan to release the test-time training code? Or would it be possible to give some suggestions on how to implement it based on the current codebase?

    Many thanks!

    opened by AndyTang15 1
  • Efficient way to download Kinetics-400

    @ajabri Would downloading it from AcademicTorrents give the right size/directory structure?

    Or did you download it using https://github.com/Showmax/kinetics-downloader (recommended at https://github.com/pytorch/vision/tree/master/references/video_classification#data-preparation), which runs youtube-dl and then converts everything to mp4 (and, I guess, h264)? I tried it and in 2 hours it had downloaded only ~500 MB out of ~400 GB.

    Do you know if clips must be converted to mp4? Or does VideoClips just use ffmpeg once for sampling frames (in which case re-encoding to the same format is not needed)?

    Did you use some other way?

    What is the expected Kinetics400 dataset directory structure? (It is not explained at https://pytorch.org/docs/stable/torchvision/datasets.html#kinetics-400 or in the dataset metadata.) Is it /path/to/dataset/<split>/<classlabel>/<youtubeid>.avi?

    If yes, then what is the origin of train_256? From what I understood, the only splits are train, val and test.

    Thanks a lot!

    opened by vadimkantorov 10