Repository for "Space-Time Correspondence as a Contrastive Random Walk" (NeurIPS 2020)

Overview

This is the repository for Space-Time Correspondence as a Contrastive Random Walk, published at NeurIPS 2020.

[Paper] [Project Page] [Slides] [Poster] [Talk]

@inproceedings{jabri2020walk,
    Author = {Allan Jabri and Andrew Owens and Alexei A. Efros},
    Title = {Space-Time Correspondence as a Contrastive Random Walk},
    Booktitle = {Advances in Neural Information Processing Systems},
    Year = {2020},
}

Consider citing our work or acknowledging this repository if you found this code to be helpful :)

Requirements

  • pytorch (>1.3)
  • torchvision (0.6.0)
  • cv2
  • matplotlib
  • skimage
  • imageio

For visualization (--visualize):

  • wandb
  • visdom
  • sklearn

Train

An example training command is:

python -W ignore train.py --data-path /path/to/kinetics/ \
--frame-aug grid --dropout 0.1 --clip-len 4 --temp 0.05 \
--model-type scratch --workers 16 --batch-size 20  \
--cache-dataset --data-parallel --visualize --lr 0.0001

This yields a model with performance on DAVIS as follows (see below for evaluation instructions), provided as pretrained.pth:

 J&F-Mean    J-Mean  J-Recall  J-Decay    F-Mean  F-Recall   F-Decay
  0.67606  0.645902  0.758043   0.2031  0.706219   0.83221  0.246789

Arguments of interest:

  • --dropout: The rate of edge dropout (default 0.1).
  • --clip-len: Length of video sequence.
  • --temp: Softmax temperature.
  • --model-type: Type of encoder. Use scratch or scratch_zeropad if training from scratch. Use imagenet18 to load an Imagenet-pretrained network. Use scratch with --resume if reloading a checkpoint.
  • --batch-size: I've managed to train models with batch sizes between 6 and 24. If you can afford a larger batch size, consider increasing the --lr from 0.0001 to 0.0003.
  • --frame-aug: grid samples a grid of patches to get nodes; none will just use a single image and use embeddings in the feature map as nodes (a conceptual sketch of the grid mode follows this list).
  • --visualize: Log diagnostics to wandb and data visualizations to visdom.
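
To illustrate the grid mode, here is a minimal, hypothetical sketch of how a frame can be cut into a grid of overlapping patches that become the nodes of the space-time graph. The patch size and stride below are assumptions for illustration, not the repository's exact augmentation code:

import torch
import torch.nn.functional as F

def sample_patch_grid(frame, patch_size=64, stride=32):
    # frame: (C, H, W). Extract a grid of overlapping patches; each patch is one node.
    c, h, w = frame.shape
    patches = F.unfold(frame.unsqueeze(0), kernel_size=patch_size, stride=stride)
    n = patches.shape[-1]                        # number of patches in the grid
    return patches.transpose(1, 2).reshape(n, c, patch_size, patch_size)

frame = torch.randn(3, 256, 256)
nodes = sample_patch_grid(frame)
print(nodes.shape)                               # torch.Size([49, 3, 64, 64]) -> a 7x7 grid of nodes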

Data

We use the official torchvision.datasets.Kinetics400 class for training. You can find directions for downloading Kinetics here. In particular, the code expects the path given for kinetics to contain a train_256 subdirectory.

You can also provide --data-path with a file containing a list of image directories, or a path to a directory of directories of images. In this case, clips are randomly subsampled from the directory.
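
For reference, here is a hypothetical sketch of the kind of layout these two modes are meant to handle; the paths and names below are illustrative assumptions, not a required convention. Either a plain-text file listing one frame directory per line,

/data/frames/video_0001
/data/frames/video_0002

or a directory whose subdirectories each contain the frames of one video,

/data/frames/
    video_0001/00000.jpg, 00001.jpg, ...
    video_0002/00000.jpg, 00001.jpg, ...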

Visualization

By default, the training script will log diagnostics to wandb and data visualizations to visdom.

Pretrained Model

You can find the model resulting from the training command above at pretrained.pth. We are still training updated ablation models and will post them when ready.


Evaluation: Label Propagation

The label propagation algorithm is described in test.py. The output of test.py (predicted label maps) must be post-processed for evaluation.
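
As a rough guide to what the propagation step does, here is a minimal, hedged sketch of attention-based label propagation. It is illustrative only: it omits details of test.py such as the spatial radius restriction and the context-frame queue, and the function and tensor names are made up:

import torch

def propagate_labels(feat_tgt, feats_ctx, lbls_ctx, topk=10, temperature=0.05):
    # feat_tgt:  (C, N) L2-normalized features of the target frame (N nodes)
    # feats_ctx: (C, M) features of the context frames, concatenated
    # lbls_ctx:  (M, K) soft label distributions of the context nodes (K classes)
    aff = feat_tgt.t() @ feats_ctx / temperature             # (N, M) affinities
    vals, idx = aff.topk(topk, dim=-1)                       # keep top-k context nodes per target node
    weights = torch.softmax(vals, dim=-1)                    # (N, topk) attention weights
    return (weights.unsqueeze(-1) * lbls_ctx[idx]).sum(1)    # (N, K) propagated labels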

DAVIS

To evaluate a trained model on the DAVIS task, clone the davis2017-evaluation repository, and prepare the data by downloading the 2017 dataset and modifying the paths provided in eval/davis_vallist.txt. Then, run:

Label Propagation:

python test.py --filelist /path/to/davis/vallist.txt \
--model-type scratch --resume ../pretrained.pth --save-path /save/path \
--topk 10 --videoLen 20 --radius 12  --temperature 0.05  --cropSize -1

Though test.py expects a model file created with train.py, it can easily be modified to be used with other networks. Note that we simply use the same temperature used at training time.

You can also run the ImageNet baseline with the command below.

python test.py --filelist /path/to/davis/vallist.txt \
--model-type imagenet18 --save-path /save/path \
--topk 10 --videoLen 20 --radius 12  --temperature 0.05  --cropSize -1

Post-Process:

# Convert
python eval/convert_davis.py --in_folder /save/path/ --out_folder /converted/path --dataset /davis/path/

# Compute metrics
python /path/to/davis2017-evaluation/evaluation_method.py \
--task semi-supervised   --results_path /converted/path --set val \
--davis_path /path/to/davis/

You can generate the above commands with the script below; removing --dryrun will actually run them in sequence.

python eval/run_test.py --model-path /path/to/model --L 20 --K 10  --T 0.05 --cropSize -1 --dryrun

Test-time Adaptation

To do.

Comments
  • Reproducing with pretrained.pth

    Hi @ajabri,

    Thanks for sharing the code and model.

    However, I am having trouble reproducing your results with the provided pretrained.pth. It only yields J&F-Mean 0.407953.

    Could you please have a check on that?

    Thx!

    opened by xvjiarui 12
  • patch_grid(...): effective stride is always 32?

    https://github.com/ajabri/videowalk/blob/c3e3d7c03001357b0969063d90505b95875b4c83/code/utils/augs.py#L56-L58

    @ajabri Do I understand correctly that after L58 the stride is always equal to [64, 64, 3] and the random number is not used, since the brackets in L57 evaluate to (0.5 - 0.5) == 0?

    opened by vadimkantorov 7
  • Low performance with pretrained.pth

    Hi,

    I recently ran your pre-trained model on DAVIS 2017 with the exact same command you listed in the README:

    python test.py --filelist /path/to/davis/vallist.txt \
    --model-type scratch --resume ../pretrained.pth --save-path /save/path \
    --topk 10 --videoLen 20 --radius 12 --temperature 0.05 --cropSize -1

    However, the final performance based on the official DAVIS evaluation script is not as good as the one claimed in the paper. What I got is around 61 J&F-Mean. The detailed performance is listed below:

    J&F-Mean   J-Mean  J-Recall  J-Decay   F-Mean  F-Recall  F-Decay
     0.614429 0.584634  0.686656 0.225137 0.644223  0.763603 0.256438
    
    ---------- Per sequence results for val ----------
                Sequence   J-Mean   F-Mean
          bike-packing_1 0.496049 0.711096
          bike-packing_2 0.685996 0.752332
             blackswan_1 0.934492 0.973339
             bmx-trees_1 0.301675 0.770057
             bmx-trees_2 0.644392 0.845591
            breakdance_1 0.666383 0.676260
                 camel_1 0.747073 0.855923
        car-roundabout_1 0.852337 0.714172
            car-shadow_1 0.807822 0.778809
                  cows_1 0.920527 0.956957
           dance-twirl_1 0.549648 0.593753
                   dog_1 0.851405 0.867017
             dogs-jump_1 0.302670 0.435166
             dogs-jump_2 0.536664 0.599638
             dogs-jump_3 0.788082 0.822245
         drift-chicane_1 0.729466 0.786235
        drift-straight_1 0.526541 0.528944
                  goat_1 0.800556 0.734920
             gold-fish_1 0.721810 0.717445
             gold-fish_2 0.659471 0.700005
             gold-fish_3 0.820182 0.845394
             gold-fish_4 0.848312 0.915238
             gold-fish_5 0.879084 0.878996
        horsejump-high_1 0.773536 0.888244
        horsejump-high_2 0.723407 0.944909
                 india_1 0.631993 0.592968
                 india_2 0.567645 0.560544
                 india_3 0.629983 0.627841
                  judo_1 0.760509 0.765048
                  judo_2 0.749010 0.756075
             kite-surf_1 0.270090 0.267305
             kite-surf_2 0.004306 0.062131
             kite-surf_3 0.093566 0.127047
              lab-coat_1 0.000000 0.000000
              lab-coat_2 0.000000 0.000300
              lab-coat_3 0.000000 0.000000
              lab-coat_4 0.000000 0.000000
              lab-coat_5 0.000000 0.000000
                 libby_1 0.803691 0.920149
               loading_1 0.900133 0.875399
               loading_2 0.383891 0.567959
               loading_3 0.682442 0.716217
           mbike-trick_1 0.571612 0.743456
           mbike-trick_2 0.639744 0.669962
        motocross-jump_1 0.340788 0.395740
        motocross-jump_2 0.519756 0.554731
    paragliding-launch_1 0.819913 0.923513
    paragliding-launch_2 0.645564 0.885479
    paragliding-launch_3 0.034370 0.137811
               parkour_1 0.805982 0.893970
                  pigs_1 0.812613 0.764461
                  pigs_2 0.617975 0.750136
                  pigs_3 0.906452 0.882834
         scooter-black_1 0.389385 0.669319
         scooter-black_2 0.722495 0.675855
              shooting_1 0.270579 0.454346
              shooting_2 0.747166 0.661882
              shooting_3 0.753406 0.872043
               soapbox_1 0.785921 0.778360
               soapbox_2 0.647941 0.710407
               soapbox_3 0.586195 0.741657
    

    I am wondering whether this is the expected performance without test-time adaptation? Or could you list a detailed step-by-step procedure so we can reproduce the results more easily?

    Thanks.

    opened by lorenmt 7
  • How many GPUs did you use for training?

    Hi, thank you for making the code public.

    I used the training and testing commands you provided. However, the final test result of the model from the last epoch is slightly lower than the number you provided: J&F-Mean 67.6 (yours) vs. 66.9 (ours).

    I'm guessing the problem might be that you didn't use sync_bn, so the batch-norm parameters are computed per GPU, and maybe I'm using a different number of GPUs than you.

    So how many GPUs did you use during training?

    opened by Steve-Tod 7
  • Best feature

    Hi Allan, great work! I see that in the test code, layer4 of the ResNet is removed by default. May I know if this is also the case during training? Or is it better to train with layer4 but test with layer3?

    opened by Zhongdao 6
  • Q. Get affinity matrix for random walk.

    Hello. Thanks for your work!

    I've referred to your code and have a question.

    Please see this line of your code: As = self.affinity(q[:, :, :-1], q[:, :, 1:]) (code/model.py, line 140).

    We can define the affinity matrix for the walk from frame 1 to frame 2 as torch.matmul(frame2, frame1), and then the walk from frame 1 to frame 3 would be matmul(matmul(frame3, frame2), matmul(frame2, frame1)).

    As a result, I think As = self.affinity(q[:, :, :-1], q[:, :, 1:]) should be changed to As = self.affinity(q[:, :, 1:], q[:, :, :-1]).

    But you got good performance in your experiments, so it seems I am missing something. Could you explain it?

    Thanks. :)
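
    For readers of this thread, a small self-contained illustration (not from the repository) of how the chaining direction follows from the chosen convention: if each transition matrix is row-stochastic with rows indexed by the earlier frame's nodes, multi-step walks are right-multiplications; the transposed convention flips this, so either argument order can be consistent as long as the subsequent matmuls agree.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    # Toy features: three frames, 4 nodes per frame, 8-dim L2-normalized embeddings
    f1, f2, f3 = (F.normalize(torch.randn(4, 8), dim=1) for _ in range(3))

    def transition(src, dst, temp=0.07):
        # Row i = distribution over dst nodes for src node i (row-stochastic)
        return torch.softmax(src @ dst.t() / temp, dim=-1)

    A12 = transition(f1, f2)   # walk frame1 -> frame2
    A23 = transition(f2, f3)   # walk frame2 -> frame3
    A13 = A12 @ A23            # walk frame1 -> frame3 under this convention
    print(A13.sum(-1))         # each row still sums to 1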

    opened by sunwoo76 5
  • Label propagation problem

    First of all, thanks for your great work!

    When I do label propagation, this error happens in test.py. (I followed the 'Evaluation: Label Propagation' section of the README.)

    ******* Vid 0 TOOK 63.87427091598511 *******
    ******* Vid 1 (70 frames) *******
    computed features 0.48213911056518555
    Killed

    Why is the process killed after processing only video 0? How can I solve this problem?

    opened by sunwoo76 5
  • Cross-entropy loss computation question

    @ajabri The paper specifies that the loss is the cross-entropy between the row-normalized cycle transition matrix and the identity matrix.

    However, the code seems to compute something slightly different: https://github.com/ajabri/videowalk/blob/0834ff9/code/model.py#L175-L176:

    # self.xent = nn.CrossEntropyLoss(reduction="none")
    logits = torch.log(A+EPS).flatten(0, -2)
    loss = self.xent(logits, target).mean()
    

    where matrix A is row-stochastic.

    The CrossEntropyLoss module expects unnormalized logits and applies log-softmax itself. This amounts to computing log_softmax(log(P[i]))[i], which is not the regular cross-entropy log(P[i])[i]. Should nn.NLLLoss have been used instead?

    The code seems to use log-probs in place of logits (by logits I mean raw unnormalized scores). Is this intentional? If not, it might be a bug. @ajabri, could you please comment on this?

    Thank you!
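
    For readers of this thread, a tiny self-contained check of the distinction being described (illustrative only, not the repository's code): CrossEntropyLoss applies log-softmax to its input, so feeding it log-probabilities is not the same as plain NLLLoss on those log-probabilities.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    EPS = 1e-20
    P = torch.softmax(torch.randn(2, 5), dim=-1)     # a row-stochastic matrix, like A
    target = torch.tensor([0, 3])
    log_probs = torch.log(P + EPS)

    xent = nn.CrossEntropyLoss()(log_probs, target)  # log_softmax is applied again internally
    nll = nn.NLLLoss()(log_probs, target)            # plain cross-entropy on the log-probs
    print(xent.item(), nll.item())                   # the two values differ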

    opened by vadimkantorov 3
  • Label propagation: predictions before context has burned in

    @ajabri Could you please explain how results are filled in for the first n_context = 20 frames? Are they copied from the ground truth? The paper suggests that the ground truth is only used for the 1st frame, but I can't find where predictions for the 2nd-20th frames are filled in. Are they filled in as background?

    From what I could see, predictions affect lbls only after n_context frames https://github.com/ajabri/videowalk/blob/0834ff9/code/test.py#L144-L148:

    if t > 0:
        lbls[t + n_context] = pred
    else:
        pred = lbls[0]
        lbls[t + n_context] = pred
    

    For DAVIS evaluation, the frames are saved at index t and not t + n_context https://github.com/ajabri/videowalk/blob/0834ff9/code/test.py#L168:

    outpath = os.path.join(args.save_path, str(vid_idx) + '_' + str(t))
    

    Are these 2nd-20th frames included in the error-metric evaluation, and what predictions are used for these frames?

    Thanks, @ajabri !

    opened by vadimkantorov 3
  • Using selfsim_fc layer for label propagation

    @ajabri By chance, have you tried using the layer from the selfsim_fc head for label propagation? In appendix G you mention that res4 features perform worse than res3. But what about selfsim_fc? It is located even closer to the loss function; does it perform even worse than res4?

    Thanks!

    opened by vadimkantorov 3
  • Different image normalization mean/std in different code paths

    @ajabri I noticed that different code paths use different image normalization parameters.

    Training Kinetics400 path: https://github.com/ajabri/videowalk/blob/0834ff9/code/utils/augs.py#L10-L11 :

    IMG_MEAN = (0.4914, 0.4822, 0.4465)
    IMG_STD  = (0.2023, 0.1994, 0.2010)
    

    Evaluation DAVIS2017 path: https://github.com/ajabri/videowalk/blob/0834ff9/code/data/vos.py#L173:

    mean, std = [0.485, 0.456, 0.406], [0.229, 0.224, 0.225]
    

    Both seem to be in RGB format. Is that correct?

    Why are they different? Does this lead to better accuracy? Thanks!

    opened by vadimkantorov 3
  • Is it possible to use the ILSVRC-VID dataset to train?

    Hi, thanks for sharing your work! I was wondering if it is possible to train the model on the ILSVRC-VID or YTB-VOS datasets instead?

    I have tried creating ILSVRC-VID and YTB-VOS datasets that return a Tensor[F, H, W, C], where F is the number of frames, without any transformation. However, after passing through the train transform, a tuple is returned instead.

    This tuple in turn gives me an error in train.py under train_one_epoch, at video = video.to(device): 'list' object has no attribute 'to'. How can I rectify this issue? Thanks.

    opened by SimJJ96 0
  • Is it possible to use the trained model for fine tuning?

    Thanks for fantastic work, I have some questions:

    1. How can we fine-tune on a custom dataset?
    2. Is it possible to train the model on a small dataset?
    3. Can you please explain how to visualize the results on another video (demo)?
    opened by zobeirraisi 0
  • Problems with expansion

    Hi, first of all thanks for sharing your code, it worked like a charm! I am using your work to track the process of contraction and expansion for various processes. Tracking an object which contracts itself (e.g. a balloon losing its air) works perfectly! In contrast, however, tracking the expansion when filling it with air doesn't work as well: only half of the object is captured at maximum expansion. I have already tried increasing the radius, but the problem is that it somehow selects features next to the object as most similar. Do you have any idea how to circumvent this problem (e.g. training with smaller patches)? Thank you!

    opened by mrfh12 1
  • Handling total occlusions

    I'm trying to reproduce some of the results in the paper, and I'm interested in how the model deals with total occlusions.

    For example, I notice in the extra qualitative results you provide, there is a moment where the person being tracked is fully occluded as someone else on a bike passes by (specifically here: https://youtu.be/R_Zae5N_hKw), and the occluded nodes no longer have labels. I'm unsure how all of the labels disappeared. What happens to a node when it is entirely occluded and goes out of sight?

    In some initial results of running the model, it appears to predict that entirely occluded nodes (incorrectly) transition to neighbouring nodes or thereafter start tracking the occlusion, as opposed to not being predicted at all.

    Thanks for any help in advance!

    opened by annahadji 1
  • Test time training code

    Hi Allan,

    Many thanks for releasing the code! Could you tell us when you plan to release the test-time training code? Or would it be possible to give some suggestions on how to implement it based on the current codebase?

    Many thanks!

    opened by AndyTang15 1
  • Efficient way to download Kinetics-400

    @ajabri Would downloading it from AcademicTorrents give the right size/directory structure?

    Or did you download it using https://github.com/Showmax/kinetics-downloader (recommended at https://github.com/pytorch/vision/tree/master/references/video_classification#data-preparation), which runs youtube-dl and then converts everything to mp4 (and, I guess, h264)? I tried it and in 2 hours it had downloaded only ~500 MB out of ~400 GB.

    Do you know if clips must be converted to mp4? Or does VideoClips just use ffmpeg once for sampling frames (in which case re-encoding to the same format is not needed)?

    Did you use some other way?

    What is the expected Kinetics400 dataset directory structure? (It is not explained at https://pytorch.org/docs/stable/torchvision/datasets.html#kinetics-400 or in the dataset metadata.) Is it /path/to/dataset/<split>/<classlabel>/<youtubeid>.avi?

    If yes, then what is the origin of train_256? From what I understood, the only splits are train, val and test.

    Thanks a lot!

    opened by vadimkantorov 10