Official implementation of ACMMM'20 paper 'Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework'

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework

Official code for paper, Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework [ACMMM'20].

Arxiv paper Project page

Requirements

This is my experimental environment:

  • PyTorch 1.3.0
  • python 3.7.4
  • accimage

PyTorch 1.7.0 appears to be incompatible with the current code and leads to poor performance (see issue #9).

Inter-intra contrastive (IIC) framework

For samples, we have

  • Inter-positives: samples with same labels, not used for self-supervised learning;
  • Inter-negatives: different samples, or samples with different indexes;
  • Intra-positives: data from the same sample, in different views / from different augmentations;
  • Intra-negatives: data from the same sample in which some kind of information has been broken. For video, the temporal information is destroyed.

Our work uses all usable parts (under this categorization) to form an inter-intra contrastive framework. The experiments here are mainly based on Contrastive Multiview Coding.

The framework can be readily extended to other contrastive learning methods that use negative samples, such as MoCo and SimCLR.
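As a rough illustration of how intra-negatives enter the objective, the sketch below computes a simplified InfoNCE-style loss in plain Python. It is not the repository's implementation, which builds on Contrastive Multiview Coding; the names `dot` and `info_nce`, the toy feature vectors, and the default temperature are assumptions made for this example only.

```python
import math


def dot(a, b):
    """Inner product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Simplified InfoNCE loss: -log softmax score of the positive pair.

    `negatives` may hold inter-negatives (other videos) and
    intra-negatives (the same video with temporal order broken).
    """
    logits = [dot(anchor, positive) / temperature]
    logits += [dot(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]


# Toy features (hypothetical values, assumed to be L2-normalised).
anchor = [1.0, 0.0]
intra_positive = [1.0, 0.0]      # another view of the same clip
inter_negatives = [[0.0, 1.0]]   # a different video
intra_negative = [0.8, 0.6]      # same video, temporal info destroyed

loss_inter_only = info_nce(anchor, intra_positive, inter_negatives)
loss_inter_intra = info_nce(anchor, intra_positive, inter_negatives + [intra_negative])
```

In this toy setting the intra-negative is close to the anchor, so adding it to the negative set raises the loss and gives the network an extra signal to separate the two.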

Highlights

Make the most of data for contrastive learning.

In addition to inter-negative samples, the remaining usable data (intra-positives and intra-negatives) also help train the network, so the inter-intra framework makes fuller use of the data in contrastive learning.

Flexibility of the framework

The inter-intra learning framework can be extended to

  • Different contrastive learning methods: CMC, MoCo, SimCLR ...
  • Different intra-negative generation methods: frame repeating, frame shuffling ...
  • Different backbones: C3D, R3D, R(2+1)D, I3D ...
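The two intra-negative generation methods listed above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the repository's code: a clip is modelled as a list of frames, and the helper names `repeat_intra_negative` and `shuffle_intra_negative` are hypothetical.

```python
import random


def repeat_intra_negative(clip, rng):
    """Frame repeating: pick one frame at random and repeat it for the whole clip."""
    frame = clip[rng.randrange(len(clip))]
    return [frame for _ in clip]


def shuffle_intra_negative(clip, rng):
    """Frame shuffling: return the frames in a random order that differs from the original."""
    if len(clip) < 2:
        return list(clip)  # nothing to shuffle
    order = list(range(len(clip)))
    while order == sorted(order):  # reshuffle until the order actually changes
        rng.shuffle(order)
    return [clip[i] for i in order]


rng = random.Random(0)
clip = ["f0", "f1", "f2", "f3"]  # placeholder frames
repeated = repeat_intra_negative(clip, rng)
shuffled = shuffle_intra_negative(clip, rng)
```

Both variants keep the appearance of the source clip while breaking its temporal information, which is what makes them negatives for the original clip.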

Updates

  • Oct. 1, 2020 - Results using C3D and R(2+1)D are added; random seeds are fixed more tightly.
  • Aug. 26, 2020 - Pretrained weights for R3D are added.

Usage of this repo

Note: we have added code to fix random seeds more tightly for better reproducibility. However, the results in our paper used the previous random seed settings, so performance may differ slightly from that reported in the paper. To reproduce the retrieval results of the paper exactly, please use the provided model weights.
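For readers unfamiliar with seed fixing, a typical pattern looks like the sketch below. It is an assumption about what "fixing seeds tightly" usually involves in PyTorch projects, not a copy of this repository's code; the function name `set_seed` is hypothetical, and NumPy and PyTorch are seeded only if they are installed.

```python
import random


def set_seed(seed):
    """Seed the common sources of randomness used in a training run."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Trade some speed for deterministic cuDNN kernels.
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


set_seed(632)
first = random.random()
set_seed(632)
second = random.random()  # same value as `first`
```

Even with all seeds fixed, some GPU operations remain non-deterministic, which is one reason small differences between runs can persist.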

Data preparation

You can download the UCF101 and HMDB51 datasets from their official websites: UCF101 and HMDB51. Then decode the videos into frames.
I highly recommend the pre-computed optical flow images and resized RGB frames in this repo.

If you use the pre-computed frames, the folder layout is path/to/dataset/video_id/frames.jpg. If you decode the frames yourself, the layout may instead be path/to/dataset/class_name/video_id/frames.jpg, in which case you need to adjust the corresponding paths in the dataset code.

For pre-computed frames, find rgb_folder, u_folder and v_folder in datasets/ucf101.py for the UCF101 dataset and change them to match your environment. Please note that all of these modalities are loaded even in settings where optical flow data are not used to train the model.

If you have not prepared optical flow data, setting u_folder=rgb_folder and v_folder=rgb_folder should avoid errors.
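Before editing the dataset paths, it can help to check which of the two folder layouts you actually have. The helper below is a hypothetical convenience sketch (`detect_layout` is not part of this repository) that assumes frames are stored as .jpg files.

```python
from pathlib import Path


def detect_layout(dataset_root):
    """Return 'flat' for root/video_id/*.jpg, 'nested' for
    root/class_name/video_id/*.jpg, or 'unknown' otherwise."""
    root = Path(dataset_root)
    for child in sorted(p for p in root.iterdir() if p.is_dir()):
        if any(child.glob("*.jpg")):
            return "flat"
        if any(child.glob("*/*.jpg")):
            return "nested"
    return "unknown"


# Fallback described above when optical flow has not been prepared:
rgb_folder = "path/to/dataset"  # placeholder path, replace with your own
u_folder = rgb_folder
v_folder = rgb_folder
```

A "flat" result matches the pre-computed frames; a "nested" result means the paths built in the dataset code need the extra class_name level.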

Train self-supervised learning part

python train_ssl.py --dataset=ucf101

This is equivalent to

python train_ssl.py --dataset=ucf101 --model=r3d --modality=res --neg=repeat

By default, frame repeating generates the intra-negative samples for videos, and R3D is the backbone.

We use two views in our experiments. View #1 is an RGB video clip; View #2 can be an RGB, residual (Res), or optical flow video clip. Residual video clips are the default modality for View #2. You can use --modality to try other modalities. Intra-negative samples are generated from View #1.

It may seem weird to use only one optical flow channel, u or v. We use a single channel so that one model can handle inputs from different modalities. Using a separate model for each modality is also an option.

Retrieve video clips

python retrieve_clips.py --ckpt=/path/to/your/model --dataset=ucf101 --merge=True

One model is used to handle different views/modalities. You can set --modality to decide which modality to use. With --merge=True, RGB for View #1 and the chosen modality for View #2 are used jointly for retrieval.

With the default training settings, it is easy to get over 30% @top1 for video retrieval on UCF101 and around 13% @top1 on HMDB51 without joint retrieval.
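To make the "@top1" style numbers concrete, the sketch below shows one common way to score nearest-neighbour retrieval: a query counts as a hit at top-k if any of its k most similar gallery clips shares its class label. This is an assumed evaluation protocol written in plain Python for illustration; the names `cosine` and `topk_accuracy` are hypothetical, and retrieve_clips.py may differ in detail.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def topk_accuracy(queries, query_labels, gallery, gallery_labels, k):
    """Fraction of queries whose k nearest gallery items contain the query's label."""
    hits = 0
    for feat, label in zip(queries, query_labels):
        ranked = sorted(range(len(gallery)),
                        key=lambda i: cosine(feat, gallery[i]),
                        reverse=True)
        if label in (gallery_labels[i] for i in ranked[:k]):
            hits += 1
    return hits / len(queries)


# Toy features (hypothetical): gallery = training clips, queries = test clips.
gallery = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
gallery_labels = ["a", "b", "b"]
queries = [[1.0, 0.05]]
query_labels = ["b"]

top1 = topk_accuracy(queries, query_labels, gallery, gallery_labels, k=1)
top2 = topk_accuracy(queries, query_labels, gallery, gallery_labels, k=2)
```

In this toy case the nearest gallery clip has the wrong label, so the query misses at top-1 but hits at top-2, which is why the top-k columns in the tables grow with k.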

Fine-tune model for video recognition

python ft_classify.py --ckpt=/path/to/your/model --dataset=ucf101

Testing will be automatically conducted at the end of training.

Or you can use

python ft_classify.py --ckpt=/path/to/your/model --dataset=ucf101 --mode=test

In this way, only testing is conducted using the given model.

Note: Accuracies with residual clips are not stable on the validation set (this may also be caused by the limited number of validation samples). The final testing step uses the model that performs best on the validation set.

If everything is fine, you can achieve around 70% accuracy on UCF101. Results vary with the random seed.

Results

Retrieval results

The table lists retrieval results on UCF101 split 1. We reimplemented CMC and report its results. Other results are taken from the corresponding papers. VCOP, VCP, CMC, PRP, and ours are based on the R3D network backbone.

Method top1 top5 top10 top20 top50
Jigsaw 19.7 28.5 33.5 40.0 49.4
OPN 19.9 28.7 34.0 40.6 51.6
R3D (random) 9.9 18.9 26.0 35.5 51.9
VCOP 14.1 30.3 40.4 51.1 66.5
VCP 18.6 33.6 42.5 53.5 68.1
CMC 26.4 37.7 45.1 53.2 66.3
Ours (repeat + res) 36.5 54.1 62.9 72.4 83.4
Ours (repeat + u) 41.8 60.4 69.5 78.4 87.7
Ours (shuffle + res) 34.6 53.0 62.3 71.7 82.4
Ours (shuffle + v) 42.4 60.9 69.2 77.1 86.5
PRP 22.8 38.5 46.7 55.2 69.1
RTT 26.1 48.5 59.1 69.6 82.8
MemDPC-RGB 20.2 40.4 52.4 64.7 -
MemDPC-Flow 40.2 63.2 71.9 78.6 -

Recognition results

We only use R3D as our network backbone. In this table, all reported results are pre-trained on UCF101.

Other factors usually help as well: (1) using ResNet-18-3D, R(2+1)D, or deeper networks; (2) pre-training on larger datasets; (3) using larger input resolutions; (4) combining with audio or other features.

Method UCF101 HMDB51
Jigsaw 51.5 22.5
O3N (res) 60.3 32.5
OPN 56.3 22.1
OPN (res) 71.8 36.7
CrossLearn 58.7 27.2
CMC (3 views) 59.1 26.7
R3D (random) 54.5 23.4
ImageNet-inflated 60.3 30.7
3D ST-puzzle 65.8 33.7
VCOP (R3D) 64.9 29.5
VCOP (R(2+1)D) 72.4 30.9
VCP (R3D) 66.0 31.5
Ours (repeat + res, R3D) 72.8 35.3
Ours (repeat + u, R3D) 72.7 36.8
Ours (shuffle + res, R3D) 74.4 38.3
Ours (shuffle + v, R3D) 67.0 34.0
PRP (R3D) 66.5 29.7
PRP (R(2+1)D) 72.1 35.0

Residual clips + 3D CNN

Residual clips with 3D CNNs are effective, especially when training from scratch. More information about this part can be found in Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition (previous but more detailed version) and Motion Representation Using Residual Frames with 3D CNN (short version with better results).

The key code for this part is

shift_x = torch.roll(x,1,2)
x = ((shift_x -x) + 1)/2

which differs slightly from the formulation in the papers.
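To show what the two lines above compute, here is a framework-free sketch of the same arithmetic on a tiny clip. It assumes that dimension 2 of `x` is the temporal axis (as in the usual N x C x T x H x W video layout) and that pixel values lie in [0, 1]; the function name `residual_clip` is hypothetical, and each frame is flattened to a plain list of values.

```python
def residual_clip(frames):
    """Residual frames for a list of T frames (each a flat list of pixel values).

    Mirrors rolling by one step along time and then computing
    ((shifted - x) + 1) / 2, which maps differences from [-1, 1] to [0, 1].
    """
    # Roll by 1 along time: frame t-1 moves to position t, the last frame wraps to 0.
    shifted = frames[-1:] + frames[:-1]
    return [[(s - f + 1) / 2 for s, f in zip(prev, cur)]
            for prev, cur in zip(shifted, frames)]


moving = [[0.0], [0.5], [1.0]]   # one pixel that brightens over time
static = [[0.3], [0.3], [0.3]]   # one pixel that never changes

moving_residual = residual_clip(moving)   # [[1.0], [0.25], [0.25]]
static_residual = residual_clip(static)   # [[0.5], [0.5], [0.5]]
```

A static region maps to 0.5 everywhere, so only motion stands out. Note that because of the wrap-around, the first output frame is the difference between the last and the first input frame.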

We also reimplement VCP in this repo. By simply using residual clips, significant improvements can be obtained for both video retrieval and video recognition.

Pretrained weights

Pretrained weights from the self-supervised training step: R3D(google drive).

With this model, for video retrieval, you should achieve

  • 33.4% @top1 with --modality=res --merge=False
  • 34.8% @top1 with --modality=rgb --merge=False
  • 36.5% @top1 with --modality=res --merge=True

Finetuned weights for action recognition: R3D(google drive).

With this model, for video recognition, you should achieve 72.7% @top1 with python ft_classify.py --model=r3d --modality=res --mode=test --ckpt=./path/to/model --dataset=ucf101 --split=1. This result is better than that reported in the paper. Results may be further improved with stronger data augmentation.

For any questions, please contact Li TAO ([email protected]).

Results for other network architectures

Results are averaged on 3 splits without using optical flow. R3D and R21D are the same as VCOP / VCP / PRP.

UCF101 top1 top5 top10 top20 top50 Recog
C3D (VCOP) 12.5 29.0 39.0 50.6 66.9 65.6
C3D (VCP) 17.3 31.5 42.0 52.6 67.7 68.5
C3D (PRP) 23.2 38.1 46.0 55.7 68.4 69.1
C3D (ours, repeat) 31.9 48.2 57.3 67.1 79.1 70.0
C3D (ours, shuffle) 28.9 45.4 55.5 66.2 78.8 69.7
R21D (VCOP) 10.7 25.9 35.4 47.3 63.9 72.4
R21D (VCP) 19.9 33.7 42.0 50.5 64.4 66.3
R21D (PRP) 20.3 34.0 41.9 51.7 64.2 72.1
R21D (ours, repeat) 34.7 51.7 60.9 69.4 81.9 72.4
R21D (ours, shuffle) 30.2 45.6 55.0 64.4 77.6 73.3
Res18-3D (ours, repeat) 36.8 54.1 63.1 72.0 83.3 72.4
Res18-3D (ours, shuffle) 33.0 49.2 59.1 69.1 80.6 73.1
HMDB51 top1 top5 top10 top20 top50 Recog
C3D (VCOP) 7.4 22.6 34.4 48.5 70.1 28.4
C3D (VCP) 7.8 23.8 35.3 49.3 71.6 32.5
C3D (PRP) 10.5 27.2 40.4 56.2 75.9 34.5
C3D (ours, repeat) 9.9 29.6 42.0 57.3 78.4 30.8
C3D (ours, shuffle) 11.5 31.3 43.9 60.1 80.3 29.7
R21D (VCOP) 5.7 19.5 30.7 45.6 67.0 30.9
R21D (VCP) 6.7 21.3 32.7 49.2 73.3 32.2
R21D (PRP) 8.2 25.3 36.2 51.0 73.0 35.0
R21D (ours, repeat) 12.7 33.3 45.8 61.6 81.3 34.0
R21D (ours, shuffle) 12.6 31.9 44.2 59.9 80.7 31.2
Res18-3D (ours, repeat) 15.5 34.4 48.9 63.8 83.8 34.3
Res18-3D (ours, shuffle) 12.4 33.6 46.9 63.2 83.5 34.3

Citation

If you find our work helpful for your research, please consider citing the paper

@article{tao2020selfsupervised,
    title={Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework},
    author={Li Tao and Xueting Wang and Toshihiko Yamasaki},
    journal={arXiv preprint arXiv:2008.02531},
    year={2020},
    eprint={2008.02531},
}

If you find the residual input helpful for video-related tasks, please consider citing the paper

@article{tao2020rethinking,
  title={Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2001.05661},
  year={2020}
}

@article{tao2020motion,
  title={Motion Representation Using Residual Frames with 3D CNN},
  author={Tao, Li and Wang, Xueting and Yamasaki, Toshihiko},
  journal={arXiv preprint arXiv:2006.13017},
  year={2020}
}

Acknowledgements

Part of this code is inspired by CMC and VCOP.

Comments
  • Poor finetuning results

    Thanks for your great work. I fine-tuned the pretrained model on UCF101 train split 1, but the evaluation shows only about 6.5% accuracy. I suspected that this was caused by multi-GPU training and the procedure for loading checkpoints, but after changing that the result was the same. The only changes I made to the original code are the dataset path and wrapping the model with torch.nn.DataParallel().

    opened by wj-son 23
  • Training on hmdb51

    Hello Li Tao, thanks for your great work. I am trying to reproduce your work on the HMDB51 dataset. I pre-trained the self-supervised model on UCF101 train split 1, fine-tuned on HMDB trainlist01.txt, and evaluated with testlist01.txt, but I got an accuracy of only 2%. Why?

    opened by yzleroy 8
  • UCF101 action classification result only at 0.68

    Hi, thanks for the good work.

    I'm currently trying to reproduce the action classification results on UCF101. Using the training parameters that you provided, I've trained my own backbone network and the linear classifier. However, I'm only getting 0.68 accuracy with the RGB+Res+Repeat setup. I've also trained a linear classifier on the backbone network that you provided, and again I only get an accuracy of around 0.685.

    Could you please help me with this problem? Is there anything you think that could go wrong? Can you share the weights of your linear classifier network?

    Thanks very much.

    opened by wuchlei 6
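  • Accuracy is 73.5 for both repeat and shuffle

    First of all, I am very interested in your paper and have recently been trying to reproduce it. In the process I noticed something interesting: when reproducing with the GitHub code and changing --neg during pre-training, whether to repeat or shuffle, the best model is always best_model_141.pt, and when fine-tuning for the downstream recognition task the UCF101 acc-avg is 73.5 in both cases. This seems strange. Is something wrong somewhere? I would appreciate your advice.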
    
    opened by Mrbishuai 5
  • loss stuck in multi-gpu

    When training the SSL model in a multi-GPU setting, the loss gets stuck at around 15, but on a single GPU the loss decreases normally. My environment is PyTorch 1.5, CUDA 10.2, GeForce RTX 2080 Ti.

    opened by Zhuysheng 3
  • Backbone not frozen during training of classifier

    I see you didn't freeze the parameters of the backbone during classifier training, which is different from image SSL papers. Is there any reason for this?

    opened by wuchlei 2
  • Training Loss Not Improving

    Hi, I'm trying to reproduce your results by training on ucf101. I noticed that my loss is not improving at all. I trained with a batch size of 20, and the loss is stuck at around 15.2 for every iteration, and does not decrease. Just wondering what's the configuration that you used to train the model?

    opened by sherrychen127 1
  • Runtime Error while running the training script

    I am trying to run the training script like this:

    python train_ssl.py --dataset=ucf101
    
    {'print_freq': 10, 'tb_freq': 500, 'save_freq': 40, 'batch_size': 16, 'num_workers': 8, 'epochs': 240, 'learning_rate': 0.01, 'lr_decay_epochs': '120,160,200', 'lr_decay_rate': 0.1, 'beta1': 0.5, 'beta2': 0.999, 'weight_decay': 0.0001, 'momentum': 0.9, 'resume': '', 'model': 'r3d', 'softmax': True, 'nce_k': 1024, 'nce_t': 0.07, 'nce_m': 0.5, 'feat_dim': 512, 'dataset': 'ucf101', 'model_path': './ckpt/', 'tb_path': './logs/', 'debug': True, 'modality': 'res', 'intra_neg': True, 'neg': 'repeat', 'seed': 632, 'model_name': 'intraneg_r3d_res_1012', 'model_folder': './ckpt/intraneg_r3d_res_1012', 'tb_folder': './logs/intraneg_r3d_res_1012'}
    [Warning] The training modalities are RGB and [res]
    Use split1
    v_LongJump_g18_c03 896] BT 5.066 (5.538)        DT 4.007 (4.468)        loss 15.277 (20.188))   1_p -0.001 (0.014)      2_p -0.009 (-0.007)
    
    
    Traceback (most recent call last):
      File "train_ssl.py", line 341, in <module>
        main()
      File "train_ssl.py", line 308, in main
        criterion_1, criterion_2, optimizer, args)
      File "train_ssl.py", line 166, in train
        for idx, (inputs, u_inputs, v_inputs, _, index) in enumerate(train_loader):
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 363, in __next__
        data = self._next_data()
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 989, in _next_data
        return self._process_data(data)
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1014, in _process_data
        data.reraise()
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 185, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/krishna/anaconda3/envs/py3.7/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/tankpool/home/krishna/PycharmProjects/Inter-intra-video-contrastive-learning/datasets/ucf101.py", line 123, in __getitem__
        raise
    RuntimeError: No active exception to reraise
    

    It always gives me this runtime error. I followed all the steps in the README. Still unable to figure out. Can you please help me out? Thank you!

    opened by krishnachaitanya7 1
  • pytorch 1.7+

    1. Has anyone reproduced the results with PyTorch versions above 1.7 (1.8, 1.9, 1.10, and so on)?
    2. For some reason I have to use CUDA 11. Has anyone reproduced good results with CUDA 11?
    opened by boombung 0
  • Question about pretraining

    Thanks for sharing your great work.

    Just wondering which model you select for the downstream task, since you save pre-training models every 40 epochs.

    Thank you.

    opened by PipiZong 1
  • Have the same accuracy

    Hello Li Tao, thanks for your great work. I'm sorry to bother you, but this problem has been troubling me for a long time. I ran train_ssl.py for pre-training with --neg set to repeat and then ran ft_classify.py to fine-tune. I then repeated the same procedure with --neg set to shuffle, keeping all other settings the same. Why do both runs give the same accuracy and the same best_model? I don't understand this. Can you help me analyze what's wrong?

    opened by Mrbishuai 6
Owner
Li Tao