[NeurIPS'20] Self-supervised Co-Training for Video Representation Learning. Tengda Han, Weidi Xie, Andrew Zisserman.

Tengda Han

Last update: Jan 2, 2023

Overview

CoCLR: Self-supervised Co-Training for Video Representation Learning

This repository contains the implementation of:

InfoNCE (MoCo on videos)
UberNCE (supervised contrastive learning on videos)
CoCLR

Link:

[Project Page] [PDF] [Arxiv]

News

[2021.01.29] Upload both RGB and optical flow dataset for UCF101 (links).
[2021.01.11] Update our paper for NeurIPS2020 final version: corrected InfoNCE-RGB-linearProbe baseline result in Table 1 from 52.3% (pretrained for 800 epochs, unnecessary and unfair) to 46.8% (pretrained for 500 epochs, fair comparison). Thanks @liuhualin333 for pointing this out.
[2020.12.08] Update instructions.
[2020.11.17] Upload pretrained weights for UCF101 experiments.
[2020.10.30] Update "draft" dataloader files, CoCLR code, evaluation code as requested by some researchers. Will check and add detailed instructions later.

Pretrain Instruction

InfoNCE pretrain on UCF101-RGB

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_nce.py --net s3d --model infonce --moco-k 2048 \
--dataset ucf101-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

InfoNCE pretrain on UCF101-Flow

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_nce.py --net s3d --model infonce --moco-k 2048 \
--dataset ucf101-f-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

CoCLR pretrain on UCF101 for one cycle

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 2048 \
--dataset ucf101-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 100 --schedule 80 --name_prefix Cycle1-FlowMining_ -j 8 \
--pretrain {rgb_infoNCE_checkpoint.pth.tar} {flow_infoNCE_checkpoint.pth.tar}

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 2048 --reverse \
--dataset ucf101-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 100 --schedule 80 --name_prefix Cycle1-RGBMining_ -j 8 \
--pretrain {flow_infoNCE_checkpoint.pth.tar} {rgb_cycle1_checkpoint.pth.tar}

InfoNCE pretrain on K400-RGB

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_nce.py --net s3d --model infonce --moco-k 16384 \
--dataset k400-2clip --lr 1e-3 --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

InfoNCE pretrain on K400-Flow

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
--nproc_per_node=4 main_nce.py --net s3d --model infonce --moco-k 16384 \
--dataset k400-f-2clip --lr 1e-3 --seq_len 32 --ds 1 --batch_size 32 \
--epochs 300 --schedule 250 280 -j 16

CoCLR pretrain on K400 for one cycle

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 16384 \
--dataset k400-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 50 --schedule 40 --name_prefix Cycle1-FlowMining_ -j 8 \
--pretrain {rgb_infoNCE_checkpoint.pth.tar} {flow_infoNCE_checkpoint.pth.tar}

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 16384 --reverse \
--dataset k400-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 \
--epochs 50 --schedule 40 --name_prefix Cycle1-RGBMining_ -j 8 \
--pretrain {flow_infoNCE_checkpoint.pth.tar} {rgb_cycle1_checkpoint.pth.tar}

Finetune Instruction

cd eval/ e.g. finetune UCF101-rgb:

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what ft --epochs 500 --schedule 400 450 \
--pretrain {selected_rgb_pretrained_checkpoint.pth.tar}

Then run the test with 10-crop (test-time augmentation helps; 10-crop gives a better result than center-crop):

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what ft --epochs 500 --schedule 400 450 \
--test {selected_rgb_finetuned_checkpoint.pth.tar} --ten_crop

Nearest-neighbour Retrieval Instruction

cd eval/ e.g. nn-retrieval for UCF101-rgb

CUDA_VISIBLE_DEVICES=0 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --test {selected_rgb_pretrained_checkpoint.pth.tar} --retrieval
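
As a rough illustration of what an NN@1 figure measures, the sketch below scores nearest-neighbour retrieval on toy features. It assumes cosine similarity and a query/gallery split; the function names (`l2_normalize`, `nn_at_1`) and the toy data are hypothetical and not taken from main_classifier.py.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def nn_at_1(query_feats, query_labels, gallery_feats, gallery_labels):
    """Fraction of queries whose most similar gallery item has the same label.

    Assumption: similarity is cosine similarity between feature vectors.
    """
    gallery = [l2_normalize(g) for g in gallery_feats]
    hits = 0
    for feat, label in zip(query_feats, query_labels):
        q = l2_normalize(feat)
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += int(gallery_labels[best] == label)
    return hits / len(query_feats)


# Toy data: two gallery videos, three query videos (hypothetical features).
gallery_feats = [[1.0, 0.0], [0.0, 1.0]]
gallery_labels = [0, 1]
query_feats = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
query_labels = [0, 1, 1]

score = nn_at_1(query_feats, query_labels, gallery_feats, gallery_labels)
print(f"NN@1 = {score:.3f}")
```

In this toy case the third query retrieves a gallery item of the wrong class, so two of three queries count as hits.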

Linear-probe Instruction

cd eval/

from extracted feature

The code supports two methods for linear probing: either feed the data end-to-end and freeze the backbone, or train a linear layer on extracted features. Both methods give similar best results in our experiments.

e.g. on extracted features (after running the NN-retrieval command above, features will be saved in os.path.dirname(checkpoint))

CUDA_VISIBLE_DEVICES=0 python feature_linear_probe.py --dataset ucf101 \
--test {feature_dirname} --final_bn --lr 1.0 --wd 1e-3

Note that the default setting should give reasonable performance, perhaps 1-2% lower than the figure in our paper. For different datasets, lr and wd need to be tuned, with lr from 0.1 to 1.0 and wd from 1e-4 to 1e-1.
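
One way to organise that tuning is to generate one command per (lr, wd) pair. The sketch below only builds command strings from the flags shown above; the grid values, the helper name `probe_commands`, and the feature directory `features/ucf101-rgb` are illustrative assumptions.

```python
import itertools
import shlex


def probe_commands(feature_dirname, lrs, wds, dataset="ucf101"):
    """Build one feature_linear_probe.py command string per (lr, wd) pair."""
    commands = []
    for lr, wd in itertools.product(lrs, wds):
        args = [
            "python", "feature_linear_probe.py",
            "--dataset", dataset,
            "--test", feature_dirname,
            "--final_bn",
            "--lr", str(lr),
            "--wd", str(wd),
        ]
        # shlex.quote keeps paths with spaces or special characters intact.
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


# Hypothetical grid inside the ranges quoted above.
commands = probe_commands(
    "features/ucf101-rgb",          # hypothetical feature directory
    lrs=[0.1, 0.3, 1.0],
    wds=[1e-4, 1e-3, 1e-2, 1e-1],
)
for cmd in commands:
    print(cmd)
```

Each printed line can be run by hand or from a job script; prefix it with CUDA_VISIBLE_DEVICES as in the command above.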

load data and freeze backbone

Alternatively, feed data end-to-end and freeze the backbone.

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --net s3d --dataset ucf101 \
--seq_len 32 --ds 1 --batch_size 32 --train_what last --epochs 100 --schedule 60 80 \
--optim sgd --lr 1e-1 --wd 1e-3 --final_bn --pretrain {selected_rgb_pretrained_checkpoint.pth.tar}

Similarly, lr and wd need to be tuned for different datasets for best performance.

Dataset

RGB for UCF101: [download-from-server] [download-from-gdrive] (tar file, 29GB, packed with lmdb)
TVL1 optical flow for UCF101: [download-from-server] [download-from-gdrive] (tar file, 20.5GB, packed with lmdb)
Note: I created these lmdb files with msgpack==0.6.2. When loading them with msgpack>=1.0.0, you can use msgpack.loads(raw_data, raw=True) (issue#32).
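
With raw=True, msgpack returns string fields as bytes rather than str, so downstream code that calls .decode() (or that expects str) has to cope with both. A minimal sketch of a tolerant helper follows; the helper names and the example keys are hypothetical, not taken from lmdb_dataset.py.

```python
def to_text(value, encoding="utf-8"):
    """Return str for both bytes and str inputs; other types pass through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def decode_keys(raw_keys):
    """Normalise a list of lmdb keys to str regardless of how they were unpacked."""
    return [to_text(k) for k in raw_keys]


# Keys as they might look when unpacked with raw=True (bytes) ...
keys_as_bytes = [b"v_ApplyEyeMakeup_g01_c01", b"v_Archery_g02_c03"]
# ... and when unpacked as str.
keys_as_str = ["v_ApplyEyeMakeup_g01_c01", "v_Archery_g02_c03"]

print(decode_keys(keys_as_bytes))
print(decode_keys(keys_as_str))
```

Calling .decode() directly on a str is what raises AttributeError: 'str' object has no attribute 'decode'; checking the type first avoids that.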

Result

Finetune entire network for action classification on UCF101:

Pretrained Weights

Our models:

UCF101-RGB-CoCLR: [download] [NN@1=51.8 on UCF101-RGB]
UCF101-Flow-CoCLR: [download] [NN@1=48.4 on UCF101-Flow]

Baseline models:

UCF101-RGB-InfoNCE: [download] [NN@1=33.1 on UCF101-RGB]
UCF101-Flow-InfoNCE: [download] [NN@1=45.2 on UCF101-Flow]

Kinetics400-pretrained models:

K400-RGB-CoCLR: [download] [NN@1=45.6, Finetune-Acc@1=87.89 on UCF101-RGB]
K400-Flow-CoCLR: [download] [NN@1=44.4, Finetune-Acc@1=85.27 on UCF101-Flow]
Two-stream result by averaging the class probabilities: 0.8789 (RGB) + 0.8527 (Flow) => 0.9061
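
The averaging step can be sketched as follows for a single video: convert each stream's class scores to probabilities, average them, and take the argmax. The logits are made-up numbers and `fuse_two_stream` is a hypothetical helper, not code from this repository.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(rgb_logits, flow_logits):
    """Average RGB and Flow class probabilities; return (predicted class, probabilities)."""
    rgb_p = softmax(rgb_logits)
    flow_p = softmax(flow_logits)
    avg = [(r + f) / 2 for r, f in zip(rgb_p, flow_p)]
    pred = max(range(len(avg)), key=avg.__getitem__)
    return pred, avg


# Toy 3-class example: RGB narrowly prefers class 0, Flow clearly prefers class 1.
rgb_logits = [2.0, 1.9, 0.1]
flow_logits = [0.5, 3.0, 0.2]

pred, probs = fuse_two_stream(rgb_logits, flow_logits)
print(pred, [round(p, 3) for p in probs])
```

Here the confident Flow prediction outweighs the uncertain RGB one, which is the kind of case where fusing two streams can beat either stream alone.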

Comments

about Initialization & Alternation
Initialization -> use of the pretrained InfoNCE checkpoint.pth.tar

CUDA_VISIBLE_DEVICES=2,3 python -m torch.distributed.launch --nproc_per_node=2 main_coclr.py --net s3d --topk 5 --moco-k 2048 --dataset ucf101-2stream-2clip --seq_len 32 --ds 1 --batch_size 32 --epochs 100 --schedule 80 --name_prefix Cycle1-FlowMining_ -j 4 --pretrain /mypath/CoCLR/pretrained_by_TH/InfoNCE-ucf101-rgb-128-s3d-ep399.pth.tar /mypath/CoCLR/pretrained_by_TH/InfoNCE-ucf101-f-128-s3d-ep396.pth.tar

If I type it like this, does it perform the initialization? When I do that, these lines are printed out:

=======Check Weights Loading======
Weights not used from pretrained file:

Weights not loaded into new model: queue queue_ptr queue_second queue_vname queue_label

Why are the weights of the pretrained model not used?
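
A report of this shape is typically produced by comparing the parameter names in the checkpoint with those in the new model. The sketch below shows that comparison on plain lists of names; it is an assumption about how such a check works in general, not the repository's code, and the `encoder.*` names are hypothetical.

```python
def check_weight_loading(model_keys, checkpoint_keys):
    """Compare parameter names and report mismatches in both directions.

    Returns (not_used, not_loaded):
      not_used   - names in the checkpoint that the model has no slot for
      not_loaded - names in the model that the checkpoint does not provide
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    not_used = sorted(checkpoint_keys - model_keys)
    not_loaded = sorted(model_keys - checkpoint_keys)
    return not_used, not_loaded


# Hypothetical names: the model has backbone weights plus queue buffers,
# while the loaded weights cover only the backbone.
model_keys = [
    "encoder.conv1.weight", "encoder.fc.weight",
    "queue", "queue_ptr", "queue_second", "queue_vname", "queue_label",
]
checkpoint_keys = ["encoder.conv1.weight", "encoder.fc.weight"]

not_used, not_loaded = check_weight_loading(model_keys, checkpoint_keys)
print("Weights not used from pretrained file:", not_used)
print("Weights not loaded into new model:", not_loaded)
```

In this sketch an empty first list means every tensor in the checkpoint found a matching slot, and the second list names only model entries the checkpoint did not supply.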

Alternation : In your paper, "where each cycle refers to a complete optimization of L1 and L2; meaning, the alternation only happens after the RGB or Flow network has converged."

So I entered the command above into the terminal, and now I'm training (i.e. Cycle 1 FlowMining). But acc@1 and acc@5 don't go over 1. Is this the right value to have, or is something wrong?

++ Additional: if something's wrong, there's one thing I'm concerned about. In lmdb_dataset.py, I got an error for i.decode():

AttributeError: 'str' object has no attribute 'decode'

To fix this, I did the following:

self.db_keys_flow = msgpack.loads(txn.get(b'keys'), raw=True)
self.db_order_flow = msgpack.loads(txn.get(b'order'), raw=True)
. .
self.db_order_rgb = msgpack.unpackb(txn.get(b'order'), raw=True)
. .
raw_rgb = msgpack.loads(txn.get(self.get_video_id_rgb[vname].encode('ascii')), raw=True)
raw_flow = msgpack.loads(txn.get(self.get_video_id_flow[vname].encode('ascii')), raw=True)

I added "raw=True"; could this be causing the problem?
opened by junmin98 20
Question about reproducing CoCLR results
Hi Tengda,

I am currently trying to replicate your CoCLR result as one of the baselines in our work with the code you provide. However, I encounter some reproduction issues during the training. I understand that the code is not ready yet. It would be much appreciated if you could help us with replication. Thank you so much!

I found out that the Top 1 MoCo accuracy is quite low (only 4-5 percent in UCF101) with 1e-3 lr and Adam Optimizer, 1e-5 weight decay, 2048 moco queue size and 128 batch size. I wonder if you could provide a detailed training command for our reference.

The augmentation is not really clip-wise consistent since the value passed in is false. I wonder if this version is not final version. Could you provide the correct version of the augmentation you use?

Currently the code for data loader is not released and I don't know how input is prepared in data loader to be passed to TwoCropTransform and OneCropTransform. Could you please share the data loader code for our better replication?

Best Regards, Hualin
opened by liuhualin333 17
Questions on reproducing training from scratch (77% in Table 1)
Hi Tengda, thanks for these detailed answers. I looked into all of them, seems no detailed training instruction is given on using main_classifier.py to train from scratch. The thing is, I train on UCF101 with rgb from scratch, after 500 epochs, the reported validation accuracy is 46.1%, while in test set it is only 3.41% (center crop only, top1). The detailed commands are as below:

-- Training

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --train_what all --epoch 500 --batch_size 24 --lr 1e-3 --wd 1e-3 --dropout 0.9 --schedule [60, 80]

-- Testing

CUDA_VISIBLE_DEVICES=0,1 python main_classifier.py --test epochxxx.pth --ten_crop

Would you mind having a quick look and helping me figure out which configuration I got wrong? Although my computational resources are limited, it is hard to understand why there is such a big gap between validation accuracy and test accuracy. My sincere appreciation.
opened by June01 8
Kinetics-400 dataset

Hi, I'm new to the Kinetics-400 dataset. Can you provide a tutorial or instructions on how to generate the lmdb for Kinetics-400? I found some useful information in the non-local repository but am not sure it's the proper way, thx ~

opened by JiaxinZhuang 8
lmdb_dataset.py txn.get(self.get_video_id[name].encode('ascii')))

When I ran the program, I found "txn.get(self.get_video_id[name].encode('ascii')))" severely limiting the speed. Meanwhile, CPU is free. I don't know what the problem is. Hope to get help. Thanks

opened by xiaochehe 6
cannot access your train/val split csv

Hi, I am trying to download your train/val split csv file here: https://github.com/TengdaHan/CoCLR/tree/main/process_data/data/k400

But it says 403 forbidden. I believe you put it in some internal-only storage?

opened by thematrixduo 6
Simple question on video classification of self-supervised learning and full-supervision methods

Hi Tengda,

Thanks for the detailed instructions for this code. I am a newbie in this field, have a very simple question regarding Table 2, and am in desperate need of your help. Thanks very much in advance!

Question: From what I understand, self-supervised learning can be used to learn essential video representations. So I guess that with weights learnt by self-supervised learning methods, training the S3D network on UCF-101 will yield better results than training with random initialization. From Table 2, I suppose 90.6 is the former, and 96.8 is the latter. Could you explain a bit why there is such a gap?

opened by June01 6
questions of training details of coclr

Hi, I'm trying to replicate your result on the alternation stage. I now use the two init models you provided (both ~400 epochs). I have two questions.

1). According to your paper, "At the alternation stage, on UCF101 the model is trained for two cycles, where each cycle includes 200 epochs, i.e. RGB and Flow networks are each trained for 100 epochs". Does that mean I need to run main_coclr.py four times, each time for 100 epochs and with the newest two pretrained models from the previous training process?

2). If so, what lr do you use in each of the four 100-epoch runs in the alternation stage? I also checked the CoCLR pretrained models you provided; it seems that at epoch 182 and epoch 109 the lr is 1e-4. Does that mean I need to train the second cycle with a larger lr, e.g. 1e-2, and decay down to 1e-4?

Best Regards, Yuqi

opened by YuqiHUO 6
data preparation step

Hi Tengda, thanks for sharing your code. I found CoCLR has no data preparation instruction and could you please provide some details about data preprocessing from the raw data? I found similar instruction in DPC and MemDPC, are they feasible for CoCLR?

opened by justlovebarbecue 5
is it possible to train main_coclr.py using single GPU?

I have only one gpu.

I wanted to train, so I entered the terminal as follows: CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 main_coclr.py

but i got an error: subprocess.CalledProcessError: Command '['/home/junmin/anaconda3/envs/python36/bin/python', '-u', 'main_coclr.py', '--local_rank=0']' returned non-zero exit status 1.

Is there any way to train with a single GPU?

opened by junmin98 5
CoCLR using only RGB frame(1 stream)

It took too long to extract the flow, so I am trying to train coclr using only rgb frames(1-stream). Is that possible?

I think it's possible when you see this part in Table 2 of your paper

If possible, can you tell me how to train using only rgb frames?

ps. Thanks for always answering in detail!

opened by junmin98 4
main_classifier.py: error: unrecognized arguments: --final_bn

When I try:

CUDA_VISIBLE_DEVICES=0,1,2 python main_classifier.py --net s3d --dataset ucf101 --seq_len 32 --ds 1 --batch_size 32 --train_what last --epochs 30 --schedule 60 80 --optim sgd --lr 1e-1 --wd 1e-3 --final_bn --pretrain CoCLR-ucf101-rgb-128-s3d-ep182.pth

Out:

usage: main_classifier.py [-h] [--net NET] [--model MODEL] [--dataset DATASET] [--which_split WHICH_SPLIT] [--seq_len SEQ_LEN] [--num_seq NUM_SEQ] [--num_fc NUM_FC] [--ds DS] [--batch_size BATCH_SIZE] [--optim OPTIM] [--lr LR] [--schedule [SCHEDULE [SCHEDULE ...]]] [--wd WD] [--dropout DROPOUT] [--epochs EPOCHS] [--start_epoch START_EPOCH] [--gpu GPU] [--train_what TRAIN_WHAT] [--img_dim IMG_DIM] [--print_freq PRINT_FREQ] [--eval_freq EVAL_FREQ] [--reset_lr] [--prefix PREFIX] [-j WORKERS] [--cos] [--resume RESUME] [--pretrain PRETRAIN] [--test TEST] [--retrieval] [--dirname DIRNAME] [--center_crop] [--five_crop] [--ten_crop]
main_classifier.py: error: unrecognized arguments: --final_bn

May I ask how to use the --final_bn command-line option? After the above error occurred, I deleted --final_bn. Although it then runs normally, it shows:

Weights not loaded into new model: final_bn.weight final_bn.bias final_bn.running_mean final_bn.running_var final_bn.num_batches_tracked final_fc.0.weight final_fc.0.bias

Thanks

opened by wys2929 0
Two-stream feature

How can I get the two-stream feature? Can the RGB pretrained model and the flow model be used to extract the two-stream feature? What command should I run?

opened by DoublePan-Oh 0

