DualFormer

This repo is the implementation of our manuscript entitled "Local-Global Stratified Transformer for Efficient Video Recognition". Our model is built on a popular video package called mmaction2. This repo also refers to the code templates provided by PVT, Twins and Swin. This repo is released under the Apache 2.0 license.

Introduction

DualFormer is a Transformer architecture that performs space-time attention for video recognition both effectively and efficiently. Specifically, DualFormer stratifies full space-time attention into two cascaded levels: it first learns fine-grained local space-time interactions among nearby 3D tokens, and then captures coarse-grained global dependencies between each query token and the global pyramid contexts. Experimental results on five video benchmarks show that DualFormer outperforms existing methods. In particular, DualFormer sets new state-of-the-art top-1 accuracies of 82.9% and 85.2% on Kinetics-400 and Kinetics-600, respectively, with ∼1000G inference FLOPs, which is at least 3.2× fewer than existing methods with similar performance.
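To give a rough feel for why stratifying attention saves compute, the sketch below counts query-key pairs for full space-time attention versus a local-window level plus a global-context level. It is only an illustration under stated assumptions: the patch size (2, 4, 4) and window size (8, 7, 7) are our reading of the `patch244` and `window877` parts of the config names, the 32-frame 224×224 input is assumed, only the first stage is considered, and the number of global contexts (100) is a hypothetical value, not a figure from the paper.

```python
from math import prod


def token_grid(frames, height, width, patch=(2, 4, 4)):
    """Number of tokens along (time, height, width) after patch embedding."""
    t, h, w = patch
    return (frames // t, height // h, width // w)


def full_attention_pairs(grid):
    """Query-key pairs when every token attends to every other token."""
    n = prod(grid)
    return n * n


def local_window_pairs(grid, window=(8, 7, 7)):
    """Query-key pairs when each token attends only within its own 3D window."""
    return prod(grid) * prod(window)


def global_context_pairs(grid, num_contexts):
    """Query-key pairs when each token attends to a small set of pooled contexts."""
    return prod(grid) * num_contexts


# Assumed input clip: 32 frames at 224 x 224 (not stated in this README).
grid = token_grid(frames=32, height=224, width=224)
full = full_attention_pairs(grid)
local = local_window_pairs(grid)
glob = global_context_pairs(grid, num_contexts=100)  # hypothetical context count

print("token grid:", grid)
print("full attention pairs:", full)
print("local + global pairs:", local + glob)
print("reduction factor:", full / (local + glob))
```

The pair count is only a proxy for attention cost; the GFLOPs reported in the Models section also include projections, MLPs, and the later stages.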

Installation & Requirement

Please refer to install.md for installation. Docker files are also provided for convenience: cuda10.1 and cuda11.0.

All models are trained on 8 Nvidia A100 GPUs. For example, training a DualFormer-T on Kinetics-400 takes ∼31 hours on 8 A100 GPUs, while training a larger model DualFormer-B on Kinetics-400 requires ∼3 days on 8 A100 GPUs.

Data Preparation

Please first see data_preparation.md for an overview of data preparation.

For Kinetics-400/600, as these are dynamic datasets (videos may be removed from YouTube), we use this repo to download the original files and the annotations. Only a small number of corrupted videos (around 50) are removed.
For the other datasets, i.e., HMDB-51, UCF-101 and Diving-48, we use the data downloader provided by mmaction2, as mentioned above.

The full list of supported datasets is given below (more details in supported_datasets.md):

HMDB51 (Homepage) (ICCV'2011)	UCF101 (Homepage) (CRCV-IR-12-01)	ActivityNet (Homepage) (CVPR'2015)	Kinetics-[400/600/700] (Homepage) (CVPR'2017)
SthV1 (Homepage) (ICCV'2017)	SthV2 (Homepage) (ICCV'2017)	Diving48 (Homepage) (ECCV'2018)	Jester (Homepage) (ICCV'2019)
Moments in Time (Homepage) (TPAMI'2019)	Multi-Moments in Time (Homepage) (ArXiv'2019)	HVU (Homepage) (ECCV'2020)	OmniSource (Homepage) (ECCV'2020)

Models

We present the main model results, the configuration files, and the download links in the following table. FLOPs are computed with fvcore; we omit the classification head since it has little impact on the FLOPs.

Dataset	Version	Pretrain	GFLOPs	Param (M)	Top-1	Config	Download
K400	Tiny	IN-1K	240	21.8	79.5	link	link
K400	Small	IN-1K	636	48.9	80.6	link	link
K400	Base	IN-1K	1072	86.8	81.1	link	link
K600	Base	IN-22K	1072	86.8	85.2	link	link
Diving-48	Small	K400	1908	48.9	81.8	link	link
HMDB-51	Small	K400	1908	48.9	76.4	link	link
UCF-101	Small	K400	1908	48.9	97.5	link	link

Visualization

We visualize the attention maps at the last layer of our model generated by Grad-CAM on Kinetics-400. As shown in the following three gifs, our model successfully learns to focus on the relevant parts in the video clip. Left: flying kites. Middle: counting money. Right: walking dogs.

You can use the following command to visualize the attention weights:

python demo/demo_gradcam.py ${CONFIG_FILE} ${CHECKPOINT_FILE} ${VIDEO_FILE} \
    --target-layer-name ${TARGET_LAYER_NAME} \
    --out-filename ${OUT_FILE}

For example, to visualize the last layer of DualFormer-S on a K400 video (-cii-Z0dW2E_000020_000030.mp4), please run:

python demo/demo_gradcam.py \
    configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py \
    checkpoints/k400/dualformer_small_patch244_window877.pth \
    /dataset/kinetics-400/train_files/-cii-Z0dW2E_000020_000030.mp4 \
    --target-layer-name backbone/blocks/3/3 --fps 10 \
    --out-filename output/-cii-Z0dW2E_000020_000030.gif
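The value passed to `--target-layer-name` is a slash-separated path into the model. The sketch below shows how such a path can be resolved to a nested submodule. It is a minimal stand-in written with plain Python objects, assuming the lookup walks the `_modules` mapping of each module in turn (as `torch.nn.Module` exposes it); the `Node` class and the toy hierarchy are hypothetical and are not part of this repo or of mmaction2.

```python
class Node:
    """Stand-in for a torch.nn.Module: named children live in `_modules`."""

    def __init__(self, name, children=None):
        self.name = name
        self._modules = dict(children or {})


def resolve_layer(model, layer_name):
    """Follow a slash-separated path such as 'backbone/blocks/3/3'."""
    module = model
    for key in layer_name.split("/"):
        # Each path component is the name of a direct child module.
        module = module._modules[key]
    return module


# Toy hierarchy mirroring the path used in the example command.
last_block = Node("target block")
stage = Node("stage", {"3": last_block})
blocks = Node("blocks", {"3": stage})
backbone = Node("backbone", {"blocks": blocks})
model = Node("recognizer", {"backbone": backbone})

target = resolve_layer(model, "backbone/blocks/3/3")
print(target.name)
```

A misspelled component raises `KeyError` in this sketch, which is a convenient way to check a layer path before running a long visualization job.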

User Guide

Folder Structure

As our implementation is based on mmaction2, we specify our contributions as follows:

Data preparation and preprocessing are located in base.py and video_dataset.py. You can find the data augmentation details in the model config.
The source code of backbone: click here.
The source code of classification head: click here.
The training/test code without Token Labelling and with Token Labelling

Testing

# single-gpu testing
python tools/test.py ${CONFIG_FILE} ${CHECKPOINT_FILE} --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh ${CONFIG_FILE} ${CHECKPOINT_FILE} ${GPU_NUM} --eval top_k_accuracy
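For readers unfamiliar with the metric requested by `--eval top_k_accuracy`, the sketch below computes it on toy data. It assumes one score vector per video (any averaging over clips or crops has already happened) and is not the mmaction2 implementation; the scores and labels are made up for illustration.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 4 videos and 3 classes (illustrative values only).
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```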

Example 1: to validate a DualFormer-T model on Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py checkpoints/k400/dualformer_tiny_patch244_window877.pth 8 --eval top_k_accuracy

You will obtain the result as follows:

Example 2: to validate a DualFormer-S model on Diving-48 dataset with 4 GPUs, please run:

bash tools/dist_test.sh configs/recognition/dualformer/dualformer_small_patch244_window877_diving48.py checkpoints/diving48/dualformer_small_patch244_window877.pth 4 --eval top_k_accuracy

The output will be as follows:

Training from scratch

To train a video recognition model from scratch for Kinetics-400, please run:

# single-gpu training
python tools/train.py ${CONFIG_FILE} [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} [other optional arguments]

For example, to train a DualFormer-T model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8

To train a DualFormer-S model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_small_patch244_window877_kinetics400_1k.py 8

Training with pre-trained 2D models

To train a video recognition model with pre-trained image models, please run:

# single-gpu training
python tools/train.py ${CONFIG_FILE} --cfg-options model.backbone.pretrained=${PRETRAINED_CHECKPOINT} [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh ${CONFIG_FILE} ${GPU_NUM} --cfg-options model.backbone.pretrained=${PRETRAINED_CHECKPOINT} [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a DualFormer-T model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=

To train a DualFormer-B model for Kinetics-400 dataset with 8 GPUs, please run:

bash tools/dist_train.sh ./configs/recognition/dualformer/dualformer_base_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.
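The `--cfg-options` flag overrides nested config entries through dotted keys such as `model.backbone.pretrained`. The sketch below illustrates the idea on a plain dictionary. It is a simplified stand-in, not the mmcv implementation (which also parses value types from the command line); the helper name `apply_override` and the checkpoint path are hypothetical.

```python
import copy


def apply_override(cfg, dotted_key, value):
    """Return a copy of `cfg` with the nested entry at `dotted_key` set to `value`."""
    result = copy.deepcopy(cfg)
    node = result
    *parents, leaf = dotted_key.split(".")
    for key in parents:
        # Descend one level, creating intermediate dicts when missing.
        node = node.setdefault(key, {})
    node[leaf] = value
    return result


base_cfg = {"model": {"backbone": {"pretrained": None, "use_checkpoint": False}}}

# Hypothetical path to a 2D pre-trained checkpoint.
cfg = apply_override(base_cfg, "model.backbone.pretrained", "path/to/2d_weights.pth")
cfg = apply_override(cfg, "model.backbone.use_checkpoint", True)
print(cfg)
```

Because the helper works on a deep copy, the base config stays unchanged, which mirrors the way command-line overrides leave the config file on disk untouched.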

Training with Token Labelling

We also present the first attempt to improve video recognition models by generalizing Token Labelling to videos as an additional augmentation, in which MixToken is turned off as it does not work on our video datasets. For instance, to train a small version of DualFormer (DualFormer-T in the command below) using DualFormer-B as the annotation model on the fly, please run:

bash tools/dist_train.sh configs/recognition/dualformer/dualformer_tiny_tokenlabel_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained='checkpoints/pretrained_2d/dualformer_tiny.pth' --validate

Notice that we place the checkpoint of the annotation model at 'checkpoints/k400/dualformer_base_patch244_window877.pth'. You can store it anywhere you want by modifying the path variable in this file.

We present two examples of visualization of token labelling on video data. For simplicity, we omit several frames, so each example shows only 5 uniformly sampled frames. For each frame, each value p(i,j) on the left-hand side is the pseudo label (class index) that the annotation model provides for the corresponding patch of the last stage.

Visualization example 1 (Correct label: pushing cart, index: 262).
Visualization example 2 (Correct label: dribbling basketball, index: 99).
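The per-patch pseudo labels p(i,j) described above can be thought of as the class index with the highest annotation-model score at each patch. The sketch below shows that mapping on toy data. It assumes a hard argmax over per-patch class scores, which matches the index-style visualization but may differ from how the training loss consumes the annotation model's output; the function name and the scores are hypothetical.

```python
def patch_pseudo_labels(patch_scores):
    """Map a grid of per-patch class scores to a grid of class indices p(i, j)."""
    return [
        [max(range(len(scores)), key=scores.__getitem__) for scores in row]
        for row in patch_scores
    ]


# One frame with 2 x 2 patches and 3 classes (toy values, not real model output).
frame_scores = [
    [[0.1, 0.8, 0.1], [0.2, 0.7, 0.1]],
    [[0.6, 0.3, 0.1], [0.1, 0.2, 0.7]],
]

pseudo = patch_pseudo_labels(frame_scores)
for row in pseudo:
    print(row)
```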

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liang2021dualformer,
         title={DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition}, 
         author={Yuxuan Liang and Pan Zhou and Roger Zimmermann and Shuicheng Yan},
         year={2021},
         journal={arXiv preprint arXiv:2112.04674},
}

Acknowledgement

We would like to thank the authors of the following helpful codebases:

Video Swin Transformer for video recognition.
Swin Transformer: the best paper award at ICCV 2021.
Twins-SVT for image processing.
Pyramid Vision Transformer for image processing.

Please consider starring these related packages as well. Thank you very much for your attention.

Owner

Sea AI Lab
