XViT - Space-time Mixing Attention for Video Transformer
This is the official implementation of the XViT paper:
@inproceedings{bulat2021space,
title={Space-time Mixing Attention for Video Transformer},
author={Bulat, Adrian and Perez-Rua, Juan-Manuel and Sudhakaran, Swathikiran and Martinez, Brais and Tzimiropoulos, Georgios},
booktitle={NeurIPS},
year={2021}
}
In XViT, we introduce a novel Video Transformer model whose complexity scales linearly with the number of frames in the video sequence and hence induces no overhead compared to an image-based Transformer model. To achieve this, our model makes two approximations to the full space-time attention used in Video Transformers: (a) it restricts time attention to a local temporal window and capitalizes on the Transformer's depth to obtain full temporal coverage of the video sequence; (b) it uses efficient space-time mixing to attend jointly to spatial and temporal locations without inducing any additional cost on top of a spatial-only attention model. We also show how to integrate two very lightweight mechanisms for global temporal-only attention, which provide additional accuracy improvements at minimal computational cost. Our model achieves very high recognition accuracy on the most popular video recognition datasets while being significantly more efficient than other Video Transformer models.
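The space-time mixing idea can be illustrated with a short PyTorch sketch. This is a minimal illustration, not the repo's exact implementation: a fraction of the key/value channels is taken from the previous and next frame (a local temporal window of 3) before a purely spatial attention, so each query attends to mixed space-time information at the cost of spatial-only attention. The tensor layout, window size, and 25% shift ratio below are assumptions made for the example.
import torch
import torch.nn.functional as F

def temporal_channel_shift(x, num_frames, shift_ratio=0.25):
    # x: (batch * num_frames, num_tokens, dim); mix a fraction of the channels
    # from the previous / next frame (local temporal window of size 3).
    bt, n, d = x.shape
    x = x.view(bt // num_frames, num_frames, n, d)
    c = int(d * shift_ratio) // 2
    out = x.clone()
    out[:, 1:, :, :c] = x[:, :-1, :, :c]            # channels from frame t-1
    out[:, :-1, :, c:2 * c] = x[:, 1:, :, c:2 * c]  # channels from frame t+1
    return out.view(bt, n, d)

def space_time_mixing_attention(q, k, v, num_frames):
    # Spatial-only attention computed over temporally mixed keys/values.
    k = temporal_channel_shift(k, num_frames)
    v = temporal_channel_shift(v, num_frames)
    attn = (q @ k.transpose(-2, -1)) / (q.shape[-1] ** 0.5)
    return F.softmax(attn, dim=-1) @ v

# Toy usage: 2 clips of 8 frames, 197 tokens (196 patches + cls), dim 768.
q = torch.randn(2 * 8, 197, 768)
k, v = torch.randn_like(q), torch.randn_like(q)
print(space_time_mixing_attention(q, k, v, num_frames=8).shape)  # (16, 197, 768)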
Model Zoo
We provide a series of models pre-trained on Kinetics-600 and Something-Something-v2.
Kinetics-600
Architecture | frames | views | Top-1 | Top-5 | url |
---|---|---|---|---|---|
XViT-B16 | 16 | 3x1 | 84.51% | 96.26% | model |
XViT-B16 | 16 | 3x2 | 84.71% | 96.39% | model |
Something-Something-V2
Architecture | frames | views | Top-1 | Top-5 | url |
---|---|---|---|---|---|
XViT-B16 | 16 | 32x2 | 67.19% | 91.00% | model |
Installation
Please make sure your setup satisfies the following requirements:
Requirements
The requirements largely follow those of the original SlowFast repo:
- Python >= 3.8
- Numpy
- PyTorch >= 1.3
- hdf5
- fvcore:
pip install 'git+https://github.com/facebookresearch/fvcore'
- torchvision that matches the PyTorch installation. You can install them together at pytorch.org to make sure of this.
- simplejson:
pip install simplejson
- GCC >= 4.9
- PyAV:
conda install av -c conda-forge
- ffmpeg (4.0 is preferred; will be installed along with PyAV)
- PyYaml: (will be installed along with fvcore)
- tqdm: (will be installed along with fvcore)
- iopath:
pip install -U iopath
or conda install -c iopath iopath
- psutil:
pip install psutil
- OpenCV:
pip install opencv-python
- torchvision:
pip install torchvision
or conda install torchvision -c pytorch
- tensorboard:
pip install tensorboard
- PyTorchVideo:
pip install pytorchvideo
- Detectron2:
pip install -U torch torchvision cython
pip install -U 'git+https://github.com/facebookresearch/fvcore.git' 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
git clone https://github.com/facebookresearch/detectron2 detectron2_repo
pip install -e detectron2_repo
# You can find more details at https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md
Datasets
1. Kinetics
You can download the Kinetics 400/600 datasets following the instructions provided in the cvdfoundation repo: https://github.com/cvdfoundation/kinetics-dataset
Afterwards, resize the videos so that the short edge is 256 pixels and prepare the csv files for training, validation, and testing: train.csv, val.csv, test.csv. The format of the csv files is:
path_to_video_1 label_1
path_to_video_2 label_2
...
path_to_video_N label_N
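If it helps, the csv files can be generated with a few lines of Python. The space-separated "path label" layout simply mirrors the format shown above; the paths and labels in the usage line are placeholders:
# Minimal helper to write a split file in the "path label" format above.
def write_split(csv_path, samples):
    # samples: iterable of (path_to_video, integer_label) pairs
    with open(csv_path, "w") as f:
        for video_path, label in samples:
            f.write(f"{video_path} {label}\n")

write_split("train.csv", [("/data/kinetics600/abc123.h5", 42),
                          ("/data/kinetics600/def456.h5", 7)])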
Depending on your system, we recommend decoding the videos to frames and then packing each set of frames into an h5 file with the same name as the original video, e.g. as in the sketch below.
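One possible way to do this packing is sketched here (an illustrative script, not the repo's official tooling): ffmpeg decodes the video to JPEG frames resized to a short edge of 256, and h5py stores the encoded frames in an h5 file named after the original video. The dataset key "video" inside the h5 file is an assumption; adapt it to whatever your data loader expects.
import subprocess
from pathlib import Path

import h5py
import numpy as np

def pack_video(video_path, out_dir, short_edge=256):
    # Decode the video to JPEG frames with the short edge resized to 256.
    video_path, out_dir = Path(video_path), Path(out_dir)
    tmp = out_dir / (video_path.stem + "_frames")
    tmp.mkdir(parents=True, exist_ok=True)
    scale = f"scale='if(gt(iw,ih),-2,{short_edge})':'if(gt(iw,ih),{short_edge},-2)'"
    subprocess.run(["ffmpeg", "-i", str(video_path), "-vf", scale,
                    "-q:v", "2", str(tmp / "%06d.jpg")], check=True)

    # Pack the encoded JPEG bytes into an h5 file with the same base name.
    with h5py.File(out_dir / (video_path.stem + ".h5"), "w") as f:
        frames = sorted(tmp.glob("*.jpg"))
        ds = f.create_dataset("video", (len(frames),),
                              dtype=h5py.vlen_dtype(np.dtype("uint8")))
        for i, frame in enumerate(frames):
            ds[i] = np.frombuffer(frame.read_bytes(), dtype=np.uint8)

# Example: pack_video("/data/kinetics600/abc123.mp4", "/data/kinetics600_h5")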
2. Something-Something v2
You can download the dataset from the authors' webpage: https://20bn.com/datasets/something-something
Perform the same packing procedure as for Kinetics.
Usage
Training
python tools/run_net.py \
--cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
DATA.PATH_TO_DATA_DIR path_to_your_dataset
Evaluation
python tools/run_net.py \
--cfg configs/Kinetics/xvit_B16_16x16_k600.yaml \
DATA.PATH_TO_DATA_DIR path_to_your_dataset \
TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
TRAIN.ENABLE False
Acknowledgements
This repo is built using components from SlowFast and timm.
License
XViT code is released under the Apache 2.0 license.