Code + pre-trained models for the paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers

Facebook Research

Last update: Dec 23, 2022

Overview

Motionformer

This is the official PyTorch implementation of the paper Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers. This repository provides PyTorch code for training and testing the proposed Motionformer model. Motionformer uses the proposed trajectory attention to achieve state-of-the-art results on several video action recognition benchmarks, such as Kinetics-400 and Something-Something V2.

If you find Motionformer useful in your research, please use the following BibTeX entry for citation.

@misc{patrick2021keeping,
      title={Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers}, 
      author={Mandela Patrick and Dylan Campbell and Yuki M. Asano and Ishan Misra and Florian Metze and Christoph Feichtenhofer and Andrea Vedaldi and Jo{\~a}o F. Henriques},
      year={2021},
      eprint={2106.05392},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Model Zoo

We provide Motionformer models pretrained on Kinetics-400 (K400), Kinetics-600 (K600), Something-Something-V2 (SSv2), and Epic-Kitchens datasets.

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Joint	K400	16	224	79.2	94.2	model
Divided	K400	16	224	78.5	93.8	model
Motionformer	K400	16	224	79.7	94.2	model
Motionformer-HR	K400	16	336	81.1	95.2	model
Motionformer-L	K400	32	224	80.2	94.8	model

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Motionformer	K600	16	224	81.6	95.6	model
Motionformer-HR	K600	16	336	82.7	96.1	model
Motionformer-L	K600	32	224	82.2	96.0	model

name	dataset	# of frames	spatial crop	acc@1	acc@5	url
Joint	SSv2	16	224	64.0	88.4	model
Divided	SSv2	16	224	64.2	88.6	model
Motionformer	SSv2	16	224	66.5	90.1	model
Motionformer-HR	SSv2	16	336	67.1	90.6	model
Motionformer-L	SSv2	32	224	68.1	91.2	model

name	dataset	# of frames	spatial crop	A acc	N acc	url
Motionformer	EK	16	224	43.1	56.5	model
Motionformer-HR	EK	16	336	44.5	58.5	model
Motionformer-L	EK	32	224	44.1	57.6	model

Installation

First, create a conda virtual environment and activate it:

conda create -n motionformer python=3.8.5 -y
source activate motionformer

Then, install the following packages:

torchvision: pip install torchvision or conda install torchvision -c pytorch
fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
simplejson: pip install simplejson
einops: pip install einops
timm: pip install timm
PyAV: conda install av -c conda-forge
psutil: pip install psutil
scikit-learn: pip install scikit-learn
OpenCV: pip install opencv-python
tensorboard: pip install tensorboard
matplotlib: pip install matplotlib
pandas: pip install pandas
ffmpeg: pip install ffmpeg-python

Alternatively, create a conda environment with all packages from the provided YAML file:

conda env create -f environment.yml

Lastly, build the Motionformer codebase by running:

git clone https://github.com/facebookresearch/Motionformer
cd Motionformer
python setup.py build develop

Usage

Dataset Preparation

Please use the dataset preparation instructions provided in DATASET.md.

Training the Default Motionformer

Training the default Motionformer that uses trajectory attention, and operates on 16-frame clips cropped at 224x224 spatial resolution, can be done using the following command:

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

You can pass the location of your dataset on the command line by adding DATA.PATH_TO_DATA_DIR path_to_your_dataset, or you can add

DATA:
  PATH_TO_DATA_DIR: path_to_your_dataset

to the YAML config file, so that you do not need to pass it on the command line every time.
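
For intuition, the command-line overrides are flat KEY VALUE pairs whose dotted keys address nested entries of the YAML config. The sketch below mimics that behaviour on a plain dictionary. Note that apply_overrides is a hypothetical helper written for illustration, not a function from this repository, and the real config system may handle types and validation differently.

```python
import ast


def apply_overrides(cfg, overrides):
    """Apply flat [KEY, VALUE, KEY, VALUE, ...] overrides to a nested dict.

    Hypothetical helper for illustration only; it is not part of Motionformer.
    """
    if len(overrides) % 2 != 0:
        raise ValueError("overrides must be KEY VALUE pairs")
    for key, raw in zip(overrides[0::2], overrides[1::2]):
        # Parse "8" -> 8 and "False" -> False; keep plain strings such as paths.
        try:
            value = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            value = raw
        # Walk (or create) the nested sections named by the dotted key.
        *parents, leaf = key.split(".")
        node = cfg
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return cfg


cfg = {"NUM_GPUS": 8, "DATA": {"PATH_TO_DATA_DIR": ""}, "TRAIN": {"BATCH_SIZE": 8}}
apply_overrides(cfg, ["DATA.PATH_TO_DATA_DIR", "path_to_your_dataset", "NUM_GPUS", "4"])
print(cfg["DATA"]["PATH_TO_DATA_DIR"], cfg["NUM_GPUS"])
```

Entries that are not overridden (here TRAIN.BATCH_SIZE) keep the value from the config file.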

Using a Different Number of GPUs

If you want to use a smaller number of GPUs, you need to modify the .yaml configuration files in configs/. Specifically, modify the NUM_GPUS, TRAIN.BATCH_SIZE, TEST.BATCH_SIZE, and DATA_LOADER.NUM_WORKERS entries in each configuration file. Each BATCH_SIZE entry should be equal to or higher than the NUM_GPUS entry.
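
The constraint above can be checked before launching a job. The following is a minimal sketch, assuming the total batch is split evenly across GPUs (an assumption, not something this README states). The helper name per_gpu_batch_size is hypothetical.

```python
def per_gpu_batch_size(num_gpus, total_batch_size):
    """Return the number of clips each GPU processes per step.

    Assumption: the total batch is split evenly across GPUs, so it must be
    a positive multiple of num_gpus. Hypothetical helper, not repository code.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if total_batch_size < num_gpus:
        raise ValueError("BATCH_SIZE must be equal to or higher than NUM_GPUS")
    if total_batch_size % num_gpus != 0:
        raise ValueError("BATCH_SIZE should be divisible by NUM_GPUS")
    return total_batch_size // num_gpus


print(per_gpu_batch_size(num_gpus=8, total_batch_size=8))  # 1 clip per GPU
print(per_gpu_batch_size(num_gpus=4, total_batch_size=8))  # 2 clips per GPU
```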

Using Different Self-Attention Schemes

If you want to experiment with different space-time self-attention schemes, e.g., joint space-time attention or divided space-time attention, use the following commands:

python tools/run_net.py \
  --cfg configs/K400/joint_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/K400/divided_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

Training Different Motionformer Variants

If you want to train more powerful Motionformer variants, e.g., Motionformer-HR (operating on 16-frame clips sampled at 336x336 spatial resolution), and Motionformer-L (operating on 32-frame clips sampled at 224x224 spatial resolution), use the following commands:

python tools/run_net.py \
  --cfg configs/K400/motionformer_336_16x8.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_32x3.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

Note that for these models you will need a set of GPUs with ~32GB of memory.

Inference

Use TRAIN.ENABLE and TEST.ENABLE to control whether training or testing is required for a given run. When testing, you also have to provide the path to the checkpoint model via TEST.CHECKPOINT_FILE_PATH.

python tools/run_net.py \
  --cfg configs/K400/motionformer_224_16x4.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False

Alternatively, you can modify the provided Slurm script and run the following:

sbatch slurm_scripts/test.sh configs/K400/motionformer_224_16x4.yaml path_to_your_checkpoint

Single-Node Training via Slurm

To train Motionformer via Slurm, please check out our single node Slurm training script slurm_scripts/run_single_node_job.sh.

sbatch slurm_scripts/run_single_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/

Multi-Node Training via Submitit

Distributed training is available via Slurm and submitit:

pip install submitit

To train a Motionformer model on Kinetics using 8 nodes with 8 GPUs each, use the following command:

python run_with_submitit.py --cfg configs/K400/motionformer_224_16x4.yaml --job_dir /your/job/dir/${JOB_NAME}/ --partition $PARTITION --num_shards 8 --use_volta32

We provide a script for launching slurm jobs in slurm_scripts/run_multi_node_job.sh.

sbatch slurm_scripts/run_multi_node_job.sh configs/K400/motionformer_224_16x4.yaml /your/job/dir/${JOB_NAME}/

Please note that the hyper-parameters in the configs were used with 8 nodes with 8 GPUs (32 GB) each. Please scale the batch size and learning rate appropriately for your cluster configuration.
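
One common heuristic for this adjustment is the linear scaling rule, where the learning rate changes in proportion to the total batch size. This README does not say which rule was used, so treat the sketch below as an assumption. The function name and the numbers are placeholders chosen for illustration.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling heuristic: keep lr / batch_size constant.

    Assumption: this rule is a widely used default, not a setting
    confirmed for the Motionformer configs.
    """
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Placeholder values: a reference setup with batch size 256 and lr 1e-4,
# scaled down to a single-node run with batch size 32.
lr = scale_learning_rate(base_lr=1e-4, base_batch_size=256, new_batch_size=32)
print(f"{lr:.3e}")
```

Results may still differ from the reported numbers after scaling, so it is worth validating the chosen value on a short run.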

Finetuning

To finetune from an existing PyTorch checkpoint, add the following options to the command line, or set the corresponding entries in the YAML config:

TRAIN.CHECKPOINT_EPOCH_RESET True
TRAIN.CHECKPOINT_FILE_PATH path_to_your_PyTorch_checkpoint

Environment

The code was developed using Python 3.8.5 on Ubuntu 20.04. For training, we used eight GPU compute nodes, each containing 8 Tesla V100 GPUs (64 GPUs in total). Other platforms or GPU cards have not been fully tested.

License

The majority of this work is licensed under the CC BY-NC 4.0 International license. However, portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license.

Contributing

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Acknowledgements

Motionformer is built on top of PySlowFast, TimeSformer, and pytorch-image-models by Ross Wightman. We thank the authors for releasing their code. If you use our model, please consider citing these works as well:

@misc{fan2020pyslowfast,
  author =       {Haoqi Fan and Yanghao Li and Bo Xiong and Wan-Yen Lo and
                  Christoph Feichtenhofer},
  title =        {PySlowFast},
  howpublished = {\url{https://github.com/facebookresearch/slowfast}},
  year =         {2020}
}

@inproceedings{gberta_2021_ICML,
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    title = {Is Space-Time Attention All You Need for Video Understanding?},
    booktitle   = {Proceedings of the International Conference on Machine Learning (ICML)}, 
    month = {July},
    year = {2021}
}

@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}

Comments

Cannot reproduce the results

I ran Motionformer with three different settings on the Kinetics dataset.

I ran motionformer_224_16x4.yaml with batch size 8 on 8 GPUs and got 71.48 top-1 on the val set.

I ran configs/K400/joint_224_16x4.yaml and got 76.63 on the val set.

I ran configs/K400/divided_224_16x4.yaml and got 76.27 on the val set.

opened by lxtGH 8

temporal attention fix

A typo in the original code meant that the value tensors for the temporal attention step were identical to the input instead of being multiplied by a learned projection matrix (v = x rather than v = Wx). The original code is kept to facilitate replication and can be used by setting use_original_code=True, but is not recommended.
CLA Signed

opened by dylan-campbell 2
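
To make the difference described in the temporal attention fix concrete, here is a toy, pure-Python sketch of attention pooling along one trajectory, once with the values equal to the input (v = x) and once with a learned value projection (v = Wx). It is a single-head illustration only and not the repository's TrajectoryAttention code. The matrices W_k and W_v and all helper names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def temporal_pool(query, keys, values):
    """Attention-weighted sum of per-frame values for one trajectory."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    attn = softmax(scores)
    dim = len(values[0])
    return [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]


# Toy trajectory tokens for F = 3 frames with feature dimension 2.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.5, -0.5]
W_k = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical key projection
W_v = [[2.0, 0.0], [0.0, 2.0]]  # hypothetical value projection

keys = [matvec(W_k, token) for token in x]
original = temporal_pool(query, keys, values=x)  # v = x
fixed = temporal_pool(query, keys, values=[matvec(W_v, t) for t in x])  # v = Wx
print(original, fixed)
```

The attention weights are identical in both cases. Only the pooled values change, which is why the original variant lacks a learned value projection in this step.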

An unclear step in TrajectoryAttention.forward()

Hello, thanks for this great work and the shared code. A small issue I have been trying to figure out: in vit_helper.py.TrajectoryAttention.forward(), when the temporal attention is applied:


1.         x = rearrange(x, '(b h) s f d -> b s f (h d)', b=B)
2.         x_diag = rearrange(x, 'b (g n) f d -> b g n f d', g=F)
3.         x_diag = torch.diagonal(x_diag, dim1=-4, dim2=-2)
4.         x_diag = rearrange(x_diag, f'b n d f -> b (f n) d', f=F)
5.         q2 = self.proj_q(x_diag)
6.         k2, v2 = self.proj_kv(x).chunk(2, dim=-1)
7.         q2 = rearrange(q2, f'b s (h d) -> b h s d', h=h)
8.         x, k2, v2 = map(
9.             lambda t: rearrange(t, f'b s f (h d) -> b h s f d', f=F,  h=h), (x, k2, v2))
10.         q2 *= self.scale
11.         attn = torch.einsum('b h s d, b h s f d -> b h s f', q2, k2)
12.         attn = attn.softmax(dim=-1)
13.         x = torch.einsum('b h s f, b h s f d -> b h s d', attn, x)
14.         x = rearrange(x, f'b h s d -> b s (h d)')

In line 249 (line 13 here), why is the einsum operation applied to attn and x? In the paper, the temporal attention is applied as usual, but in the code it seems that \tilde{v}_{stt'} is replaced with the reshaped version of x.

In addition, I am confused because x is reshaped in line 8 here together with k2 and v2, so it might be intentional.

Thanks!

opened by ofir1080 2

Loading pretrained weights (Epic-Kitchens)

Hi! Thank you for releasing the code.

I tried to load the Epic-Kitchens pretrained weights (ek_motionformer_224_16x4.pyth), but it seems that some keys are missing from the PyTorch state_dict.

Missing Keys: {'blocks.9.attn.proj_kv.bias', 'blocks.7.attn.proj_q.weight', 'blocks.4.attn.proj_kv.weight', 'head0.weight', 'blocks.5.attn.proj_q.weight', 'patch_embed_3d.proj.weight', 'blocks.1.attn.proj_q.weight', 'head1.weight', 'blocks.10.attn.proj_q.weight', 'temp_embed', 'blocks.1.attn.proj_q.bias', 'blocks.10.attn.proj_kv.bias', 'blocks.1.attn.proj_kv.weight', 'pre_logits.fc.bias', 'blocks.5.attn.proj_q.bias', 'blocks.11.attn.proj_q.bias', 'blocks.2.attn.proj_q.weight', 'blocks.6.attn.proj_kv.weight', 'blocks.2.attn.proj_kv.bias', 'blocks.3.attn.proj_kv.bias', 'blocks.11.attn.proj_q.weight', 'blocks.10.attn.proj_q.bias', 'patch_embed_3d.proj.bias', 'blocks.8.attn.proj_kv.bias', 'blocks.3.attn.proj_q.bias', 'blocks.5.attn.proj_kv.weight', 'blocks.2.attn.proj_kv.weight', 'blocks.3.attn.proj_q.weight', 'blocks.9.attn.proj_kv.weight', 'blocks.9.attn.proj_q.weight', 'pre_logits.fc.weight', 'blocks.10.attn.proj_kv.weight', 'blocks.8.attn.proj_q.bias', 'blocks.5.attn.proj_kv.bias', 'blocks.0.attn.proj_kv.bias', 'blocks.4.attn.proj_kv.bias', 'blocks.0.attn.proj_q.bias', 'blocks.11.attn.proj_kv.weight', 'blocks.6.attn.proj_q.bias', 'head1.bias', 'blocks.3.attn.proj_kv.weight', 'blocks.7.attn.proj_kv.bias', 'blocks.8.attn.proj_q.weight', 'blocks.8.attn.proj_kv.weight', 'blocks.4.attn.proj_q.weight', 'blocks.6.attn.proj_kv.bias', 'blocks.7.attn.proj_q.bias', 'blocks.0.attn.proj_kv.weight', 'blocks.6.attn.proj_q.weight', 'blocks.4.attn.proj_q.bias', 'blocks.2.attn.proj_q.bias', 'blocks.1.attn.proj_kv.bias', 'blocks.9.attn.proj_q.bias', 'blocks.0.attn.proj_q.weight', 'head0.bias', 'blocks.11.attn.proj_kv.bias', 'blocks.7.attn.proj_kv.weight'}

Therefore, I could not reproduce the results with Epic-kitchens validation set. Could you please check the uploaded weights on Epic-kitchens?
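
One way to narrow down a report like this is to compare the parameter names the model expects with the names stored in the checkpoint, since a naming prefix mismatch looks the same as missing weights. The sketch below works on plain lists of key names. The helper names and the example keys (including the "module." prefix) are invented for illustration, and the actual cause of the mismatch reported here is not known from this thread.

```python
def compare_state_dict_keys(model_keys, checkpoint_keys):
    """Return keys the model expects but lacks, and keys it does not know."""
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return {
        "missing": sorted(model_keys - checkpoint_keys),
        "unexpected": sorted(checkpoint_keys - model_keys),
    }


def strip_prefix(keys, prefix):
    """Drop a leading prefix (hypothetical example: "module.") from key names."""
    return [k[len(prefix):] if k.startswith(prefix) else k for k in keys]


# Invented key names, standing in for model.state_dict().keys() and the
# keys of the loaded checkpoint dictionary.
model_keys = ["temp_embed", "blocks.0.attn.proj_q.weight", "head0.weight"]
checkpoint_keys = ["module.temp_embed", "module.blocks.0.attn.proj_q.weight", "head.weight"]

before = compare_state_dict_keys(model_keys, checkpoint_keys)
after = compare_state_dict_keys(model_keys, strip_prefix(checkpoint_keys, "module."))
print(before["missing"])
print(after["missing"], after["unexpected"])
```

If the "unexpected" list is non-empty, the checkpoint probably contains the weights under different names. If it is empty, the weights are likely absent from the file.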

opened by JaesungHuh 1

Kinetics dataset issues

Hi! Thanks for open-sourcing the code. What is the size of the Kinetics-400 dataset used for training and validation? https://github.com/facebookresearch/SlowFast/issues/42

opened by lxtGH 1

Reproduction for Sthv2

Hi, I just reproduced the experiment for Sthv2 with the motionformer_224_16x4.yaml config. Specifically, I trained the model on one node with 8 GPUs, using 32 samples per mini-batch. The learning rate I used is 32 / 256 * 1e-4. However, the result I obtained is 64.3%, which is lower than the one reported in your paper. I wonder whether this learning rate is compatible with that setting, since the original value in the config is 1e-4.

opened by PeiqinZhuang 0

Adding Code of Conduct file

This pull request was created automatically because we noticed your project was missing a Code of Conduct file.

Code of Conduct files facilitate respectful and constructive communities by establishing expected behaviors for project contributors.

This PR was crafted with love by Facebook's Open Source Team.
CLA Signed

opened by facebook-github-bot 0

Adding Contributing file

This pull request was created automatically because we noticed your project was missing a Contributing file.

CONTRIBUTING files explain how a developer can contribute to the project - which you should actively encourage.

This PR was crafted with love by Facebook's Open Source Team.
CLA Signed

opened by facebook-github-bot 0

Strange RGB / BGR settings in ssv2 & kinetics data loader

Hi. Thanks for the nice work.

I have some questions regarding RGB / BGR standards used by ssv2 and kinetics loaders in this repo. Directly stating, I think RGB / BGR standards are mishandled in the current codebase.

Specifically, SSv2 data loader initially reads frames in BGR standard (using OpenCV), however, the data-loader sometimes incorrectly applies functions that assume RGB input standards (e.g., ToPILImage). The final output is in BGR which is compatible with the pretrained ViT-B that assumes BGR standard.

On the other hand, kinetics data loader initially reads frames in RGB standard (using PyAV), however, the data-loader sometimes incorrectly applies functions that assume BGR input standards (e.g., color_jitter). The final output is in RGB which is incompatible with the pretrained ViT-B that assumes BGR standard.

I will try to point out problems in the order that the data loaders actually process input files.

1. Kinetics loader (https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/kinetics.py)

1-1. Kinetics loader reads mp4 videos with PyAV backend, using VideoFrame.to_rgb method

reference: https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/kinetics.py#L236-L246 https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/decoder.py#L269-L280

VideoFrame.to_rgb reads mp4 frames in RGB standard

1-2. frames_augmentation is applied, which assumes BGR standards.

Specifically, contrast jitter relies on "BGR to Grayscale" transform, which is sensitive to the channel order. As a result, the augmentation is being incorrectly applied.

1-3. The final output is RGB.

Since ViT-B assumes BGR standards, the performance can be potentially sub-optimal (though we will finetune with video datasets)

2. SSv2 loader (https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/ssv2.py)

I will try to point out problems in the order that the SSv2 loader actually processes input files.

2-1. SSv2 loader reads jpeg frames using cv2.imdecode method.

reference: https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/ssv2.py#L246-L251 https://github.com/facebookresearch/Motionformer/blob/main/slowfast/datasets/utils.py#L41-L52

cv2.imdecode reads jpeg files in BGR standard

2-2. Frames are converted to PIL images using torchvision.transforms.ToPILImage method.

As stated in the torchvision's documentation, torchvision.transforms.ToPILImage expects RGB standard, and the currently wrong channel order would lead to potentially incorrect color augmentations. Fortunately, I guess the current RandAug profile does not include channel-order sensitive augmentations.

2-3. The final output is BGR

ViT-B also follows BGR standards, hence there is no problem outputting BGR standard frames.
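
The channel-order sensitivity described in 1-2 can be illustrated on a single pixel. The sketch below assumes the standard ITU-R BT.601 luma weights for the grayscale conversion (an assumption about the transform, not a quote of the repository code). The helper names are hypothetical.

```python
# ITU-R BT.601 luma weights, listed in RGB order.
RGB_WEIGHTS = (0.299, 0.587, 0.114)
# The same weights as a BGR-based transform would apply them.
BGR_WEIGHTS = RGB_WEIGHTS[::-1]


def grayscale(pixel, weights):
    """Weighted sum of the three channels of one pixel."""
    return sum(c * w for c, w in zip(pixel, weights))


def swap_channel_order(pixel):
    """Convert RGB to BGR (or back) by reversing the channel order."""
    return pixel[::-1]


pure_red_rgb = (255, 0, 0)

correct = grayscale(pure_red_rgb, RGB_WEIGHTS)  # red weighted by 0.299
wrong = grayscale(pure_red_rgb, BGR_WEIGHTS)    # red weighted by 0.114 instead
fixed = grayscale(swap_channel_order(pure_red_rgb), BGR_WEIGHTS)

print(correct, wrong, fixed)
```

Feeding RGB frames to a BGR-based grayscale step therefore changes the gray level used by the contrast jitter. Reversing the channel order before the transform restores the intended value.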

opened by kami93 0

Question about the hyper-parameters for SSv2

Hi, may I ask what the exact learning rate is when the batch size is set to 64 on SSv2? Currently it is set to 1e-4 in the provided config, which does not seem to follow the scaling rule (e.g., 32/256 * 1e-4) mentioned for K400.

opened by PeiqinZhuang 0

Usage of keys in prototype selection

Hi, First of all, thanks a lot for your work and for providing a clear and documented repository associated with your paper! While reading your paper I wondered how you selected your most orthogonal subset in detail. By looking at the code, I see you provide both keys and queries to the function orthogonal_landmarks. However, it seems you do not use keys to select your subset. Is that an intended behavior?

Thanks !

opened by hugoych 1

MF-LONG config for SSv2

Hi, Thanks for providing this wonderful model.

I'm trying to reproduce Motionformer-L on SSv2, I see that you use: https://github.com/facebookresearch/Motionformer/blob/6c860614a3b252c6163971ba20e61ea3184d5291/configs/SSV2/motionformer_224_32x3.yaml#L4

But if I understand correctly this is equivalent to BATCH_SIZE=8, could you please clarify?

Thanks, Elad.

opened by eladb3 0
