The official PyTorch implementation of our paper "Is Space-Time Attention All You Need for Video Understanding?"

Overview

TimeSformer

This is an official PyTorch implementation of Is Space-Time Attention All You Need for Video Understanding?. In this repository, we provide PyTorch code for training and testing our proposed TimeSformer model. TimeSformer provides an efficient video classification framework that achieves state-of-the-art results on several video action recognition benchmarks, such as Kinetics-400.

If you find TimeSformer useful in your research, please use the following BibTeX entry for citation.

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?},
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Model Zoo

We provide TimeSformer models pretrained on Kinetics-400 (K400), Kinetics-600 (K600), Something-Something-V2 (SSv2), and HowTo100M datasets.

| name           | dataset | # of frames | spatial crop | acc@1 | acc@5 | url   |
|----------------|---------|-------------|--------------|-------|-------|-------|
| TimeSformer    | K400    | 8           | 224          | 77.9  | 93.2  | model |
| TimeSformer-HR | K400    | 16          | 448          | 79.6  | 94.0  | model |
| TimeSformer-L  | K400    | 96          | 224          | 80.6  | 94.7  | model |

| name           | dataset | # of frames | spatial crop | acc@1 | acc@5 | url   |
|----------------|---------|-------------|--------------|-------|-------|-------|
| TimeSformer    | K600    | 8           | 224          | 79.1  | 94.4  | model |
| TimeSformer-HR | K600    | 16          | 448          | 81.8  | 95.8  | model |
| TimeSformer-L  | K600    | 96          | 224          | 82.2  | 95.6  | model |

| name           | dataset | # of frames | spatial crop | acc@1 | acc@5 | url   |
|----------------|---------|-------------|--------------|-------|-------|-------|
| TimeSformer    | SSv2    | 8           | 224          | 59.1  | 85.6  | model |
| TimeSformer-HR | SSv2    | 16          | 448          | 61.8  | 86.9  | model |
| TimeSformer-L  | SSv2    | 64          | 224          | 62.0  | 87.5  | model |

| name        | dataset   | # of frames | spatial crop | single clip coverage | acc@1 | url   |
|-------------|-----------|-------------|--------------|----------------------|-------|-------|
| TimeSformer | HowTo100M | 8           | 224          | 8.5s                 | 56.8  | model |
| TimeSformer | HowTo100M | 32          | 224          | 34.1s                | 61.2  | model |
| TimeSformer | HowTo100M | 64          | 448          | 68.3s                | 62.2  | model |
| TimeSformer | HowTo100M | 96          | 224          | 102.4s               | 62.6  | model |

We note that these models were retrained using a slightly different implementation than the one used in the paper. Therefore, there might be a small difference in performance compared to the results reported in the paper.
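
For quick experimentation outside of the training scripts, a checkpoint from the tables above can be loaded directly through the TimeSformer class. The sketch below is hedged: the import path and the (batch, channels, frames, height, width) input layout are assumptions based on the current codebase, the constructor call mirrors the one quoted in the issues further down, and the checkpoint path is a placeholder.

import torch
from timesformer.models.vit import TimeSformer  # assumed import path after the lib -> timesformer rename

# Build the 8-frame, 224x224 divided space-time model and load a downloaded checkpoint (placeholder path).
model = TimeSformer(img_size=224, num_classes=400, num_frames=8,
                    attention_type='divided_space_time',
                    pretrained_model='path_to_your_downloaded_checkpoint.pyth')
model.eval()

video = torch.randn(1, 3, 8, 224, 224)  # dummy clip, assumed (batch, channels, frames, height, width) layout
with torch.no_grad():
    logits = model(video)
print(logits.shape)  # expected: torch.Size([1, 400])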

Installation

First, create a conda virtual environment and activate it:

conda create -n timesformer python=3.7 -y
source activate timesformer

Then, install the following packages:

  • torchvision: pip install torchvision or conda install torchvision -c pytorch
  • fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
  • simplejson: pip install simplejson
  • einops: pip install einops
  • timm: pip install timm
  • PyAV: conda install av -c conda-forge
  • psutil: pip install psutil
  • OpenCV: pip install opencv-python
  • tensorboard: pip install tensorboard

Lastly, build the TimeSformer codebase by running:

git clone https://github.com/facebookresearch/TimeSformer
cd TimeSformer
python setup.py build develop
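
If the build succeeded, the package should be importable from Python. A minimal check, assuming the codebase installs under the package name timesformer:

import timesformer  # should succeed without an ImportError after the build step above
print(timesformer.__file__)  # prints where the package was installed from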

Usage

Dataset Preparation

Please use the dataset preparation instructions provided in DATASET.md.
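
The exact annotation format is specified in DATASET.md. As a rough, hedged sketch, the Kinetics-style splits are plain-text csv files in which each line pairs a video path with an integer class label; the snippet below only illustrates that layout with placeholder paths and labels, so defer to DATASET.md for the authoritative format.

import csv

# Placeholder (video path, integer label) pairs -- purely illustrative.
samples = [
    ("/datasets/kinetics400/train/abseiling/video_0001.mp4", 0),
    ("/datasets/kinetics400/train/archery/video_0002.mp4", 1),
]

with open("train.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter=" ")
    for path, label in samples:
        writer.writerow([path, label])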

Training the Default TimeSformer

Training the default TimeSformer, which uses divided space-time attention and operates on 8-frame clips cropped at 224x224 spatial resolution, can be done with the following command:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

You may need to pass the location of your dataset on the command line by adding DATA.PATH_TO_DATA_DIR path_to_your_dataset, or you can simply add

DATA:
  PATH_TO_DATA_DIR: path_to_your_dataset

to the YAML config file; then you do not need to pass it on the command line every time.
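
The trailing KEY VALUE pairs work because the configuration system is built on fvcore's yacs-style CfgNode (as in SlowFast): values merged from the command line are applied after the YAML file and therefore take precedence. A simplified illustration using a toy config rather than the repo's real defaults:

from fvcore.common.config import CfgNode

# Toy config mirroring a few of the entries used above (not the repo's full default config).
cfg = CfgNode()
cfg.NUM_GPUS = 1
cfg.DATA = CfgNode()
cfg.DATA.PATH_TO_DATA_DIR = ""
cfg.TRAIN = CfgNode()
cfg.TRAIN.BATCH_SIZE = 8

# KEY VALUE pairs from the command line are merged last, so they override the YAML values.
cfg.merge_from_list(["DATA.PATH_TO_DATA_DIR", "path_to_your_dataset", "NUM_GPUS", "8"])
print(cfg.NUM_GPUS, cfg.DATA.PATH_TO_DATA_DIR)  # 8 path_to_your_dataset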

Using a Different Number of GPUs

If you want to use a smaller number of GPUs, you need to modify the .yaml configuration files in configs/. Specifically, you need to modify the NUM_GPUS, TRAIN.BATCH_SIZE, TEST.BATCH_SIZE, and DATA_LOADER.NUM_WORKERS entries in each configuration file. The BATCH_SIZE entry should be the same as or higher than the NUM_GPUS entry. In configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml, we provide a sample configuration file for a 4-GPU setup.

Using Different Self-Attention Schemes

If you want to experiment with different space-time self-attention schemes, e.g., space-only or joint space-time attention, use the following commands:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_spaceOnly_8x32_224.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_jointST_8x32_224.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8
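
The same schemes can also be selected when constructing the model directly in Python. A hedged sketch: the import path is assumed, and the exact strings accepted by attention_type ('space_only' and 'joint_space_time' below, as counterparts of the spaceOnly/jointST configs) should be verified in timesformer/models/vit.py.

import torch
from timesformer.models.vit import TimeSformer  # assumed import path

for scheme in ['divided_space_time', 'space_only', 'joint_space_time']:
    model = TimeSformer(img_size=224, num_classes=400, num_frames=8, attention_type=scheme)
    out = model(torch.randn(1, 3, 8, 224, 224))  # assumed (batch, channels, frames, height, width) layout
    print(scheme, out.shape)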

Training Different TimeSformer Variants

If you want to train more powerful TimeSformer variants, e.g., TimeSformer-HR (operating on 16-frame clips sampled at 448x448 spatial resolution), and TimeSformer-L (operating on 96-frame clips sampled at 224x224 spatial resolution), use the following commands:

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_16x16_448.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

and

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_96x4_224.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  NUM_GPUS 8 \
  TRAIN.BATCH_SIZE 8

Note that for these models you will need a set of GPUs with ~32GB of memory.

Inference

Use TRAIN.ENABLE and TEST.ENABLE to control whether training or testing is performed for a given run. When testing, you also have to provide the path to the model checkpoint via TEST.CHECKPOINT_FILE_PATH.

python tools/run_net.py \
  --cfg configs/Kinetics/TimeSformer_divST_8x32_224_TEST.yaml \
  DATA.PATH_TO_DATA_DIR path_to_your_dataset \
  TEST.CHECKPOINT_FILE_PATH path_to_your_checkpoint \
  TRAIN.ENABLE False

Single-Node Training via Slurm

To train TimeSformer via Slurm, please check out our single node Slurm training script slurm_scripts/run_single_node_job.sh.

Multi-Node Training via Submitit

Distributed training is available via Slurm and submitit:

pip install submitit

To train the TimeSformer model on Kinetics using 4 nodes with 8 GPUs each, use the following command:

python tools/submit.py --cfg configs/Kinetics/TimeSformer_divST_8x32_224.yaml --job_dir  /your/job/dir/${JOB_NAME}/ --num_shards 4 --name ${JOB_NAME} --use_volta32

We provide a script for launching Slurm jobs in slurm_scripts/run_multi_node_job.sh.
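
Conceptually, submitit pickles a callable and schedules it on Slurm. The sketch below only illustrates that mechanism; the train() callable, the executor parameters, and the paths are placeholders, not the actual interface of tools/submit.py.

import submitit

def train():
    # In the real script this dispatches into the training entry point with the parsed cfg.
    print("training job started")

executor = submitit.AutoExecutor(folder="/your/job/dir")
executor.update_parameters(nodes=4, gpus_per_node=8, tasks_per_node=1,
                           timeout_min=60 * 72, name="timesformer_k400")
job = executor.submit(train)
print(job.job_id)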

Finetuning

To finetune from an existing PyTorch checkpoint, add the following lines to the command line (or to the YAML config):

TRAIN.CHECKPOINT_FILE_PATH path_to_your_PyTorch_checkpoint
TRAIN.FINETUNE True
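
Finetuning can also be approximated directly in PyTorch outside of tools/run_net.py. The sketch below is hedged: the constructor call mirrors the one quoted in the issues further down, the import path and input layout are assumptions, and how the classification head is replaced for a new number of classes depends on the checkpoint and repo version.

import torch
from timesformer.models.vit import TimeSformer  # assumed import path

# Start from an existing checkpoint (placeholder path) but classify into 10 new classes.
model = TimeSformer(img_size=224, num_classes=10, num_frames=8,
                    attention_type='divided_space_time',
                    pretrained_model='path_to_your_PyTorch_checkpoint')

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9, weight_decay=1e-4)
criterion = torch.nn.CrossEntropyLoss()

clips = torch.randn(2, 3, 8, 224, 224)   # dummy batch, assumed (batch, channels, frames, height, width) layout
labels = torch.randint(0, 10, (2,))      # dummy labels for the 10 new classes
loss = criterion(model(clips), labels)
loss.backward()
optimizer.step()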

HowTo100M Dataset Split

If you want to experiment with the long-term video modeling task on HowTo100M, please download the train/test split files from here.

Environment

The code was developed using Python 3.7 on Ubuntu 20.04. For training, we used four GPU compute nodes, each containing 8 Tesla V100 GPUs (32 GPUs in total). Other platforms or GPU cards have not been fully tested.

License

The majority of this work is licensed under the CC-BY-NC 4.0 International license. However, portions of the project are available under separate license terms: SlowFast and pytorch-image-models are licensed under the Apache 2.0 license.

Contributing

We actively welcome your pull requests. Please see CONTRIBUTING.md and CODE_OF_CONDUCT.md for more info.

Acknowledgements

TimeSformer is built on top of PySlowFast and pytorch-image-models by Ross Wightman. We thank the authors for releasing their code. If you use our model, please consider citing these works as well:

@misc{fan2020pyslowfast,
  author =       {Haoqi Fan and Yanghao Li and Bo Xiong and Wan-Yen Lo and
                  Christoph Feichtenhofer},
  title =        {PySlowFast},
  howpublished = {\url{https://github.com/facebookresearch/slowfast}},
  year =         {2020}
}
@misc{rw2019timm,
  author = {Ross Wightman},
  title = {PyTorch Image Models},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  doi = {10.5281/zenodo.4414861},
  howpublished = {\url{https://github.com/rwightman/pytorch-image-models}}
}
Comments
  • Regarding attention visualisation using Attention Rollout scheme.

    Hi, thanks for the cool work. The paper mentions learned attention visualization using the Attention Rollout scheme presented in Quantifying Attention Flow in Transformers (Abnar and Zuidema, 2020). I wasn't able to locate the corresponding visualization class/files in the repo. Could you tell me if the repo contains the mentioned visualization files? If so, could you please point me to their location?

    Thanks, Victor

    opened by victor-psiori 16
  • Transfer Learning using pretrained weights

    Hello, I want to do transfer learning using the TimeSformer model. What I want to achieve is to predict 6 outputs (regression) for every 8 frames. I am a little confused about how exactly to do it. Any help is appreciated.

    Thank you

    opened by surajmahangade 10
  • How to start training from a pretrained model

    hi,

    First of all awesome job bringing transformers to video recognition!! I am comparatively new to the scene. So please bear with me if there are any errors:

    My question is how to load the weights of a pretrained model and retrain it for our dataset (unfreezing the learned weights if necessary). Let's say we have loaded a pretrained model with 400 classes (Kinetics). I want to train it on my dataset with 10 classes. So I create the CSV file and split it into train, val and test. Now I define the model: model = TimeSformer(img_size=224, num_classes=10, num_frames=8, attention_type='divided_space_time', pretrained_model='/path/to/pretrained/model.pyth')

    But this model doesn't work for my case because it's not trained on my dataset. What do I do next after this step?

    opened by jayanthante 7
  • Inconsistent of training performance on kinetics400

    Thanks for releasing the code. In my reproduction, I ran your code without modification except for my own data (~240k training and ~19.75k validation videos), but the performance is inconsistent with your paper.

    I ran the code with the config 'TimeSformer_divST_8x32_224.yaml'; the best validation top-1 accuracy after 15 epochs of training only reaches 73.63%, which is much lower than the reported ~77%. (8 x V100 GPUs; the running command is: python tools/run_net.py --cfg configs/Kinetics/TimeSformer_divST_8x32_224.yaml DATA.PATH_TO_DATA_DIR path_to_my_dataset NUM_GPUS 8 TRAIN.BATCH_SIZE 8)

    Are there some configuration different or something wrong with my setting?

    opened by Hanqer 6
  • For Kinetics-400 can't achieve the score in paper. ''Top 1 76.7''  ''Top 5 92.9''

    Hi, first of all, thank you for your amazing work on video understanding. I attempted to follow your work, but I can't achieve the best score on Kinetics-400 with the code in this repo. Is there anything that needs to be changed in the current version of the code for the best performance? Thank you again for your work!!!!

    Best regards, FightAllDays

    opened by FightAllDays 5
  • Dataloader Problem

    Thanks for your code. Could you please show me a sample .csv file? When I try to load videos, there is always an error indicating:

    Failed to meta load video idx 24899 from YvYW5eWaNKE; trial 0
    Failed to meta load video idx 24899 from YvYW5eWaNKE; trial 1
    Failed to meta load video idx 148797 from 5vPDEg8cefE; trial 0
    ......
    
    opened by Zwette 5
  • Configs for HowTo100M

    Hi,

    Thanks for this great work and repo! I'd like to know if you used different training parameters / processing for the HowTo100M task. I did a straightforward adaptation of the code and config used for Kinetics (just changing the number of classes to 1059) but it doesn't seem to work (loss doesn't decrease), both when fine-tuning from ImageNet directly / fine-tuning from Kinetics.

    Best, Antoine Yang

    opened by antoyang 5
  • Training Time Required

    Hi,

    I was trying to train the Timesformer model from scratch on Kinetics-600 and the estimated time was shown as ~9 days. In the paper it was mentioned that the training time is roughly 440 V100 GPU hours. My setup is 8x Titan V GPUs, so I assumed that the training time would be closer to 50 hours. What am I missing here?

    opened by kevaldoshi17 4
  • About UCF101 and HMDB51 results

    Dear Authors,

    Thanks for this great repo for reproducing the results in TimeSformer. I just want to quickly check whether you have experimented with the two small video classification datasets (i.e., UCF101 and HMDB51) and have some initial results.

    opened by airsplay 4
  • Refactor to make installable

    This PR renames lib -> timesformer so that one can install the library and import it locally into your project.

    • renames lib to timesformer (lib collides with Python's lib).
    • adds example notebook
    • adds environment.yml to create a fresh conda env with dependencies.
    opened by tcapelle 4
  • Loading a pretrained model minimally

    Would you mind putting up a simple notebook on how to load a pretrained model minimally? Something like:

    import torch
    from lib import TimeSformer
    simple_config = blablabla
    
    model = TimeSformer(
    simple_config,
    )
    
    video = torch.randn(2, 8, 3, 224, 224) # (batch x frames x channels x height x width)
    
    pred = model(video,) # (2, 10)
    

    so one can use the model elsewhere? I am having a hard time understanding all the file parsing.

    opened by tcapelle 4
  • During training, why is optimizer step called after 32 iterations?

    Referring to this part of code:

    if cur_global_batch_size >= cfg.GLOBAL_BATCH_SIZE:
        # Perform the backward pass.
        optimizer.zero_grad()
        loss.backward()
        # Update the parameters.
        optimizer.step()
    else:
        if cur_iter == 0:
            optimizer.zero_grad()
        loss.backward()
        if (cur_iter + 1) % num_iters == 0:
            for p in model.parameters():
                p.grad /= num_iters
            optimizer.step()
            optimizer.zero_grad()

    Is it due to batchsize being too small?

    opened by karthik1145 0
  • How to train driving dataset on timesformer?

    Hi Team, I am using a driving dataset to understand road behavior. I would like to train the transformer with predefined classes such as ['left turn', 'right turn', 'U turn'], and then test my data on TimeSformer. I would like to know how to train TimeSformer.

    opened by pgupta119 1
  • Question about train time

    It is my first time working with video transformers. I would like to know how long training will take if I have four RTX 3080 GPUs. I would appreciate it if someone could answer me!

    opened by shiyi-z 1
  • How to load pretrained SlowFast Model?

    For loading the SlowFast models, cfg.TASK is required while building the model. However, it is not defined anywhere in the config file. To load the pretrained models, I am using weights from https://github.com/facebookresearch/SlowFast/blob/main/MODEL_ZOO.md. For them to be loaded properly, cfg.TEST.CHECKPOINT_TYPE == "caffe2". Can you please let me know if this is the correct way of loading the model?

    opened by HashmatShadab 0
  • Training at epoch 1 end up with CUDA error and Assertion `t >= 0 && t < n_classes` failed

    I ran the following command:

    python tools/run_net.py \
      --cfg configs/Kinetics/TimeSformer_divST_8x32_224_4gpus.yaml \
      DATA.PATH_TO_DATA_DIR /home/ubuntu/vit/kinetics-dataset/k400/videos_resized \
      NUM_GPUS 4 \
      TRAIN.BATCH_SIZE 16

    During training in epoch 1, I observed the following error:

    [06/30 01:54:32][INFO] train_net.py: 446: Start epoch: 1
    [06/30 01:54:46][INFO] distributed.py: 995: Reducer buckets have been rebuilt in this iteration.
    [06/30 01:54:59][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.46034, "dt_data": 0.00346, "dt_net": 1.45688, "epoch": "1/15", "eta": "7:33:55", "gpu_mem": "7.68G", "iter": "10/1244", "loss": 6.05343, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
    [06/30 01:55:14][INFO] logging.py: 95: json_stats: {"_type": "train_iter", "dt": 1.50371, "dt_data": 0.00334, "dt_net": 1.50036, "epoch": "1/15", "eta": "7:47:09", "gpu_mem": "7.68G", "iter": "20/1244", "loss": 6.16927, "lr": 0.00500, "top1_err": 100.00000, "top5_err": 100.00000}
    ../aten/src/ATen/native/cuda/Loss.cu:271: nll_loss_forward_reduce_cuda_kernel_2d: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
    terminate called after throwing an instance of 'c10::CUDAError'
      what(): CUDA error: device-side assert triggered

    This is followed by a lengthy exception trace:

    Exception raised from createEvent at ../aten/src/ATen/cuda/CUDAEvent.h:166 (most recent call first):
    frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f10084d2612 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libc10.so)
    frame #1: + 0xea8e4a (0x7f1009892e4a in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
    frame #2: + 0x33a968 (0x7f1051d51968 in /home/ubuntu/anaconda3/envs/timesformer/lib/python3.7/site-packages/torch/lib/libtorch_python.so)

    .....

    In my instance, I have four Tesla T4 GPUs with driver version 510.47.03 and CUDA version 11.6.

    What does the error above mean, and how do I fix it?

    opened by kct22aws 1