Anticipative Video Transformer
Ranked first in the Action Anticipation task of the CVPR 2021 EPIC-Kitchens Challenge! (entry: AVT-FB-UT)
[project page] [paper]
If this code helps with your work, please cite:
R. Girdhar and K. Grauman. Anticipative Video Transformer. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
@inproceedings{girdhar2021anticipative,
title = {{Anticipative Video Transformer}},
author = {Girdhar, Rohit and Grauman, Kristen},
booktitle = {ICCV},
year = 2021
}
Installation
The code was tested on an Ubuntu 20.04 cluster where each server has 8 V100 16GB GPUs.
First clone the repo and set up the required packages in a conda environment. You might need to make minor modifications here if some packages are no longer available. In most cases they should be replaceable by more recent versions.
$ git clone --recursive [email protected]:facebookresearch/AVT.git
$ conda env create -f env.yaml python=3.7.7
$ conda activate avt
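As an optional sanity check (assuming the environment installs PyTorch with CUDA support, which this codebase requires), you can verify that the install worked and the GPUs are visible:
$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"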
Set up RULSTM codebase
If you plan to use EPIC-Kitchens datasets, you might need the train/test splits and evaluation code from RULSTM. This is also needed if you want to extract RULSTM predictions for test submissions.
$ cd external
$ git clone [email protected]:fpv-iplab/rulstm.git; cd rulstm
$ git checkout 57842b27d6264318be2cb0beb9e2f8c2819ad9bc
$ cd ../..
Datasets
The code expects the data in the DATA/ folder. You can also symlink it to a different folder on a faster/larger drive (see the example below). It should contain the following subfolders:

- videos/, which will contain the raw videos
- external/, which will contain pre-extracted features from prior work
- extracted_features/, which will contain other extracted features
- pretrained/, which contains pretrained models, e.g. from TIMM
The paths to these datasets are set in files like conf/dataset/epic_kitchens100/common.yaml
so you can also update the paths there instead.
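For example, a minimal way to set up the expected layout with DATA/ symlinked to a larger drive (the /mnt/bigdisk/AVT_DATA path is just a placeholder for wherever you keep the data):
$ mkdir -p /mnt/bigdisk/AVT_DATA
$ ln -s /mnt/bigdisk/AVT_DATA DATA
$ mkdir -p DATA/videos DATA/external DATA/extracted_features DATA/pretrained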
EPIC-Kitchens
To train only the AVT-h on top of pre-extracted features, you can download the features from RULSTM into DATA/external/rulstm/RULSTM/data_full
for EK55 and DATA/external/rulstm/RULSTM/ek100_data_full
for EK100. If you plan to train models on features extracted from an irCSN-152 model finetuned from IG65M, you can download our pre-extracted features from here into DATA/extracted_features/ek100/ig65m_ftEk100_logits_10fps1s/rgb/
or here into DATA/extracted_features/ek55/ig65m_ftEk55train_logits_25fps/rgb/
.
To train AVT end-to-end, you need to download the raw videos from EPIC-Kitchens. They can be organized as you wish, but this is how my folders are organized (since I first downloaded EK55 and then the remaining new videos for EK100):
DATA
├── videos
│ ├── EpicKitchens
│ │ └── videos_ht256px
│ │ ├── train
│ │ │ ├── P01
│ │ │ │ ├── P01_01.MP4
│ │ │ │ ├── P01_03.MP4
│ │ │ │ ├── ...
│ │ └── test
│ │ ├── P01
│ │ │ ├── P01_11.MP4
│ │ │ ├── P01_12.MP4
│ │ │ ├── ...
│ │ ...
│ ├── EpicKitchens100
│ │ └── videos_extension_ht256px
│ │ ├── P01
│ │ │ ├── P01_101.MP4
│ │ │ ├── P01_102.MP4
│ │ │ ├── ...
│ │ ...
│ ├── EGTEA/101020/videos/
│ │ ├── OP01-R01-PastaSalad.mp4
│ │ ...
│ └── 50Salads/rgb/
│ ├── rgb-01-1.avi
│ ...
├── external
│ └── rulstm
│ └── RULSTM
│ ├── egtea
│ │ ├── TSN-C_3_egtea_action_CE_flow_model_best_fcfull_hd
│ │ ...
│ ├── data_full # (EK55)
│ │ ├── rgb
│ │ ├── obj
│ │ └── flow
│ └── ek100_data_full
│ ├── rgb
│ ├── obj
│ └── flow
└── extracted_features
├── ek100
│ └── ig65m_ftEk100_logits_10fps1s
│ └── rgb
└── ek55
└── ig65m_ftEk55train_logits_25fps
└── rgb
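As a quick sanity check that the raw videos are where this layout expects them (paths as in the tree above), you can count the files:
$ find DATA/videos/EpicKitchens/videos_ht256px -name '*.MP4' | wc -l
$ find DATA/videos/EpicKitchens100/videos_extension_ht256px -name '*.MP4' | wc -l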
If you use a different organization, you will need to edit the train/val dataset files, such as conf/dataset/epic_kitchens100/anticipation_train.yaml. Sometimes the values are overridden in the TXT config files, so you might need to change them there too. The root property takes a list of folders where the videos can be found, and it will search through all of them, in order, for a given video. Note that we resized the EPIC videos to 256px height for faster processing; you can use the sample_scripts/resize_epic_256px.sh script to do so.
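If you want to resize a single video manually, a roughly equivalent ffmpeg command (an assumption about what the provided script does; its exact encoding flags may differ) would be:
$ ffmpeg -i P01_01.MP4 -vf scale=-2:256 -c:a copy P01_01_ht256px.MP4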
Please see docs/DATASETS.md
for setting up other datasets.
Training and evaluating models
If you want to train AVT models, you will need pre-trained models from timm. Our experiments use the following models:
$ mkdir DATA/pretrained/TIMM/
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth -O DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth -O DATA/pretrained/TIMM/jx_vit_base_p16_224-80ecf9dd.pth
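As a quick check that the downloads are intact (assuming, as is typical for these TIMM ViT checkpoints, that each file is a plain state dict), you can try loading one:
$ python -c "import torch; sd = torch.load('DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth', map_location='cpu'); print(len(sd), 'tensors')"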
The code uses hydra 1.0 for configuration, with the submitit plugin for launching jobs via SLURM. We provide a launch.py script that is a wrapper around the training scripts and can run jobs locally or launch distributed jobs. The configuration overrides for a specific experiment are defined by a TXT file. You can run a config with:
$ python launch.py -c expts/01_ek100_avt.txt
where expts/01_ek100_avt.txt
can be replaced by any TXT config file.
By default, the launcher will submit the job to a SLURM cluster. However, you can run it locally using one of the following options:

- -g: run locally in debug mode with 1 GPU and 0 workers. This allows you to place pdb.set_trace() calls to debug interactively.
- -l: run locally using all the GPUs on the local machine.
This will run training, which runs validation every few epochs. You can also run only testing using the -t flag.
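For example, using the EK100 config from above:
$ python launch.py -c expts/01_ek100_avt.txt -g   # debug locally: 1 GPU, 0 workers
$ python launch.py -c expts/01_ek100_avt.txt -l   # run locally on all available GPUs
$ python launch.py -c expts/01_ek100_avt.txt -t   # run testing only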
The outputs will be stored in OUTPUTS/<path to config>. This includes TensorBoard log files that you can use to visualize training progress.
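For instance, following that pattern, the EK100 config above would write to OUTPUTS/expts/01_ek100_avt.txt, so you could monitor training with:
$ tensorboard --logdir OUTPUTS/expts/01_ek100_avt.txt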
Model Zoo
EPIC-Kitchens-100
| Backbone | Head | Class-mean Recall@5 (Actions) | Config | Model |
| --- | --- | --- | --- | --- |
| AVT-b (IN21K) | AVT-h | 14.9 | expts/01_ek100_avt.txt | link |
| TSN (RGB) | AVT-h | 13.6 | expts/02_ek100_avt_tsn.txt | link |
| TSN (Obj) | AVT-h | 8.7 | expts/03_ek100_avt_tsn_obj.txt | link |
| irCSN152 (IG65M) | AVT-h | 12.8 | expts/04_ek100_avt_ig65m.txt | link |
Late fusing predictions
For comparison to methods that use multiple modalities, you can late-fuse predictions from multiple models using functions from notebooks/utils.py. For example, to compute the late-fused performance reported in Table 3 (val) as AVT+, which obtains 15.9 recall@5 for actions:
from notebooks.utils import *
CFG_FILES = [
    ('expts/01_ek100_avt.txt', 0),
    ('expts/03_ek100_avt_tsn_obj.txt', 0),
]
WTS = [2.5, 0.5]
print_accuracies_epic(get_epic_marginalize_late_fuse(CFG_FILES, weights=WTS)[0])
Please see docs/MODELS.md
for test submission and models on other datasets.
License
This codebase is released under the license terms specified in the LICENSE file. Any imported libraries, datasets, or other code follow the license terms set by their respective authors.
Acknowledgements
The codebase was built on top of facebookresearch/VMZ
. Many thanks to Antonino Furnari, Fadime Sener and Miao Liu for help with prior work.