Code release for ICCV 2021 paper "Anticipative Video Transformer"

Anticipative Video Transformer

Ranked first in the Action Anticipation task of the CVPR 2021 EPIC-Kitchens Challenge! (entry: AVT-FB-UT)

[project page] [paper]

If this code helps with your work, please cite:

R. Girdhar and K. Grauman. Anticipative Video Transformer. IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

@inproceedings{girdhar2021anticipative,
    title = {{Anticipative Video Transformer}},
    author = {Girdhar, Rohit and Grauman, Kristen},
    booktitle = {ICCV},
    year = 2021
}

Installation

The code was tested on an Ubuntu 20.04 cluster where each server had 8 V100 16GB GPUs.

First clone the repo and set up the required packages in a conda environment. You might need to make minor modifications here if some packages are no longer available. In most cases they should be replaceable by more recent versions.

$ git clone --recursive git@github.com:facebookresearch/AVT.git
$ conda env create -f env.yaml python=3.7.7
$ conda activate avt

Set up RULSTM codebase

If you plan to use EPIC-Kitchens datasets, you might need the train/test splits and evaluation code from RULSTM. This is also needed if you want to extract RULSTM predictions for test submissions.

$ cd external
$ git clone git@github.com:fpv-iplab/rulstm.git; cd rulstm
$ git checkout 57842b27d6264318be2cb0beb9e2f8c2819ad9bc
$ cd ../..

Datasets

The code expects the data in the DATA/ folder. You can also symlink it to a different folder on a faster/larger drive. Inside, it should contain the following folders:

  1. videos/ which will contain raw videos
  2. external/ which will contain pre-extracted features from prior work
  3. extracted_features/ which will contain other extracted features
  4. pretrained/ which contains pretrained models, e.g. from TIMM

The paths to these datasets are set in files like conf/dataset/epic_kitchens100/common.yaml so you can also update the paths there instead.
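
If you prefer to script this layout, here is a minimal sketch (not part of the repo; the folder names simply mirror the list above) that creates the expected sub-folders and can symlink DATA/ to a larger drive:

from pathlib import Path

def setup_data_dir(repo_root=".", fast_drive=None):
    """Create the DATA/ layout described above; optionally symlink it to another drive."""
    data = Path(repo_root) / "DATA"
    if fast_drive is not None and not data.exists():
        target = Path(fast_drive) / "AVT_DATA"
        target.mkdir(parents=True, exist_ok=True)
        data.symlink_to(target, target_is_directory=True)
    for sub in ("videos", "external", "extracted_features", "pretrained/TIMM"):
        (data / sub).mkdir(parents=True, exist_ok=True)

# setup_data_dir()                          # keep DATA/ inside the repo
# setup_data_dir(fast_drive="/mnt/bigssd")  # or symlink it to a faster/larger drive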

EPIC-Kitchens

To train only the AVT-h on top of pre-extracted features, you can download the features from RULSTM into DATA/external/rulstm/RULSTM/data_full for EK55 and DATA/external/rulstm/RULSTM/ek100_data_full for EK100. If you plan to train models on features extracted from an irCSN-152 model finetuned from IG65M features, you can download our pre-extracted features from here into DATA/extracted_features/ek100/ig65m_ftEk100_logits_10fps1s/rgb/ or here into DATA/extracted_features/ek55/ig65m_ftEk55train_logits_25fps/rgb/.

To train AVT end-to-end, you need to download the raw videos from EPIC-Kitchens. They can be organized as you wish, but this is how my folders are organized (since I first downloaded EK55 and then the remaining new videos for EK100):

DATA
├── videos
│   ├── EpicKitchens
│   │   └── videos_ht256px
│   │       ├── train
│   │       │   ├── P01
│   │       │   │   ├── P01_01.MP4
│   │       │   │   ├── P01_03.MP4
│   │       │   │   ├── ...
│   │       └── test
│   │           ├── P01
│   │           │   ├── P01_11.MP4
│   │           │   ├── P01_12.MP4
│   │           │   ├── ...
│   │           ...
│   ├── EpicKitchens100
│   │   └── videos_extension_ht256px
│   │       ├── P01
│   │       │   ├── P01_101.MP4
│   │       │   ├── P01_102.MP4
│   │       │   ├── ...
│   │       ...
│   ├── EGTEA/101020/videos/
│   │   ├── OP01-R01-PastaSalad.mp4
│   │   ...
│   └── 50Salads/rgb/
│       ├── rgb-01-1.avi
│       ...
├── external
│   └── rulstm
│       └── RULSTM
│           ├── egtea
│           │   ├── TSN-C_3_egtea_action_CE_flow_model_best_fcfull_hd
│           │   ...
│           ├── data_full  # (EK55)
│           │   ├── rgb
│           │   ├── obj
│           │   └── flow
│           └── ek100_data_full
│               ├── rgb
│               ├── obj
│               └── flow
└── extracted_features
    ├── ek100
    │   └── ig65m_ftEk100_logits_10fps1s
    │       └── rgb
    └── ek55
        └── ig65m_ftEk55train_logits_25fps
            └── rgb

If you use a different organization, you will need to edit the train/val dataset files, such as conf/dataset/epic_kitchens100/anticipation_train.yaml. Sometimes the values are overridden in the TXT config files, so you might need to change them there too. The root property takes a list of folders where the videos can be found, and the code will search through them in order for a given video. Note that we resized the EPIC videos to 256px height for faster processing; you can use the sample_scripts/resize_epic_256px.sh script to do the same.
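
For intuition, the list-valued root behaves like an ordered search path; a minimal sketch of that behaviour (illustrative only; the helper name here is hypothetical, not the repo's actual code):

import os

def find_video(roots, rel_path):
    """Return the first existing <root>/<rel_path>, trying the roots in order."""
    for root in roots:
        candidate = os.path.join(root, rel_path)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError(f"{rel_path} not found under any of {roots}")

# find_video(["DATA/videos/EpicKitchens/videos_ht256px/train",
#             "DATA/videos/EpicKitchens100/videos_extension_ht256px"],
#            "P01/P01_101.MP4")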

Please see docs/DATASETS.md for setting up other datasets.

Training and evaluating models

If you want to train AVT models, you would need pre-trained models from timm. We have experiments that use the following models:

$ mkdir DATA/pretrained/TIMM/
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_patch16_224_in21k-e5005f0a.pth -O DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth
$ wget https://github.com/rwightman/pytorch-image-models/releases/download/v0.1-vitjx/jx_vit_base_p16_224-80ecf9dd.pth -O DATA/pretrained/TIMM/jx_vit_base_p16_224-80ecf9dd.pth
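
To sanity-check a downloaded checkpoint before training, here is a minimal sketch using timm (an assumption-laden check, not the repo's own loading code; the exact timm model name and checkpoint keys can differ across timm versions):

import timm
import torch

# Build the matching ViT and try loading the downloaded weights (non-strict, so
# any head/key mismatches are reported rather than raising an error).
model = timm.create_model('vit_base_patch16_224_in21k', pretrained=False)
state = torch.load('DATA/pretrained/TIMM/jx_vit_base_patch16_224_in21k-e5005f0a.pth',
                   map_location='cpu')
missing, unexpected = model.load_state_dict(state, strict=False)
print(f'missing keys: {len(missing)}, unexpected keys: {len(unexpected)}')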

The code uses hydra 1.0 for configuration, with the submitit plugin for launching jobs via SLURM. We provide a launch.py script that wraps the training scripts and can run jobs locally or launch distributed jobs. The configuration overrides for a specific experiment are defined in a TXT file. You can run a config with:

$ python launch.py -c expts/01_ek100_avt.txt

where expts/01_ek100_avt.txt can be replaced by any TXT config file.

By default, the launcher will launch the job to a SLURM cluster. However, you can run it locally using one of the following options:

  1. -g to run locally in debug mode with 1 GPU and 0 workers; this allows you to place pdb.set_trace() calls to debug interactively.
  2. -l to run locally using all the GPUs available on the local machine.

This will run training, with validation every few epochs. You can also run only the testing using the -t flag.

The outputs will be stored in OUTPUTS/<path to config>. This includes tensorboard files that you can use to visualize training progress.

Model Zoo

EPIC-Kitchens-100

| Backbone | Head | Class-mean Recall@5 (Actions) | Config | Model |
| --- | --- | --- | --- | --- |
| AVT-b (IN21K) | AVT-h | 14.9 | expts/01_ek100_avt.txt | link |
| TSN (RGB) | AVT-h | 13.6 | expts/02_ek100_avt_tsn.txt | link |
| TSN (Obj) | AVT-h | 8.7 | expts/03_ek100_avt_tsn_obj.txt | link |
| irCSN152 (IG65M) | AVT-h | 12.8 | expts/04_ek100_avt_ig65m.txt | link |

Late fusing predictions

For comparison to methods that use multiple modalities, you can late-fuse predictions from multiple models using functions from notebooks/utils.py. For example, to compute the late-fused performance reported in Table 3 (val) as AVT+ (which obtains 15.9 recall@5 for actions):

from notebooks.utils import *
CFG_FILES = [
    ('expts/01_ek100_avt.txt', 0),
    ('expts/03_ek100_avt_tsn_obj.txt', 0),
]
WTS = [2.5, 0.5]
print_accuracies_epic(get_epic_marginalize_late_fuse(CFG_FILES, weights=WTS)[0])
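
For intuition, late fusion here amounts to a weighted combination of each model's class scores before computing the metrics; a minimal sketch of that idea (assumed behaviour, not the actual implementation in notebooks/utils.py):

import numpy as np

def late_fuse(scores_per_model, weights):
    """Weighted average of per-model class scores, each of shape (num_clips, num_classes)."""
    fused = sum(w * np.asarray(s) for w, s in zip(weights, scores_per_model))
    return fused / sum(weights)

# e.g. fused = late_fuse([avt_scores, tsn_obj_scores], weights=[2.5, 0.5])  # hypothetical arrays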

Please see docs/MODELS.md for test submission and models on other datasets.

License

This codebase is released under the license terms specified in the LICENSE file. Any imported libraries, datasets or other code follows the license terms set by respective authors.

Acknowledgements

The codebase was built on top of facebookresearch/VMZ. Many thanks to Antonino Furnari, Fadime Sener and Miao Liu for help with prior work.

Comments
  • Couple questions about classification loss

    Hi @rohitgirdhar,

    Thanks for your great work -- I found it very interesting and plan to use it in my work! I was hoping to clear up exactly how the loss functions are working with the feature decoding since I was a little confused:

    The decoder at each timestep from 1..t outputs features (in a causal manner), which are then passed through a linear layer to obtain predicted frame features. Another linear layer on top of this then predicts a distribution over action classes. Thus, we have t action predictions. Do the predictions for timestep 1 use the action label from timestep 2? The prediction for timestep t, from my understanding, represents the action at timestep t+1 (the next action we want to anticipate). Based on the implementation, I was wondering whether the classification loss is also computed against the next frame's labels, so that the first frame's label is not used? Sorry if this is confusing, hope you can help clear up my understanding!
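
    For concreteness, this is how I understand the shifted-target setup, as a generic sketch (not necessarily the exact loss code in this repo):

    import torch.nn.functional as F

    def shifted_cls_loss(logits, labels):
        """logits: (B, T, C) causal predictions; labels: (B, T) per-timestep action labels.
        The prediction at step i is matched to the label at step i+1, so the label of the
        first frame is never used as a target, and the last prediction anticipates the
        next (future) action."""
        pred = logits[:, :-1]    # predictions made at steps 1..T-1
        target = labels[:, 1:]   # labels of steps 2..T
        return F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))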

    opened by zerodecoder1 9
  • Unable to reproduce '01_ek100_avt.txt' val result

    Hi @rohitgirdhar, I'm trying to reproduce the experiment '01_ek100_avt.txt'. After training, I read my evaluation results using your example code in README.md (the one from notebooks.utils), and I got these outputs:

    [('expts/01_ek100_avt.txt', 0)] Accuracies verb/noun/action: 32.3 77.3 22.3 51.8 13.6 32.8
    [('expts/01_ek100_avt.txt', 0)] Mean class top-1 accuracies verb/noun/action: 6.0 7.9 1.5
    [('expts/01_ek100_avt.txt', 0)] Recall@5 verb/noun/action: 22.3 28.7 12.0
    [('expts/01_ek100_avt.txt', 0)] Recall@5 many shot verb/noun/action: nan nan 12.0
    [('expts/01_ek100_avt.txt', 0)] Recall@5 tail verb/noun/action: 13.9 19.3 8.9
    [('expts/01_ek100_avt.txt', 0)] Recall@5 unseen verb/noun/action: 27.5 26.0 12.6
    

    My reproduced result is 12.0, lower than your reported 14.9. Am I following the right evaluation process?

    The only thing I changed in training is the data-reading interface. Could you please help me check whether this could be the reason the reproduction fails? The change is shown below.

    I wrote a PictureReader to read frames one by one, in place of the DefaultReader:

    class PictureReader(Reader):
        def forward(self, video_path, start, end, fps, df_row, **kwargs):
            del df_row
            start = int(max(start * fps, 1))
            end = int(end * fps)
            if end <= start: return torch.Tensor()
            
            video_path = video_path[:-4]
            video = []
            for i in range(start, end):
                picture = torchvision.io.read_image(video_path+'/frame_{:010d}.jpg'.format(i))
                video.append(picture.permute(1,2,0))
            return torch.stack(video), {}, {}
    

    As for get_frame_rate(), I got the fps from the annotation df, so I added a line in __init__ of the EPICKitchens class: df['fps'] = df['start_frame'] / df['start']

    opened by zhoumumu 7
  • Unable to reproduce val results

    Hi @rohitgirdhar, I'm trying to test the irCSN-152 (IG65M) model for EK-55. I used the model https://dl.fbaipublicfiles.com/avt/checkpoints/expts/10_ek55_avt_ig65m.txt/0/checkpoint.pth and the config expts/10_ek55_avt_ig65m.txt, and added these lines to the config:

    test_only=true
    train.init_from_model=[[${cwd}/DATA/models/10_ek55_avt_ig65m.pth]]
    

    However, I'm getting

    [2021-10-05 12:37:04,999][root][INFO] - Reading from resfiles
    [2021-10-05 12:37:11,072][func.train][INFO] - []
    [2021-10-05 12:37:11,073][root][INFO] - iter_time: 0.294328
    [2021-10-05 12:37:11,073][root][INFO] - data_time: 0.135377
    [2021-10-05 12:37:11,074][root][INFO] - loss: 6.164686
    [2021-10-05 12:37:11,074][root][INFO] - acc1/action: 7.351763
    [2021-10-05 12:37:11,074][root][INFO] - acc5/action: 19.931891
    [2021-10-05 12:37:11,074][root][INFO] - cls_action: 6.134162
    [2021-10-05 12:37:11,074][root][INFO] - feat: 0.030524
    

    which is far from the reported 14.4 / 31.7 top-1/top-5 performance. Do you know what might be wrong here?

    opened by haofengac 7
  • Can't create video dataset

    Hi, I am trying to run the code on the EK100 dataset, but I notice that it is not able to compute any video clips during dataset creation, as here: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/datasets/data.py#L41

    The _dataset variable is an EpicKitchen class object, since the _target variable defined here is of the same type: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/conf/dataset/epic_kitchens100/anticipation_train.yaml#L3

    Therefore it is not able to execute this line: https://github.com/facebookresearch/AVT/blob/2d6781d5315a4c53bd059b1cd11ee46bd4427648/datasets/data.py#L46, and it executes the except branch after this.

    Can you help me in this regard? Maybe I am doing something wrong! Thanks :)

    opened by sanketsans 5
  • Env Create Conflicts

    Hi,

    Your work is so impressive. Actually, I am a beginner in this field. I tried to use the env.yaml file you provide to create a conda env via conda env create -f env.yaml python=3.7.7, but it reports that conflicts were found. What should I do?

    Thank you in advance!

    opened by GOZGOXGOYGO 5
  • Regarding sampling method and strategy

    Hi again :) I have some questions regarding the frame sampling method; it would be highly appreciated if you could clear this up :) Looking at the expt. 09_ek55_avt settings, the parameters are:

    • tau_o: 20sec
    • original fps (of EK): 60
    • req_fps: 1
    • frames_per_clip: 10
    • sampling_strategy: "last_clip"

    As I see it, in practice, out of a 20-second input clip (1200 frames) only 10 frames are sampled, where each frame represents a 1-second sub-clip. As a result (due to req_fps=1 and sampling_strategy='last_clip'), the model only looks at the last 10 seconds. Is that correct? If so, what is the actual role of tau_o?
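
    For concreteness, this is how I read those settings, as a hypothetical sketch (not the repo's actual sampler):

    def sample_last_clip(num_frames, orig_fps=60, req_fps=1, frames_per_clip=10):
        """Pick frames_per_clip frame indices at req_fps from the end of the observed span."""
        step = int(orig_fps / req_fps)               # 60 original frames between sampled frames
        last = num_frames - 1
        first = last - step * (frames_per_clip - 1)  # i.e. roughly the final 10 seconds
        return list(range(first, last + 1, step))

    # sample_last_clip(20 * 60)  # a 20 s (tau_o) observation at 60 fps -> only 10 indices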

    Thank you very much!

    opened by ofir1080 4
  • Loading pretrained ViT base as a backbone

    Hi there! I was wondering what the difference is between the ViT ckpt file you supplied and simply loading timm's vit_base_patch16_224 via timm.create_model with the pretrained=True flag. Basically, they were both pretrained on IN1k. Am I correct? Thanks!

    opened by ofir1080 3
  • Data Augmentation and End-to-end Training

    Hi,

    I'm wondering if you have any results on performance without data augmentation and end-to-end training? If not, do you know how much the augmentations helped in terms of overfitting and performance? Currently, I'm trying to run longer time-horizon experiments without augmentations, but they don't seem to perform much better than the 2-second-horizon experiments. Thank you!

    opened by okay-okay 3
  • Some questions about future_prediction.py

    Hi,

    I'm trying to re-implement the model from the paper, and for a time horizon of 2 seconds I'm able to reach the same recall; however, increasing the time horizon did not result in an increase in accuracy or recall. I noticed that here: https://github.com/facebookresearch/AVT/blob/b372773a2fd75e295da3ab737111363b4c860546/models/future_prediction.py#L71, there is a block of code on kmeans/centroids that is not described in the paper, so I'm unsure whether this is the insight I'm missing for longer horizons. Would you be able to shed some light on what this code is doing? Additionally, do you have any tips on how to get an increase in accuracy/recall when increasing the horizon time?

    opened by okay-okay 3
  • Question about Object/Image Features

    Hi, I was just wondering how exactly the object features are used in the model? At each timestep, does the model consider both the image features and object features concatenated? Are you familiar with how to extract these object features from raw frames (e.g., will it work on a lower-resolution image)? Thanks!

    opened by okay-okay 3
  • Can I train AVT on RGB frames?

    Hi, the official EPIC-Kitchens release provides both the RGB frames and the raw video sources. I want to ask: can I train your model on the RGB frames, since the raw videos are too large for me to download? If yes, would it lead to lower performance?

    thanks!

    opened by forrestsz 2
  • Problem when reproducing with Breakfast/50Salad data

    Hi @rohitgirdhar, thank you for the great idea in video anticipation.

    I'm reproducing your work with the Breakfast/50Salads data, and I've run into 2 problems.

    1. Do I have to put the Breakfast videos and the 50Salads videos in the same folder?

    2. While training with 50Salads, ffmpeg returns an error: marker does not match f_code (image below). Training still proceeds despite the error, but how can I handle it? I've tried converting the avi files to mp4 and resizing them, but that did not work. Is this a codec problem?

    Thank you

    opened by wasabipretzel 0
  • The input/output feature dimensions of Transformer Encoder and Causal Transformer Decoder?

    Hi, thanks for your great project! I am wondering about the input/output feature dimensions of the Transformer Encoder. The description in Section 4.1 of the paper suggests the input/output feature dimensions are both 768D; is that right? However, Section 4.4 says the input feature dimension of the Causal Transformer Decoder is 2048D. What is the output feature dimension of the Causal Transformer Decoder? And is there a dimension conversion (768D -> 2048D) before the Causal Transformer Decoder?

    opened by yxgz 1
  • Long-term anticipation

    Hi, thanks for the repo! Could you please explain how to perform long-term anticipation as shown in the paper? Have I missed this part in the code, or is it not implemented here yet?

    opened by RodinIvan 1
  • Generate submitted file

    We can only see the accuracy on the validation set when we run python launch.py -c expts/02_ek100_avt_tsn.txt -l -t.

    Can you tell us how to generate the submission file for the EK100 challenge?

    opened by interestingzhuo 3
  • Cannot train/test with video data

    How can I train/test with video data?

    When I run python launch.py -c expts/09_ek55_avt.txt -t -g, I obtain the warning "No video_clips present":

    [2022-02-08 17:15:53,778][func.train][INFO] - Computing clips...
    [2022-02-08 17:15:53,779][func.train][WARNING] - No video_clips present
    [2022-02-08 17:15:53,779][func.train][INFO] - Created dataset with 23430 elts
    

    I have downloaded and cropped the videos and set up the same folder structure as in the README file.

    Executing hasattr(_dataset, 'video_clips') returns False. How can I add video_clips to _dataset so that compute_clips in datasets/data.py executes properly?

    opened by CodyQ3 8
Owner
Facebook Research