Temporal-attentive-Covariance-Pooling-Networks-for-Video-Recognition
This is an implementation of TCPNet.
Introduction
For video recognition, a global representation summarizing the whole content of a video snippet plays an important role in final performance. However, existing video architectures usually generate it with simple global average pooling (GAP), which has limited ability to capture the complex dynamics of videos. For image recognition, there is evidence that covariance pooling has stronger representation ability than GAP. Unfortunately, the plain covariance pooling used in image recognition is an orderless representation, which cannot model the spatio-temporal structure inherent in videos. Therefore, this paper proposes Temporal-attentive Covariance Pooling (TCP), inserted at the end of deep architectures, to produce powerful video representations. Specifically, TCP first develops a temporal attention module to adaptively calibrate spatio-temporal features for the succeeding covariance pooling, approximately producing attentive covariance representations. Then, a temporal covariance pooling performs temporal pooling of the attentive covariance representations to characterize both intra-frame correlations and inter-frame cross-correlations of the calibrated features. As such, the proposed TCP can capture complex temporal dynamics. Finally, a fast matrix power normalization is introduced to exploit the geometry of covariance representations. Note that TCP is model-agnostic and can be flexibly integrated into any video architecture, resulting in TCPNet for effective video recognition. Extensive experiments on six benchmarks (e.g., Kinetics, Something-Something V1 and Charades) using various video architectures show that TCPNet is clearly superior to its counterparts while having strong generalization ability.
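To make the pooling idea more concrete, here is a minimal NumPy sketch of covariance pooling followed by a matrix square-root normalization. It is a simplified illustration and not the code in this repository: it averages per-frame covariances over time and omits the temporal attention module and the inter-frame cross-correlations. The function names, the `(T, C, N)` feature layout, and the use of coupled Newton-Schulz iterations with a power of 1/2 are all assumptions made for the example.

```python
import numpy as np


def frame_covariance(x):
    """Covariance of one frame's features.

    x has shape (C, N): C channels, N spatial positions (assumed layout).
    """
    centered = x - x.mean(axis=1, keepdims=True)
    return centered @ centered.T / x.shape[1]


def newton_schulz_sqrt(a, num_iters=5):
    """Approximate the square root of a symmetric positive definite matrix.

    Uses coupled Newton-Schulz iterations, which need only matrix products.
    """
    identity = np.eye(a.shape[0])
    trace = np.trace(a)
    y = a / trace            # pre-normalize so the iteration converges
    z = identity.copy()
    for _ in range(num_iters):
        t = 0.5 * (3.0 * identity - z @ y)
        y = y @ t            # y converges to sqrt(a / trace)
        z = t @ z            # z converges to the inverse of that square root
    return y * np.sqrt(trace)  # undo the pre-normalization


def temporal_covariance_pooling(features, num_iters=5):
    """Pool features of shape (T, C, N) into one normalized (C, C) matrix.

    Simplification: per-frame covariances are averaged over the T frames.
    """
    covs = np.stack([frame_covariance(frame) for frame in features])
    pooled = covs.mean(axis=0)
    return newton_schulz_sqrt(pooled, num_iters)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clip_features = rng.standard_normal((8, 4, 64))  # 8 frames, 4 channels, 64 positions
    representation = temporal_covariance_pooling(clip_features)
    print(representation.shape)
```

Unlike GAP, which would reduce the same features to a length-C vector of channel means, the result is a C x C matrix of second-order channel statistics. The iterative square root is used here because it relies only on matrix multiplications; a small, fixed number of iterations gives an approximation rather than an exact root.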
Citation
@InProceedings{Gao_2021_TCP,
author = {Gao, Zilin and Wang, Qilong and Zhang, Bingbing and Hu, Qinghua and Li, Peihua},
title = {Temporal-attentive Covariance Pooling Networks for Video Recognition},
booktitle = {arXiv preprint arXiv:2021.06xxx},
year = {2021}
}
Model Zoo
Kinetics-400
Method | Backbone | frames | 1 crop Acc (Top-1/Top-5, %) | 30 views Acc (Top-1/Top-5, %) | Model | Pretrained Model | test log |
---|---|---|---|---|---|---|---|
TCPNet | TSN R50 | 8f | 72.4/90.4 | 75.3/91.8 | K400_TCP_TSN_R50_8f | Img1K_R50_GCP | log |
TCPNet | TEA R50 | 8f | 73.9/91.6 | 76.8/92.9 | K400_TCP_TEA_R50_8f | Img1K_Res2Net50_GCP | log |
TCPNet | TSN R152 | 8f | 75.7/92.2 | 78.3/93.7 | K400_TCP_TSN_R152_8f | Img11K_1K_R152_GCP | log |
TCPNet | TSN R50 | 16f | 73.9/91.2 | 75.8/92.1 | K400_TCP_TSN_R50_16f | Img1K_R50_GCP | log |
TCPNet | TEA R50 | 16f | 75.3/92.2 | 77.2/93.1 | K400_TCP_TEA_R50_16f | Img1K_Res2Net50_GCP | log |
TCPNet | TSN R152 | 16f | 77.2/93.1 | 79.3/94.0 | K400_TCP_TSN_R152_16f | Img11K_1K_R152_GCP | TODO |
Mini-Kinetics-200
Method | Backbone | frames | 1 crop Acc (%) | 30 views Acc (%) | Model | Pretrained Model |
---|---|---|---|---|---|---|
TCPNet | TSN R50 | 8f | 78.7 | 80.7 | K200_TCP_TSN_8f | K400_TCP_TSN_R50_8f |
Environments
- PyTorch v1.0+ (for TCP_TSN); v1.0~1.4 (for TCP+TEA)
- ffmpeg
- graphviz: pip install graphviz
- tensorboard: pip install tensorboardX
- tqdm: pip install tqdm
- scikit-learn: conda install scikit-learn
- matplotlib: conda install -c conda-forge matplotlib
- fvcore: pip install 'git+https://github.com/facebookresearch/fvcore'
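As a quick sanity check before training, a short script along the following lines can report which of the dependencies above are not yet available. This is a sketch and not part of the repository; the helper names are hypothetical, and the import names (`torch`, `tensorboardX`, `sklearn`, and so on) are the usual module names for these packages.

```python
import importlib.util
import shutil

# Import names for the Python packages listed above (assumed module names).
REQUIRED_MODULES = [
    "torch",
    "graphviz",
    "tensorboardX",
    "tqdm",
    "sklearn",
    "matplotlib",
    "fvcore",
]

# Command-line tools that must be on PATH.
REQUIRED_TOOLS = ["ffmpeg"]


def find_missing_modules(modules):
    """Return the module names that cannot be located by the import system."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


def find_missing_tools(tools):
    """Return the executables that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES) + find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing dependencies: " + ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```

The check only confirms that each package can be located; it does not verify the PyTorch version constraints listed above.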
Dataset Preparation
We provide a detailed dataset preparation guideline for Kinetics-400 and Mini-Kinetics-200. See Dataset preparation.
StartUp
- Download the pretrained model and put it in pretrained_models/
- Run the training script, e.g.:
  sh script/K400/train_TCP_TSN_8f_R50.sh
- Run the inference script, e.g.:
  sh script/K400/test_TCP_TSN_R50_8f.sh
TCP Code
├── ops
| ├── TCP
| | ├── TCP_module.py
| | ├── TCP_att_module.py
| | ├── TSA.py
| | └── TCA.py
| ├ ...
├ ...
Acknowledgement
- We thank TSM for providing a well-designed 2D action recognition toolbox.
- We also borrow some functions from iSQRT, TEA and Non-local.
- The Mini-K200 dataset sampling strategy follows Mini_K200.
- We would like to thank Facebook for developing the PyTorch toolbox.
Thanks for their work!