Relational Self-Attention: What's Missing in Attention for Video Understanding
This repository is the official implementation of "Relational Self-Attention: What's Missing in Attention for Video Understanding" by Manjin Kim*, Heeseung Kwon*, Chunyu Wang, Suha Kwak, and Minsu Cho (*equal contribution).
Requirements
- Python: 3.7.9
- PyTorch: 1.6.0
- TorchVision: 0.2.1
- CUDA: 10.1
- Conda environment: environment.yml
To install requirements:
```bash
conda env create -f environment.yml
conda activate rsa
```
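After activation, a quick sanity check can confirm that PyTorch and the GPU are visible (a minimal sketch; the expected version numbers simply mirror the list above):

```python
# sanity_check.py -- minimal environment check (illustrative, not part of the repository)
import torch
import torchvision

print("PyTorch:", torch.__version__)            # expected 1.6.0 per the requirements above
print("TorchVision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```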
Dataset Preparation
- Download the Something-Something v1 & v2 (SSv1 & SSv2) datasets and extract RGB frames. Download URLs: SSv1, SSv2
- Create txt files that define the training & validation splits. Each line is formatted as [video_path] [#frames] [class_label]. Refer to the txt files in the ./data directory for examples; a sketch for generating such a file is shown below.
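As an illustration only, the following sketch writes a split file in that format. It assumes each video's RGB frames were extracted into its own directory and that class labels come from a user-provided mapping; the directory layout and label source are assumptions, not part of the repository:

```python
# make_split.py -- illustrative sketch for writing a "[video_path] [#frames] [class_label]" file.
# Assumes frames were extracted as image files, one directory per video (layout is an assumption).
import os

def write_split(video_dirs, labels, out_txt):
    """video_dirs: list of frame directories; labels: dict mapping directory -> integer class label."""
    with open(out_txt, "w") as f:
        for vdir in video_dirs:
            n_frames = len([x for x in os.listdir(vdir) if x.endswith((".jpg", ".png"))])
            f.write(f"{vdir} {n_frames} {labels[vdir]}\n")

# Example usage (paths and labels are placeholders):
# write_split(["frames/ssv1/12345"], {"frames/ssv1/12345": 27}, "data/train_split.txt")
```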
Training
To train RSANet-R50 on the SSv1 or SSv2 dataset as in the paper, run one of the following commands; each script takes an experiment name and the number of input frames as arguments:
```bash
# For SSv1
./scripts/train_Something_v1.sh
# example: ./scripts/train_Something_v1.sh RSA_R50_SSV1_16frames 16

# For SSv2
./scripts/train_Something_v2.sh
# example: ./scripts/train_Something_v2.sh RSA_R50_SSV2_16frames 16
```
Evaluation
To evaluate RSANet-R50 on the SSv1 or SSv2 dataset as in the paper, run one of the following commands; each script takes an experiment name, a checkpoint filename, and the number of input frames:
```bash
# For SSv1
./scripts/test_Something_v1.sh
# example: ./scripts/test_Something_v1.sh RSA_R50_SSV1_16frames resnet_rgb_model_best.pth.tar 16

# For SSv2
./scripts/test_Something_v2.sh
# example: ./scripts/test_Something_v2.sh RSA_R50_SSV2_16frames resnet_rgb_model_best.pth.tar 16
```
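To inspect a checkpoint before evaluation, a minimal sketch like the one below can be used. It assumes the .pth.tar file is a standard torch.save dictionary; the exact keys written by the training scripts may differ:

```python
# inspect_checkpoint.py -- illustrative sketch; key names inside the checkpoint are assumptions.
import torch

ckpt = torch.load("resnet_rgb_model_best.pth.tar", map_location="cpu")

# If the file holds a plain dict of tensors, parameter names are printed directly;
# otherwise a nested 'state_dict' entry (a common convention) is used.
state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
print("number of entries:", len(state))
for name in list(state)[:5]:
    print(name)
```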
Results
Our model achieves the following performance on Something-Something-V1 and Something-Something-V2:
| model | dataset | frames | top-1 / top-5 acc. (%) | logs | checkpoints |
|---|---|---|---|---|---|
| RSANet-R50 | SSV1 | 16 | 54.0 / 81.1 | [log] | [checkpoint] |
| RSANet-R50 | SSV2 | 16 | 66.0 / 89.9 | [log] | [checkpoint] |
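For reference, top-1 / top-5 accuracy can be computed from model logits as in the sketch below (a generic implementation, not the repository's evaluation code):

```python
# topk_accuracy.py -- generic top-1 / top-5 accuracy over a batch of logits (illustrative).
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """logits: (N, num_classes) tensor; targets: (N,) tensor of class indices."""
    maxk = max(ks)
    _, pred = logits.topk(maxk, dim=1)          # (N, maxk) predicted class indices
    correct = pred.eq(targets.unsqueeze(1))     # (N, maxk) boolean hits
    return [correct[:, :k].any(dim=1).float().mean().item() * 100.0 for k in ks]

# Example: random logits for 4 samples over the 174 SSv1/SSv2 classes
acc1, acc5 = topk_accuracy(torch.randn(4, 174), torch.randint(0, 174, (4,)))
```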