[arXiv]
This is an official PyTorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition.
Content
- Prerequisites
- Data Preparation
- Updates
- Pretrained Models
- Testing
- Training
- Contributors
- Citing ActionCLIP
- Acknowledgments
Prerequisites
The code is built with PyTorch and a few supporting libraries; see INSTALL.md for the full list and installation instructions.
For video data pre-processing, you may also need ffmpeg.
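As a rough sketch of environment setup (the package names and versions below are illustrative assumptions; INSTALL.md is the authoritative reference):
# illustrative only -- follow INSTALL.md for the exact, tested versions
conda create -n actionclip python=3.8 -y
conda activate actionclip
pip install torch torchvision   # core PyTorch packages
pip install ftfy regex tqdm     # typical CLIP dependencies (assumed)
# ffmpeg is available from your system package manager, e.g. apt-get install ffmpeg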
Data Preparation
We first extract videos into frames for fast data loading. Please refer to the TSN repo for a detailed guide to data pre-processing, and see the example below. We have successfully trained on Kinetics, UCF101, HMDB51 and Charades.
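As a minimal illustration of the frame-extraction step (the directory layout and naming scheme here are assumptions; follow the TSN repo's extraction scripts for the exact format expected by this repo):
# extract frames from one video (illustrative paths and naming)
mkdir -p frames/video_001
ffmpeg -i videos/video_001.mp4 -q:v 2 frames/video_001/img_%05d.jpg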
Updates
- We now support single-crop validation (including zero-shot) on Kinetics-400, UCF101 and HMDB51. See MODEL_ZOO.md for more information on the pretrained models.
- We now support model training on Kinetics-400, UCF101 and HMDB51 with 8, 16 and 32 frames. See configs/README.md for more information on the training configs.
- We now support model training on your own datasets. See configs/README.md for details.
Pretrained Models
Training video models is computationally expensive, so we provide some of our pretrained models here. A larger set of trained models is available in the ActionCLIP MODEL_ZOO.md.
Kinetics-400
We evaluate ActionCLIP with different backbones and input-frame configurations on Kinetics-400 (we choose Transf as our final visual prompt since it obtains the best results). Here is a list of the pre-trained models that we provide (see Table 6 of the paper).
| model | n-frame | top-1 Acc (single-crop) | top-5 Acc (single-crop) | checkpoint |
| --- | --- | --- | --- | --- |
| ViT-B/32 | 8 | 78.36% | 94.25% | link (pwd: 8hg2) |
| ViT-B/16 | 8 | 81.09% | 95.49% | link |
| ViT-B/16 | 16 | 81.68% | 95.87% | link |
| ViT-B/16 | 32 | 82.32% | 96.20% | link (pwd: v7nn) |
HMDB51 & UCF101
On the HMDB51 and UCF101 datasets, accuracy (with Kinetics-400 pretraining) is reported under the accurate setting.
HMDB51
| model | n-frame | top-1 Acc (single-crop) | checkpoint |
| --- | --- | --- | --- |
| ViT-B/16 | 32 | 76.2% | link |
UCF101
| model | n-frame | top-1 Acc (single-crop) | checkpoint |
| --- | --- | --- | --- |
| ViT-B/16 | 32 | 97.1% | link |
Testing
To test the downloaded pretrained models on Kinetics-400, HMDB51 or UCF101, you can run scripts/run_test.sh. For example:
# test
bash scripts/run_test.sh ./configs/k400/k400_ft_tem.yaml
Zero-shot
We provide several examples of zero-shot validation on Kinetics-400, UCF101 and HMDB51.
- To do zero-shot validation on Kinetics from CLIP pretrained models, you can run:
# zero-shot
bash scripts/run_test.sh ./configs/k400/k400_ft_zero_shot.yaml
- To do zero-shot validation on UCF101 and HMDB51 from Kinetics-400 pretrained models, first prepare the Kinetics-400 pretrained checkpoint and then run:
# zero-shot
bash scripts/run_test.sh ./configs/hmdb51/hmdb_ft_zero_shot.yaml
Training
We provide several examples of training ActionCLIP with this repo:
- To train on Kinetics from CLIP pretrained models, you can run:
# train
bash scripts/run_train.sh ./configs/k400/k400_ft_tem_test.yaml
- To train on HMDB51 from Kinetics-400 pretrained models, you can run:
# train
bash scripts/run_train.sh ./configs/hmdb51/hmdb_ft.yaml
- To train on UCF101 from Kinetics-400 pretrained models, you can run:
# train
bash scripts/run_train.sh ./configs/ucf101/ucf_ft.yaml
You can find more training details in configs/README.md.
Contributors
ActionCLIP is written and maintained by Mengmeng Wang and Jiazheng Xing.
Citing ActionCLIP
If you find ActionCLIP useful in your research, please cite it using the following BibTeX entry.
@inproceedings{wang2022ActionCLIP,
  title={ActionCLIP: A New Paradigm for Video Action Recognition},
  author={Mengmeng Wang and Jiazheng Xing and Yong Liu},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2021}
}