Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

Last update: Nov 12, 2022

Overview

An Image is Worth 16x16 Words, What is a Video Worth?

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input efficient, and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically on Kinetics-400, we reach 78.8 top-1 accuracy with ×30 fewer frames per video, and ×40 faster inference than the current leading method.

Main Article Results

Accuracy and GPU throughput of STAM models on Kinetics400, compared to X3D. All measurements were taken on an Nvidia V100 GPU with mixed precision. All models are trained at an input resolution of 224.

Models	Top-1 Accuracy (%)	Flops × views (10^9)	# Input Frames	Runtime (Videos/sec)
X3D-M	76.0	6.2 × 30	480	1.3
X3D-L	77.5	24.8 × 30	480	0.46
X3D-XL	79.1	48.4 × 30	480	N/A
STAM-16	77.8	270 × 1	16	20.0
STAM-64	79.2	1080 × 1	64	4.8
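
The per-video cost implied by the table can be worked out directly: multiplying the FLOPs per view by the number of views gives the total compute, and the frame and throughput ratios follow from the remaining columns. The sketch below only re-derives these ratios from the numbers in the table; the dictionary layout and the names `results`, `total_gflops`, `frame_ratio` and `speedup` are illustrative and not part of the repository. X3D-XL is left out because its runtime is listed as N/A.

```python
# Values copied from the results table: GFLOPs per view, number of views,
# input frames per video, and measured throughput in videos per second.
results = {
    "X3D-M":   {"gflops": 6.2,    "views": 30, "frames": 480, "videos_per_sec": 1.3},
    "X3D-L":   {"gflops": 24.8,   "views": 30, "frames": 480, "videos_per_sec": 0.46},
    "STAM-16": {"gflops": 270.0,  "views": 1,  "frames": 16,  "videos_per_sec": 20.0},
    "STAM-64": {"gflops": 1080.0, "views": 1,  "frames": 64,  "videos_per_sec": 4.8},
}


def total_gflops(name):
    """Total compute per video: FLOPs per view times the number of views."""
    row = results[name]
    return row["gflops"] * row["views"]


# How many times fewer frames STAM-16 reads than X3D-L.
frame_ratio = results["X3D-L"]["frames"] / results["STAM-16"]["frames"]

# How many times more videos per second STAM-16 processes than X3D-L.
speedup = results["STAM-16"]["videos_per_sec"] / results["X3D-L"]["videos_per_sec"]

for name in results:
    print(f"{name}: {total_gflops(name):.0f} GFLOPs per video")
print(f"frame ratio X3D-L / STAM-16: {frame_ratio:.0f}x")
print(f"throughput ratio STAM-16 / X3D-L: {speedup:.1f}x")
```

With these numbers, STAM-16 reads 480 / 16 = 30 times fewer frames than the X3D models and runs roughly 43 times faster than X3D-L. This is broadly in line with the ×30 and ×40 figures quoted in the abstract, although the abstract does not say which baseline those figures refer to.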

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

Model name	checkpoint
STAM_16	link
STAM_32	link
STAM_64	link

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16 \
--input_size=224
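
The score being reproduced is validation top-1 accuracy. As a reminder of what that metric measures, here is a minimal, framework-free sketch. It is not taken from infer.py; the function name `top1_accuracy` and the list-of-scores input format are assumptions made for illustration.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose highest-scoring class equals the label.

    scores: one list of per-class scores per video (hypothetical format).
    labels: one integer class index per video.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row (first one wins on ties).
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


example_scores = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]]
example_labels = [1, 0, 0]
print(f"top-1: {top1_accuracy(example_scores, example_labels):.1f}%")
```

Because STAM uses a single view per video, each validation video contributes exactly one row of scores in this picture; multi-view methods would first average the scores of all views of a video.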

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.

Comments

Pretrain weights from the ImageNet

Hi,

Thanks for sharing this amazing work. I am just wondering where I can get the ImageNet pretrained weights as I see that Kinetics uses pretrained ImageNet weights for training in the paper. I would like to retrain the model.

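opened by villawang 2

How are the dimensions aligned during temporal aggregation?

When x is reshaped at line 39 of temporal_aggregation.py, my understanding is that it should go from B,N,C to nvids, self.clip_length, NC. How can the last dimension NC then still match the embed_dim of the TransformerEncoderLayer? As far as I can tell, after the embedding at line 179 of transformer_model.py the input x already has shape B,N,C, and it keeps that shape all the way to the temporal aggregation module. I really could not follow this part of the code, and I hope the authors can offer an explanation if they see this. Thanks.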
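
One way to reason about this question is to work out what the -1 in the view call resolves to for a given input shape. The sketch below does only that arithmetic. The helper name `view_last_dim` and the example sizes (197 tokens, embedding dimension 768, 2 videos of 16 frames) are illustrative assumptions, and which of the two cases actually applies in the repository is not confirmed here.

```python
from math import prod


def view_last_dim(input_shape, nvids, clip_length):
    """Size that -1 resolves to in x.view((nvids, clip_length, -1))."""
    total = prod(input_shape)
    leading = nvids * clip_length
    if leading <= 0 or total % leading != 0:
        raise ValueError("shape is not compatible with the requested view")
    return total // leading


nvids, clip_length = 2, 16      # hypothetical batch: 2 videos, 16 frames each
num_tokens, embed_dim = 197, 768  # hypothetical per-frame token count and width

# Case A: every frame still carries all N tokens, shape (B, N, C).
case_a = view_last_dim((nvids * clip_length, num_tokens, embed_dim), nvids, clip_length)

# Case B: every frame has been reduced to one vector, shape (B, C).
case_b = view_last_dim((nvids * clip_length, embed_dim), nvids, clip_length)

print("all tokens kept:      last dim =", case_a)
print("one vector per frame: last dim =", case_b)
```

In case A the last dimension becomes N times C, which would not match embed_dim, as the question points out. In case B it is exactly C. So the reshape is consistent with the encoder layer only if a single vector per frame (for example a class token) reaches the aggregation module.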

opened by unclebuff 2

39.8% of the validation data is not used for performance testing

Hi researchers. Great work on getting rid of multi-view inference. I ran into a problem in my experiments. Many recent methods use non-local copies of the Kinetics-400 dataset, since more and more YouTube videos are becoming unavailable. When I use the validation set of a non-local copy together with the torchvision.datasets.Kinetics400 API (in src/utils/utils.py) to load clips, around 39% of the validation data is discarded. In my experiment, the top-1 accuracy matches the reported STAM_16 result, but less data is used. Printing valid_data.len() in utils.py shows around 11897 clips when using a non-local copy (19761 total). I believe STAM uses one clip per video, as the paper describes. It seems that the torchvision.datasets.Kinetics400 API discards some videos because of its parameter settings. I also changed extensions('avi', 'mp4') to extensions('avi','mp4','mkv','webm') to cover all formats, but 11.5% is still discarded. Could you explain more about your experimental settings, such as the dataset source (official Kinetics download links or non-local copies), the number of samples in the validation set, and the list of validation file names, or make your validation data public if convenient? Thank you.
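
The percentage in the title follows from the two counts quoted above, and a quick tally of file extensions is an easy first check for files skipped by an extension filter. The sketch below shows both. The helper names and the example file names are hypothetical and not part of the repository.

```python
from collections import Counter
from pathlib import PurePath


def discarded_fraction(kept, total):
    """Fraction of samples that were not loaded."""
    if total <= 0 or not 0 <= kept <= total:
        raise ValueError("expected 0 <= kept <= total and total > 0")
    return 1.0 - kept / total


def count_by_extension(file_names):
    """Tally lower-cased file extensions, e.g. {'.mp4': 2, '.webm': 1}."""
    return Counter(PurePath(name).suffix.lower() for name in file_names)


# Counts quoted in the report: 11897 clips loaded out of 19761 videos.
pct = 100 * discarded_fraction(11897, 19761)
print(f"discarded: {pct:.1f}%")

# Hypothetical file names, only to show the tally.
print(count_by_extension(["a.mp4", "b.MP4", "c.webm", "d.mkv"]))
```

The first print gives 39.8%, matching the title. Comparing the extension tally of the validation folder with the extensions passed to the dataset class shows how many files the filter alone excludes; any remaining gap has to come from another cause, such as videos that fail to decode or are too short to yield a clip.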

opened by FaceAnalysis 2

Linear Projection

Dear researchers,

Thank you for this great work!

I am confused about the linear projection. According to the paper, "We design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain." So I was expecting no Conv block in the implementation, but I see a Conv2D in the linear projection: https://github.com/Alibaba-MIIL/STAM/blob/master/src/models/transformer_model.py#L224

Can you provide some explanation on this?

opened by ShuvenduRoy 2

Training code

Dear researchers,

Thank you for this very nice piece of work.

Can you also provide the code you used for training?

Without it, it is impossible to reproduce your results, and validate your conclusion.

Best regards,

opened by OValery16 2

from .layers.drop import DropPath

This is great work and I want to read the code. But I am a rookie with PyTorch, and I can't find the DropPath, to_2tuple, and register_model modules. Can you tell me where to find them? Thanks a lot!
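
A relative import such as `from .layers.drop import DropPath` resolves to a layers/drop.py module (or a layers/drop package) inside the same package as the importing file. Helpers with these names are also found in Ross Wightman's pytorch-image-models (timm) package, the repository credited in the acknowledgements, so that is a likely place to look; whether this repository ships its own copies is not confirmed here. For orientation, a helper like to_2tuple usually just turns a scalar into a pair. The function below is an illustrative stand-in written for this note, not the repository's code.

```python
from collections.abc import Iterable


def to_2tuple(x):
    """Duplicate a scalar into a pair; convert an iterable to a tuple as-is."""
    if isinstance(x, Iterable) and not isinstance(x, str):
        return tuple(x)
    return (x, x)


# A patch size given as a single int becomes (height, width).
print(to_2tuple(16))
# A size that is already a pair is passed through unchanged.
print(to_2tuple((224, 224)))
```

This kind of helper lets a model accept either `img_size=224` or `img_size=(224, 224)` and handle both the same way afterwards.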

opened by zkx-sust 2

Training hyperparameters for Kinetics400

I would like to know the hyperparameter values passed to Kinetics400(root=source, step_between_clips=args.step_between_clips, frames_per_clip=args.frames_per_clip, frame_rate=args.frame_rate).
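
The values themselves are not given in this document, but it may help to see how the three parameters interact. The sketch below is a simplified sliding-window model: the video is first resampled to frame_rate, then windows of frames_per_clip frames are taken every step_between_clips frames. It is an assumption about the general behaviour of such clip samplers, not code copied from torchvision, and the function names and example numbers are hypothetical.

```python
import math


def resampled_frame_count(num_frames, original_fps, frame_rate):
    """Approximate number of frames left after resampling to frame_rate."""
    if original_fps <= 0 or frame_rate <= 0:
        raise ValueError("frame rates must be positive")
    return int(math.floor(num_frames * frame_rate / original_fps))


def num_clips(num_frames, frames_per_clip, step_between_clips):
    """Number of full sliding windows that fit into num_frames frames."""
    if frames_per_clip <= 0 or step_between_clips <= 0:
        raise ValueError("frames_per_clip and step_between_clips must be positive")
    if num_frames < frames_per_clip:
        return 0  # too short: the video contributes no clip at all
    return (num_frames - frames_per_clip) // step_between_clips + 1


# Hypothetical 10-second video at 30 fps, resampled to 10 fps.
frames = resampled_frame_count(300, original_fps=30, frame_rate=10)
print("frames after resampling:", frames)
print("clips of 16 frames, step 16:", num_clips(frames, 16, 16))
print("clips from a 10-frame video:", num_clips(10, 16, 1))
```

Under this model, a video whose resampled length is shorter than frames_per_clip yields zero clips, which is one way a dataset can end up with fewer samples than there are video files.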

opened by lwdoubles 1

About TAggregate

Thank you for your great work! Q1: I can't understand nvids in the code below. Does nvids represent the batch size? Q2: What value should pos_drop be set to?

  def forward(self, x):
    nvids = x.shape[0] // self.clip_length
    x = x.view((nvids, self.clip_length, -1))

    cls_tokens = self.cls_token.expand(nvids, -1, -1)
    x = torch.cat((cls_tokens, x), dim=1)
    x = x + self.pos_embed
    # x = self.pos_drop(x)

    x.transpose_(1,0)
    # x = x.view((self.clip_length, nvids, -1))
    o = self.transformer_enc(x)
    # o = o.mean(dim=0)

    return o[0]
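
On Q1, the shapes in the snippet above can be traced by hand. The sketch below does this with plain tuples, under two assumptions that are not confirmed by the source: the input holds one feature vector per frame, with the frames of each video stored contiguously, and the encoder preserves the shape of its input. The function name `aggregate_shapes` and the example sizes are illustrative.

```python
def aggregate_shapes(num_rows, feature_dim, clip_length):
    """Trace tensor shapes through the forward pass quoted above."""
    if clip_length <= 0 or num_rows % clip_length != 0:
        raise ValueError("num_rows must be a positive multiple of clip_length")

    nvids = num_rows // clip_length  # number of videos in the batch
    return {
        "nvids": nvids,
        # x.view((nvids, clip_length, -1))
        "after_view": (nvids, clip_length, feature_dim),
        # torch.cat((cls_tokens, x), dim=1) adds one token per video
        "after_cls": (nvids, clip_length + 1, feature_dim),
        # x.transpose_(1, 0) puts the sequence dimension first
        "after_transpose": (clip_length + 1, nvids, feature_dim),
        # o[0] selects the first sequence position, i.e. the class token
        "output": (nvids, feature_dim),
    }


# Hypothetical batch: 4 videos of 16 frames, 768 features per frame.
for name, shape in aggregate_shapes(64, 768, 16).items():
    print(name, shape)
```

Read this way, nvids is the number of videos in the batch (the row count divided by the frames per video), and the returned value is one aggregated vector per video.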

opened by TitaniumOne 1

Training hyperparameters?

Quite promising work, which shows the great potential of video transformers. Looking forward to the training code and details about the training hyperparameters!

opened by jianghaojun 1

How to train?

Hi, thanks for this implementation, but I still have a question. The parameters in the model are initialized as follows:

  def _init_weights(self, m):
    if isinstance(m, nn.Linear):
      with torch.no_grad():
        trunc_normal_(m.weight, std=.02)
      if isinstance(m, nn.Linear) and m.bias is not None:
        nn.init.constant_(m.bias, 0)
    elif isinstance(m, nn.LayerNorm):
      nn.init.constant_(m.bias, 0)
      nn.init.constant_(m.weight, 1.0)

Does this mean I should make these parameters trainable when I train this model on my own dataset?

opened by TianshengSun 1

Performance with low hyper parameters

Hi,

Thanks for your great work!

I'm trying to reproduce the results with your code. However, due to limited computational resources, we can only reach batch_size 32 for the STAM 16 network. Do you have any idea how well this network can perform under this setting?

Thanks.

opened by chenyangjamie 0

train model

Hello, your work is great, and thanks for sharing the code. Could you share training-related information, or a model trained on UCF101? I am very interested in this. Thank you very much!

opened by yeboqxc 0

Why did you use PyTorch's built-in TransformerEncoder in the TAggregate module?

Are there any differences between nn.TransformerEncoder and class Block in transformer_model.py?

Have you tried using class Block instead of nn.TransformerEncoder in the aggregate module, as you do in the spatial dimension?

I appreciate the brilliant model you have created, but I am still confused about these questions. I would appreciate it if you could reply.

opened by yojayc 0

Could you please share training hyper-parameters?

Hello,

This work is really inspiring, and thanks for sharing the code. Meanwhile, could you please also share the training hyper-parameters (e.g., learning rate, optimizer, warmup lr, warmup epochs, etc.)? I would really like to train the model to get a deeper understanding of the model.

Thanks, Steve

opened by stevehuanghe 2

An official PyTorch implementation of the paper "Learning by Aligning: Visible-Infrared Person Re-identification using Cross-Modal Correspondences", ICCV 2021.

The project is an official implementation of our paper "3D Human Pose Estimation with Spatial and Temporal Transformers".

This is the official PyTorch implementation of the paper "TransFG: A Transformer Architecture for Fine-grained Recognition" (Ju He, Jie-Neng Chen, Shuai Liu, Adam Kortylewski, Cheng Yang, Yutong Bai, Changhu Wang, Alan Yuille).

Official implementation of Character Region Awareness for Text Detection (CRAFT)

Official PyTorch implementation for "Mixed supervision for surface-defect detection: from weakly to fully supervised learning"

[BMVC'21] Official PyTorch Implementation of Grounded Situation Recognition with Transformers

Code for AAAI 2021 paper: Sequential End-to-end Network for Efficient Person Search

SceneCollisionNet This repo contains the code for "Object Rearrangement Using Learned Implicit Collision Functions", an ICRA 2021 paper. For more info

Code for the head detector (HeadHunter) proposed in our CVPR 2021 paper Tracking Pedestrian Heads in Dense Crowd.

code for our ICCV 2021 paper "DeepCAD: A Deep Generative Network for Computer-Aided Design Models"

Dataset and Code for ICCV 2021 paper "Real-world Video Super-resolution: A Benchmark Dataset and A Decomposition based Learning Scheme"

Source code of our TPAMI'21 paper Dual Encoding for Video Retrieval by Text and CVPR'19 paper Dual Encoding for Zero-Example Video Retrieval.

An implementation of the algorithm in the paper IncepText: A New Inception-Text Module with Deformable PSROI Pooling for Multi-Oriented Scene Text Detection

A PyTorch implementation of ECCV2018 Paper: TextSnake: A Flexible Representation for Detecting Text of Arbitrary Shapes

Implementation of our paper 'PixelLink: Detecting Scene Text via Instance Segmentation' in AAAI2018

An implementation of the SegLink algorithm in the paper Detecting Oriented Text in Natural Images by Linking Segments

This is the implementation of the paper "Gated Recurrent Convolution Neural Network for OCR"

An unofficial implementation of the paper "AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss".

This is the open source implementation of the ICLR2022 paper "StyleNeRF: A Style-based 3D-Aware Generator for High-resolution Image Synthesis"