Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)


An Image is Worth 16x16 Words, What is a Video Worth?



Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group


Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore, our approach is very input efficient, and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation, and latency. Specifically, on Kinetics-400 we reach 78.8 top-1 accuracy with 30× fewer frames per video and 40× faster inference than the current leading method.

Main Article Results

STAM models accuracy and GPU throughput on Kinetics400, compared to X3D. All measurements were done on Nvidia V100 GPU, with mixed precision. All models are trained on input resolution of 224.

Models     Top-1 Accuracy   Flops × views   # Input Frames   Runtime
X3D-M      76.0             6.2 × 30        480              1.3
X3D-L      77.5             24.8 × 30       480              0.46
X3D-XL     79.1             48.4 × 30       480              N/A
STAM-16    77.8             270 × 1         16               20.0
STAM-64    79.2             1080 × 1        64               4.8
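
The frame and speed ratios quoted in the abstract can be checked against the table with a few lines of arithmetic. The sketch below is only an illustration: the values are copied by hand from the table, X3D-L is picked as the baseline for the example (the abstract does not name the baseline), and the Runtime column is assumed to be a throughput, where higher means faster, as the caption's "GPU throughput" suggests. The dictionary layout and function names are hypothetical.

```python
# Values copied by hand from the results table.
# Assumption: "runtime" is a throughput, so a larger number means faster inference.
table = {
    "X3D-M":   {"flops": 6.2,    "views": 30, "frames": 480, "runtime": 1.3},
    "X3D-L":   {"flops": 24.8,   "views": 30, "frames": 480, "runtime": 0.46},
    "STAM-16": {"flops": 270.0,  "views": 1,  "frames": 16,  "runtime": 20.0},
    "STAM-64": {"flops": 1080.0, "views": 1,  "frames": 64,  "runtime": 4.8},
}


def total_flops(name):
    """Per-video cost: per-view FLOPs multiplied by the number of views."""
    row = table[name]
    return row["flops"] * row["views"]


def frame_reduction(baseline, model):
    """How many times fewer input frames `model` reads than `baseline`."""
    return table[baseline]["frames"] / table[model]["frames"]


def speedup(baseline, model):
    """Throughput ratio of `model` over `baseline`."""
    return table[model]["runtime"] / table[baseline]["runtime"]


if __name__ == "__main__":
    print("frame reduction:", frame_reduction("X3D-L", "STAM-16"))
    print("speedup:", round(speedup("X3D-L", "STAM-16"), 1))
    print("total flops X3D-L vs STAM-16:", total_flops("X3D-L"), total_flops("STAM-16"))
```

With X3D-L as the baseline, the frame ratio is 480 / 16 = 30 and the throughput ratio is 20.0 / 0.46, a little over 43, which is in line with the "30×" and "40×" figures in the abstract.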

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

Model name checkpoint
STAM_16 link
STAM_32 link
STAM_64 link

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth
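
For readers unfamiliar with the metric, top-1 accuracy is the percentage of validation videos whose highest-scoring class matches the ground-truth label. A minimal sketch follows; the function name and the list-of-scores input format are illustrative assumptions, not the interface of the infer script.

```python
def top1_accuracy(scores, labels):
    """Percentage of samples whose arg-max class equals the label.

    scores: list of per-class score lists, one per video (hypothetical format).
    labels: list of integer class indices, same length as scores.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not scores:
        raise ValueError("need at least one sample")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score in this row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)
```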


Citation

    title   = {An Image is Worth 16x16 Words, What is a Video Worth?},
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}


We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from the excellent repository of Ross Wightman. Check it out and give it a star while you are at it.

  • Pretrain weights from the ImageNet


    Thanks for sharing this amazing work. I am wondering where I can get the ImageNet pretrained weights, since the paper says that training on Kinetics starts from ImageNet pretrained weights. I would like to retrain the model.

    opened by villawang 2
  • How are the dimensions aligned during temporal aggregation?


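    When x is reshaped on line 39 of temporal_aggregation.py, my understanding is that it should go from B,N,C to nvids, self.clip_length, NC. Why can the last dimension NC still match embed_dim in TransformerEncoderLayer? As far as I can tell, after the embedding on line 179 of transformer_model.py the input x already has shape B,N,C and keeps it all the way to the temporal aggregation module. I really cannot follow this part of the code. If the authors see this, I hope they can offer some explanation. Thanks.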

    opened by unclebuff 2
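
The shape question above comes down to what the aggregation module receives. The sketch below only does shape bookkeeping for a view of the form (nvids, clip_length, -1); it does not inspect the repository. Under the assumption (not confirmed on this page) that the spatial encoder passes on one C-dimensional vector per frame, for example its class token, the last dimension stays C and matches embed_dim. If all N tokens per frame were passed on, the last dimension would be N*C instead. The helper name and the example sizes are hypothetical.

```python
from math import prod


def shape_after_view(input_shape, clip_length):
    """Shape produced by x.view((nvids, clip_length, -1)),
    where nvids = input_shape[0] // clip_length."""
    batch = input_shape[0]
    if batch % clip_length != 0:
        raise ValueError("batch size must be a multiple of clip_length")
    nvids = batch // clip_length
    # The -1 dimension absorbs whatever elements are left per (video, frame) slot.
    last = prod(input_shape) // (nvids * clip_length)
    return (nvids, clip_length, last)


if __name__ == "__main__":
    # Assumed case: one 768-dim vector per frame, 2 videos of 16 frames each.
    print(shape_after_view((32, 768), clip_length=16))
    # Hypothetical case: all 197 tokens per frame are kept.
    print(shape_after_view((32, 197, 768), clip_length=16))
```

In the first case the result is (2, 16, 768), so each video becomes a sequence of 16 frame vectors. In the second case the last dimension grows to 197 * 768, which would no longer match a 768-wide encoder layer.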
  • 39.8% of the validation data is not used for performance test

    Hi researchers. Great work on getting rid of multi-view inference. I ran into a problem in my experiments. Many recent methods use non-local copies of the Kinetics-400 dataset, since more and more YouTube videos are unavailable. When I load clips from the validation set of a non-local copy with the torchvision.datasets.Kinetics400 API (in src/utils/), around 39% of the validation data is discarded. In my experiment, the top-1 accuracy matches the reported STAM_16 number, but fewer videos are used: printing valid_data.len() shows around 11897 clips when using non-local copies (19761 in total). I believe STAM uses one clip per video, as the paper describes, so it seems that the torchvision.datasets.Kinetics400 API discards some videos because of its parameter settings. I also changed extensions('avi', 'mp4') to extensions('avi','mp4','mkv','webm') to cover all formats, but 11.5% is still discarded. Could you explain more about your experiment settings, such as the dataset source (official Kinetics download links or non-local copies), the number of samples in the validation set, and the list of validation file names, or make your validation data public if convenient? Thank you.

    opened by FaceAnalysis 2
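
One way to see how much of a local validation copy a loader can reach is to count files by extension before building the dataset. The sketch below uses only the standard library; the function names are hypothetical, and the extension tuples in the usage note mirror the two settings mentioned in the question above. It does not reproduce the dataset class's own filtering, which may drop further videos for other reasons.

```python
from collections import Counter
from pathlib import Path


def count_by_extension(root):
    """Count files under `root` (recursively) by lower-cased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower().lstrip(".")] += 1
    return counts


def coverage(counts, extensions):
    """Fraction of counted files whose extension is in `extensions`."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    kept = sum(counts.get(ext, 0) for ext in extensions)
    return kept / total
```

Calling coverage(counts, ("avi", "mp4")) and then coverage(counts, ("avi", "mp4", "mkv", "webm")) on the same counts shows how much of the gap is explained by file format alone.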
  • Linear Projection

    Dear researchers,

    Thank you for this great work!

    I am confused about the linear projection. The paper says, "We design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain", so I was expecting no Conv block in the implementation. But I see a Conv2D in the linear projection.

    Can you provide some explanation on this?

    opened by ShuvenduRoy 2
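
A convolution whose kernel size equals its stride touches each patch exactly once and never overlaps patches, so it computes the same result as flattening each patch and applying one shared linear layer. Many ViT-style implementations write the patch projection this way; whether this repository's Conv2D uses a kernel size equal to its stride is an assumption here. The sketch below checks the equivalence on a tiny single-channel image in plain Python, with hypothetical function names and a single output channel (an embedding of width D would use D such kernels).

```python
def strided_conv2d(image, kernel, stride):
    """Single-channel 2D cross-correlation with no padding."""
    k = len(kernel)
    out = []
    for top in range(0, len(image) - k + 1, stride):
        row = []
        for left in range(0, len(image[0]) - k + 1, stride):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    acc += image[top + i][left + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out


def patch_linear(image, weights, patch):
    """Flatten each non-overlapping patch and take a dot product with `weights`."""
    out = []
    for top in range(0, len(image), patch):
        row = []
        for left in range(0, len(image[0]), patch):
            flat = [image[top + i][left + j]
                    for i in range(patch) for j in range(patch)]
            row.append(sum(p * w for p, w in zip(flat, weights)))
        out.append(row)
    return out


# 4x4 image split into four 2x2 patches.
image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
kernel = [[1.0, 2.0], [3.0, 4.0]]
flat_weights = [w for kernel_row in kernel for w in kernel_row]

conv_out = strided_conv2d(image, kernel, stride=2)
linear_out = patch_linear(image, flat_weights, patch=2)
print(conv_out == linear_out)
```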
  • Training code

    Dear researchers,

    Thank you for this very nice piece of work.

    Can you also provide the code you used for training?

    Without it, it is impossible to reproduce your results, and validate your conclusion.

    Best regards,

    opened by OValery16 2
  • from .layers.drop import DropPath

    This is great work and I want to read the code. But I am a rookie with PyTorch, and I can't find the DropPath, to_2tuple, and register_model modules. Can you tell me where to find them? Thanks a lot!

    opened by zkx-sust 2
  • some training hyperparameters about kinetics400

    I would like to know the hyperparameters passed to Kinetics400(root=source, step_between_clips=args.step_between_clips, frames_per_clip=args.frames_per_clip, frame_rate=args.frame_rate)

    opened by lwdoubles 1
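
The three arguments in that call control how clips are cut from each video. As a rough model (an assumption about clip-based loaders in general, not a transcription of torchvision's implementation), a video is first resampled to frame_rate, and then windows of frames_per_clip frames are taken every step_between_clips frames. The sketch below counts those windows. A video shorter than one window yields zero clips, which is one way videos can silently drop out of a validation run. The function name and the num_frames and native_fps parameters are hypothetical.

```python
import math


def num_clips(num_frames, native_fps, frames_per_clip,
              step_between_clips, frame_rate=None):
    """Count sliding windows of `frames_per_clip` frames in one video."""
    if frame_rate is None:
        frame_rate = native_fps  # no resampling requested
    # Number of frames left after resampling to the requested frame rate.
    resampled = math.floor(num_frames * frame_rate / native_fps)
    if resampled < frames_per_clip:
        return 0  # too short for even one clip
    return (resampled - frames_per_clip) // step_between_clips + 1


if __name__ == "__main__":
    print(num_clips(300, 30, frames_per_clip=16, step_between_clips=16))
    print(num_clips(300, 30, frames_per_clip=16, step_between_clips=16, frame_rate=15))
    print(num_clips(10, 30, frames_per_clip=16, step_between_clips=16))
```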
  • About TAggregate

    Thank you for your great work! Q1: I can't understand nvids in the code below. Does nvids represent the batch size? Q2: What value should pos_drop be set to?

      def forward(self, x):
        nvids = x.shape[0] // self.clip_length
        x = x.view((nvids, self.clip_length, -1))
        cls_tokens = self.cls_token.expand(nvids, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.pos_embed
        # x = self.pos_drop(x)
        # x = x.view((self.clip_length, nvids, -1))
        o = self.transformer_enc(x)
        # o = o.mean(dim=0)
        return o[0]
    opened by TitaniumOne 1
  • Training hyperparameters?

    Quite promising work, which shows the great potential of Video Transformers. Looking forward to the training code and details about the training hyperparameters!

    opened by jianghaojun 1
  • How to train?

    Hi, thanks for this implementation. I still have a question. The parameters in the model are initialized as follows:

      def _init_weights(self, m):
        if isinstance(m, nn.Linear):
          with torch.no_grad():
            trunc_normal_(m.weight, std=.02)
          if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
          nn.init.constant_(m.bias, 0)
          nn.init.constant_(m.weight, 1.0)

    Does this mean I should make these parameters trainable when I train this model on my own dataset?

    opened by TianshengSun 1
  • Performance with low hyper parameters

    Thanks for your great work!

    I'm trying to reproduce the results with your code now. However, due to limited computational resources, we can only reach a batch size of 32 for the STAM 16 network. Do you have any idea how well this network can perform under this setting?


    opened by chenyangjamie 0
  • train model

    Hello, your work is great, and thanks for sharing the code. Could you share training-related information, or the model trained on UCF101? I am very interested in this. Thank you very much!

    opened by yeboqxc 0
  • Why did you use a pytorch built-in TransformerEncoder in TAggregate module?

    Are there any differences between nn.TransformerEncoder and class Block in

    Have you ever tried to use class Block instead of nn.TransformerEncoder in aggregate module just like what you do in spatial dimension?

    I appreciate the brilliant model you have created, but I am still confused about these questions. I would appreciate it if you could reply.

    opened by yojayc 0
  • Could you please share training hyper-parameters?

    This work is really inspiring, and thanks for sharing the code. Meanwhile, could you please also share the training hyper-parameters (e.g., learning rate, optimizer, warmup lr, warmup epochs, etc.)? I would really like to train the model to get a deeper understanding of the model.

    Thanks, Steve

    opened by stevehuanghe 2