Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)

Overview

An Image is Worth 16x16 Words, What is a Video Worth?

paper

Official PyTorch Implementation

Gilad Sharir, Asaf Noy, Lihi Zelnik-Manor
DAMO Academy, Alibaba Group

Abstract

Leading methods in the domain of action recognition try to distill information from both the spatial and temporal dimensions of an input video. Methods that reach State of the Art (SotA) accuracy usually make use of 3D convolution layers as a way to abstract the temporal information from video frames. The use of such convolutions requires sampling short clips from the input video, where each clip is a collection of closely sampled frames. Since each short clip covers a small fraction of an input video, multiple clips are sampled at inference in order to cover the whole temporal length of the video. This leads to increased computational load and is impractical for real-world applications. We address the computational bottleneck by significantly reducing the number of frames required for inference. Our approach relies on a temporal transformer that applies global attention over video frames, and thus better exploits the salient information in each frame. Therefore our approach is very input-efficient and can achieve SotA results (on the Kinetics dataset) with a fraction of the data (frames per video), computation and latency. Specifically, on Kinetics-400 we reach 78.8 top-1 accuracy with ×30 fewer frames per video and ×40 faster inference than the current leading method.
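
A conceptual sketch of the idea described above: a spatial transformer embeds each frame independently, and a temporal transformer then applies global attention over the resulting per-frame embeddings. This is not the repository's exact model; the class name, dimensions, and defaults below are illustrative assumptions.

    # Conceptual sketch only -- module name and defaults are assumptions, not the repo's code.
    import torch
    import torch.nn as nn

    class GlobalTemporalAggregator(nn.Module):
        """Global self-attention over per-frame embeddings."""
        def __init__(self, embed_dim=768, depth=6, num_heads=8, num_frames=16):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            self.pos_embed = nn.Parameter(torch.zeros(1, num_frames + 1, embed_dim))

        def forward(self, frame_embeddings):        # (B, T, C): one embedding per frame
            cls = self.cls_token.expand(frame_embeddings.size(0), -1, -1)
            x = torch.cat((cls, frame_embeddings), dim=1) + self.pos_embed
            x = self.encoder(x.transpose(0, 1))     # (T+1, B, C): sequence-first for nn.Transformer
            return x[0]                             # video-level CLS embedding, (B, C)

In this sketch the per-frame embeddings would come from an image transformer (one CLS token per frame), and a linear classifier on the returned vector would produce the video-level prediction.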

Update 2/5/2021: Improved results

Thanks to improved training hyperparameters and the use of knowledge distillation (KD) during training, we were able to improve the STAM results on Kinetics400 (+ ~1.5%). We are releasing the pretrained weights of the improved models (see Pretrained Models below).
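
For reference, a minimal sketch of the kind of knowledge-distillation loss referred to above is shown below; the temperature and weighting values are assumptions, not the authors' published training recipe.

    # Minimal KD loss sketch -- temperature T and alpha are illustrative assumptions.
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=1),
            F.softmax(teacher_logits / T, dim=1),
            reduction="batchmean",
        ) * (T * T)                          # rescale after temperature softening
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard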

Main Article Results

STAM model accuracy and GPU throughput on Kinetics400, compared to X3D and other video transformers. All measurements were done on an Nvidia V100 GPU with mixed precision. All models are trained at an input resolution of 224.

| Model | Top-1 Accuracy (%) | FLOPs × views (10^9) | # Input Frames | Runtime (Videos/sec) |
|---------------|------|-----------|-----|------|
| X3D-M         | 76.0 | 6.2 × 30  | 480 | 1.3  |
| X3D-L         | 77.5 | 24.8 × 30 | 480 | 0.46 |
| X3D-XL        | 79.1 | 48.4 × 30 | 480 | N/A  |
| X3D-XXL       | 80.4 | 194 × 30  | 480 | N/A  |
| TimeSformer-L | 80.7 | 2380 × 3  | 288 | N/A  |
| ViViT-L       | 81.3 | 3992 × 12 | 384 | N/A  |
| STAM-16       | 79.3 | 270 × 1   | 16  | 20.0 |
| STAM-64       | 80.5 | 1080 × 1  | 64  | 4.8  |
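
The "Videos/sec" column can be reproduced approximately with a simple mixed-precision timing loop like the sketch below; the batch size, warmup, and iteration counts are assumptions, and model stands for any loaded STAM network. The frames-in-batch input layout follows the reshape used in the repository's temporal aggregation module.

    # Hedged throughput-measurement sketch (not the authors' benchmarking code).
    import time
    import torch

    @torch.no_grad()
    def measure_throughput(model, frames_per_video=16, batch_size=8, iters=50):
        model.eval().cuda()
        # Frames are flattened into the batch dimension: (B * T, 3, 224, 224).
        dummy = torch.randn(batch_size * frames_per_video, 3, 224, 224, device="cuda")
        for _ in range(10):                              # warmup
            with torch.cuda.amp.autocast():
                model(dummy)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            with torch.cuda.amp.autocast():
                model(dummy)
        torch.cuda.synchronize()
        return batch_size * iters / (time.time() - start)   # videos per second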

Pretrained Models

We provide a collection of STAM models pre-trained on Kinetics400.

| Model name | Checkpoint |
|------------|------------|
| STAM_16    | link       |
| STAM_64    | link       |

Reproduce Article Scores

We provide code for reproducing the validation top-1 score of STAM models on Kinetics400. First, download pretrained models from the links above.

Then, run the infer.py script. For example, for stam_16 (input size 224) run:

python -m infer \
--val_dir=/path/to/kinetics_val_folder \
--model_path=/model/path/to/stam_16.pth \
--model_name=stam_16 \
--input_size=224
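
As noted in one of the comments below, the validation clips are loaded through the torchvision.datasets.Kinetics400 API (in src/utils/utils.py). A minimal standalone sketch of such a loader follows; the transform values, clip settings, and batch size are assumptions rather than the repository's verified settings.

    # Standalone Kinetics-400 validation loader sketch (parameters are assumptions).
    import torch
    from torchvision import transforms
    from torchvision.datasets import Kinetics400

    # Kinetics400 yields uint8 video tensors of shape (T, H, W, C).
    val_transform = transforms.Compose([
        transforms.Lambda(lambda v: v.permute(0, 3, 1, 2).float() / 255.0),  # -> (T, C, H, W)
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    val_data = Kinetics400(
        root="/path/to/kinetics_val_folder",
        frames_per_clip=16,                 # assumption: matches STAM-16's 16 input frames
        step_between_clips=1,
        frame_rate=None,
        extensions=("avi", "mp4"),
        transform=val_transform,
    )
    val_loader = torch.utils.data.DataLoader(val_data, batch_size=8, num_workers=8)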

Citations

@misc{sharir2021image,
    title   = {An Image is Worth 16x16 Words, What is a Video Worth?}, 
    author  = {Gilad Sharir and Asaf Noy and Lihi Zelnik-Manor},
    year    = {2021},
    eprint  = {2103.13915},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Acknowledgements

We thank Tal Ridnik for discussions and comments.

Some components of this code implementation are adapted from Ross Wightman's excellent pytorch-image-models (timm) repository. Check it out and give it a star while you are at it.

Comments
  • Pretrained weights from ImageNet

    Hi,

    Thanks for sharing this amazing work. I am just wondering where I can get the ImageNet-pretrained weights, since the paper says the Kinetics models are trained from ImageNet pretraining. I would like to retrain the model.

    opened by villawang 2
  • How are the dimensions aligned during temporal aggregation?

    When x is reshaped at line 39 of temporal_aggregation.py, my understanding is that it goes from (B, N, C) to (nvids, self.clip_length, N*C), so how can the last dimension N*C still match embed_dim of the TransformerEncoderLayer? As far as I can tell, after the embedding at line 179 of transformer_model.py, x already has shape (B, N, C) and keeps it all the way to the temporal aggregation module. I really cannot follow this part of the code; I would appreciate it if the authors could explain when they see this. Thanks.

    opened by unclebuff 2
  • 39.8% of the validation data is not used for the performance test

    Hi researchers. Great work on getting rid of multi-view inference. I ran into some problems in my experiment. Many recent methods run experiments on the non-local copies of the Kinetics-400 dataset, since more and more YouTube videos have become unavailable. When using the validation set of the non-local copies with the torchvision.datasets.Kinetics400 API (in src/utils/utils.py) to load clips, around 39% of the validation data is discarded. In my experiment, the top-1 accuracy matches the STAM_16 number, but fewer samples are used. Printing len(valid_data) in utils.py shows around 11897 clips when using the non-local copies (19761 in total). I believe STAM uses one clip per video, as the paper describes. It seems that the torchvision.datasets.Kinetics400 API discards some videos due to parameter settings. I also changed extensions('avi', 'mp4') to extensions('avi', 'mp4', 'mkv', 'webm') to cover all formats, but 11.5% is still discarded. Could you explain more about your experimental settings, such as the dataset source (official Kinetics download links or non-local copies), the number of samples in the validation set, and the list of validation file names, or make your validation data public if convenient? Thank you.

    opened by FaceAnalysis 2
  • Linear Projection

    Dear researchers,

    Thank you for this great work!

    I have a question about the linear projection. According to the paper, "We design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain", so I was expecting no Conv block in the implementation. But I see a Conv2d in the linear projection: https://github.com/Alibaba-MIIL/STAM/blob/master/src/models/transformer_model.py#L224

    Can you provide some explanation on this?

    opened by ShuvenduRoy 2
  • Training code

    Dear researchers,

    Thank you for this very nice piece of work.

    Could you also provide the code you used for training?

    Without it, it is impossible to reproduce your results and validate your conclusions.

    Best regards,

    opened by OValery16 2
  • from .layers.drop import DropPath

    This is great work and I want to read the code, but I am a rookie at PyTorch and can't find the DropPath, to_2tuple, and register_model modules. Can you tell me where to find them? Thanks a lot!

    opened by zkx-sust 2
  • Some training hyperparameters for Kinetics400

    I would like to know the hyperparameter values you used in Kinetics400(root=source, step_between_clips=args.step_between_clips, frames_per_clip=args.frames_per_clip, frame_rate=args.frame_rate).

    opened by lwdoubles 1
  • About TAggregate

    Thank you for your great work! Q1: I can't understand nvids in the code below. Does it represent the batch size (number of videos)? Q2: What value should pos_drop be set to?

      def forward(self, x):
        nvids = x.shape[0] // self.clip_length
        x = x.view((nvids, self.clip_length, -1))
    
        cls_tokens = self.cls_token.expand(nvids, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.pos_embed
        # x = self.pos_drop(x)
    
        x.transpose_(1,0)
        # x = x.view((self.clip_length, nvids, -1))
        o = self.transformer_enc(x)
        # o = o.mean(dim=0)
    
        return o[0]
    
    opened by TitaniumOne 1
  • Training hyperparameters?

    Quite promising work that shows the great potential of video Transformers. Looking forward to the training code and the details of the training hyperparameters!

    opened by jianghaojun 1
  • Train model

    Hello, your work is great, and thanks for sharing the code. Could you share training-related information, or a model trained on UCF101? I am very interested in this. Thank you very much!

    opened by yeboqxc 0
  • How to train?

    Hi, thanks for this implementation, but I still have some problems. The parameters in the model are initialized as follows:

      def _init_weights(self, m):
        if isinstance(m, nn.Linear):
          with torch.no_grad():
            trunc_normal_(m.weight, std=.02)
          if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
          nn.init.constant_(m.bias, 0)
          nn.init.constant_(m.weight, 1.0)
    

    Does this mean I should make these parameters trainable when I train this model on my own dataset?

    opened by TianshengSun 0
  • Performance with reduced hyperparameters

    Hi,

    Thanks for your great work!

    I'm trying to reproduce the results with your code now. However, due to limited computational resources, we can only reach a batch size of 32 for the STAM-16 network. Do you have any idea how good the performance of this network can be under this setting?

    Thanks.

    opened by chenyangjamie 0
  • Why did you use a PyTorch built-in TransformerEncoder in the TAggregate module?

    Are there any differences between nn.TransformerEncoder and class Block in transformer_model.py?

    Have you ever tried using class Block instead of nn.TransformerEncoder in the aggregation module, just like what you do in the spatial dimension?

    I appreciate the brilliant model you have created, but I am still confused about these questions. I would appreciate it if you could reply.

    opened by yojayc 0
  • Could you please share training hyper-parameters?

    Hello,

    This work is really inspiring, and thanks for sharing the code. Meanwhile, could you please also share the training hyper-parameters (e.g., learning rate, optimizer, warmup lr, warmup epochs, etc.)? I would really like to train the model to gain a deeper understanding of it.

    Thanks, Steve

    opened by stevehuanghe 2