Implementation of TimeSformer, a pure attention-based solution for video classification

Phil Wang

Last update: Jan 3, 2023

Related tags

Overview

TimeSformer - Pytorch

Implementation of TimeSformer, a pure and simple attention-based solution for reaching SOTA on video classification. This repository will only house the best performing variant, 'Divided Space-Time Attention', which is nothing more than attention along the time axis before the spatial.

Install

$ pip install timesformer-pytorch

Usage

import torch
from timesformer_pytorch import TimeSformer

model = TimeSformer(
    dim = 512,
    image_size = 224,
    patch_size = 16,
    num_frames = 8,
    num_classes = 10,
    depth = 12,
    heads = 8,
    dim_head =  64,
    attn_dropout = 0.1,
    ff_dropout = 0.1
)

video = torch.randn(2, 8, 3, 224, 224) # (batch x frames x channels x height x width)
pred = model(video) # (2, 10)

Citations

@misc{bertasius2021spacetime,
    title   = {Is Space-Time Attention All You Need for Video Understanding?}, 
    author  = {Gedas Bertasius and Heng Wang and Lorenzo Torresani},
    year    = {2021},
    eprint  = {2102.05095},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Comments

How to deal with varying length video? Thanks

Dear all, I am wondering if TimeSformer can handle different videos with diverse lengths? Is it possible to use mask as the original Transformer? Any ideas, thanks a lot.

opened by junyongyou 2
fix runtime error in SpaceTime Attention

There is a shape mismatch error in Attention. When we splice out the classification token from the first token of each sequence in q, k and v, the shape becomes (batch_size * num_heads, num_frames * num_patches - 1, head_dim). Then we try to reshape the tensor by taking out a factor of num_frames or num_patches (depending on whether it is space or time attention) from dimension 1. That doesn't work because we subtracted out the classification token.

I found that performing the rearrange operation before splicing the token fixes the issue.

I recreate the problem and illustrate the solution in this notebook: https://colab.research.google.com/drive/1lHFcn_vgSDJNSqxHy7rtqhMVxe0nUCMS?usp=sharing.

By the way, thank you to @lucidrains; all of your implementations on attention-based models are helping me more than you know.

opened by adam-mehdi 1
Update timesformer_pytorch.py

fixing issue for scaling

File "/home/aarti9/.local/lib/python3.6/site-packages/timesformer_pytorch/timesformer_pytorch.py", line 82, in forward q *= self.scale

RuntimeError: Output 0 of ViewBackward is a view and is being modified inplace. This view is an output of a function that returns multiple views. Inplace operators on such views is forbidden. You should replace the inplace operation by an out-of-place one.

opened by aarti9 0
Fine-tune with new datasets

Thank you so much for your great effort. I can predict the images using the given .py files. But, I couldn't find train.py files, so how to fine-tune the network with new datasets? where should i define the image samples of the new dataset ?

opened by Jeba-create 0
problem in timesformer_pytorch.py

start from line 182 video = rearrange(video, 'b f c (h p1) (w p2) -> b (f h w) (p1 p2 c)', p1 = p, p2 = p) i think this should be video = rearrange(video, 'b f c (hp p1) (wp p2) -> b (f hp wp) (p1 p2 c)', p1 = p, p2 = p)

opened by Weizhongjin 2
Imagenet Pretrained Weights

Thanks for the work! In their paper they say For all our experiments, we adopt the “Base” ViT model architecture (Dosovitskiy et al., 2020) pretrained on ImageNet.

I know that you said the official weights trained on kinetics and such are not officially released yet. However, I am not interested in those but am actually in need of the initial weights of the network just based on ViT Imagenet pretraining. I need to train this implementation of yours starting from those. From what it looks like, you don't have weights for this implementation that come from imagenet pretraining, do you?

opened by RaivoKoot 5

Releases(0.4.1)

0.4.1(Aug 25, 2021)

Source code(tar.gz)
Source code(zip)
0.4.0(Aug 16, 2021)

Source code(tar.gz)
Source code(zip)
0.3.3(Jul 4, 2021)

Source code(tar.gz)
Source code(zip)
0.3.2(Apr 26, 2021)

Source code(tar.gz)
Source code(zip)
0.3.1(Apr 25, 2021)

Source code(tar.gz)
Source code(zip)
0.2.1(Apr 21, 2021)

Source code(tar.gz)
Source code(zip)
0.1.1(Mar 23, 2021)

Source code(tar.gz)
Source code(zip)
0.1.0(Mar 21, 2021)

Source code(tar.gz)
Source code(zip)
0.0.5(Mar 18, 2021)

Source code(tar.gz)
Source code(zip)
0.0.4(Feb 11, 2021)

Source code(tar.gz)
Source code(zip)
0.0.3(Feb 11, 2021)

Source code(tar.gz)
Source code(zip)
0.0.2(Feb 11, 2021)

Source code(tar.gz)
Source code(zip)
0.0.1a(Feb 11, 2021)

Source code(tar.gz)
Source code(zip)

Owner

Phil Wang

Working with Attention. It's all we need.

GitHub

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Transformer in Transformer Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image c

272 Dec 23, 2022

Official Pytorch Implementation of Relational Self-Attention: What's Missing in Attention for Video Understanding

Relational Self-Attention: What's Missing in Attention for Video Understanding This repository is the official implementation of "Relational Self-Atte

43 Dec 7, 2022

Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks

Uniformer - Pytorch Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification ta

90 Nov 24, 2022

Xview3 solution - XView3 challenge, 2nd place solution

Xview3, 2nd place solution https://iuu.xview.us/ test split aggregate score publ

24 Nov 23, 2022

A Simple LSTM-Based Solution for "Heartbeat Signal Classification and Prediction" in Tianchi

LSTM-Time-Series-Prediction A Simple LSTM-Based Solution for "Heartbeat Signal Classification and Prediction" in Tianchi Contest. The Link of the Cont

1 Jun 13, 2022

Implementation of ResMLP, an all MLP solution to image classification, in Pytorch

ResMLP - Pytorch Implementation of ResMLP, an all MLP solution to image classification out of Facebook AI, in Pytorch Install $ pip install res-mlp-py

178 Dec 2, 2022

PyTorch implementation of Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network

hierarchical-multi-label-text-classification-pytorch Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network Approach This

17 Dec 13, 2022

7th place solution of Human Protein Atlas - Single Cell Classification on Kaggle

kaggle-hpa-2021-7th-place-solution Code for 7th place solution of Human Protein Atlas - Single Cell Classification on Kaggle. A description of the met

8 Jul 9, 2021

4st place solution for the PBVS 2022 Multi-modal Aerial View Object Classification Challenge - Track 1 (SAR) at PBVS2022

A Two-Stage Shake-Shake Network for Long-tailed Recognition of SAR Aerial View Objects 4st place solution for the PBVS 2022 Multi-modal Aerial View Ob

5 Nov 9, 2022

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

1 Jan 23, 2022

Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

HaloNet - Pytorch Implementation of the Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones. This re

189 Nov 22, 2022

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

180 Jan 5, 2023

[CVPR 2022] Official PyTorch Implementation for "Reference-based Video Super-Resolution Using Multi-Camera Video Triplets"

Reference-based Video Super-Resolution (RefVSR) Official PyTorch Implementation of the CVPR 2022 Paper Project | arXiv | RealMCVSR Dataset This repo c

151 Dec 30, 2022

Implementation of TimeSformer, a pure attention-based solution for video classification

Related tags

Overview

TimeSformer - Pytorch

Install

Usage

Citations

Comments

How to deal with varying length video? Thanks

fix runtime error in SpaceTime Attention

Update timesformer_pytorch.py

Fine-tune with new datasets

problem in timesformer_pytorch.py

Imagenet Pretrained Weights

Releases(0.4.1)

0.4.1(Aug 25, 2021)

0.4.0(Aug 16, 2021)

0.3.3(Jul 4, 2021)

0.3.2(Apr 26, 2021)

0.3.1(Apr 25, 2021)

0.2.1(Apr 21, 2021)

0.1.1(Mar 23, 2021)

0.1.0(Mar 21, 2021)

0.0.5(Mar 18, 2021)

0.0.4(Feb 11, 2021)

0.0.3(Feb 11, 2021)

0.0.2(Feb 11, 2021)

0.0.1a(Feb 11, 2021)

Owner

Phil Wang

Implementation of Transformer in Transformer, pixel level attention paired with patch level attention for image classification, in Pytorch

Official Pytorch Implementation of Relational Self-Attention: What's Missing in Attention for Video Understanding

Implementation of Uniformer, a simple attention and 3d convolutional net that achieved SOTA in a number of video classification tasks

Xview3 solution - XView3 challenge, 2nd place solution

A Simple LSTM-Based Solution for "Heartbeat Signal Classification and Prediction" in Tianchi

Implementation of ResMLP, an all MLP solution to image classification, in Pytorch

PyTorch implementation of Hierarchical Multi-label Text Classification: An Attention-based Recurrent Network

7th place solution of Human Protein Atlas - Single Cell Classification on Kaggle

4st place solution for the PBVS 2022 Multi-modal Aerial View Object Classification Challenge - Track 1 (SAR) at PBVS2022

Video-Captioning - A machine Learning project to generate captions for video frames indicating the relationship between the objects in the video

Implementation of the 😇 Attention layer from the paper, Scaling Local Self-Attention For Parameter Efficient Visual Backbones

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Official implementation of cosformer-attention in cosFormer: Rethinking Softmax in Attention

Implementation of Deformable Attention in Pytorch from the paper "Vision Transformer with Deformable Attention"

Hl classification bc - A Network-Based High-Level Data Classification Algorithm Using Betweenness Centrality

Implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

A PyTorch implementation of "Graph Classification Using Structural Attention" (KDD 2018).

Official implementation of CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification

[CVPR 2022] Official PyTorch Implementation for "Reference-based Video Super-Resolution Using Multi-Camera Video Triplets"