MoViNets PyTorch implementation: Mobile Video Networks for Efficient Video Recognition

Overview

MoViNet-pytorch

Open In Colab Paper

Unofficial PyTorch implementation of MoViNets: Mobile Video Networks for Efficient Video Recognition.
Authors: Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong (Google Research)
[Authors' Implementation]

Stream Buffer


Clean stream buffer

The activation buffer must be cleaned after all the clips of the same video have been processed:

model.clean_activation_buffers()
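
For instance, a minimal inference sketch, assuming a model built with causal = True and a hypothetical videos iterable of preprocessed (1, C, T, H, W) float tensors:

import torch
from movinets import MoViNet
from movinets.config import _C

model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=True)
model.eval()

with torch.no_grad():
    for video in videos:  # 'videos' is a hypothetical iterable of (1, C, T, H, W) float tensors
        n_frames = video.shape[2]
        # Feed the video in subclips of 8 frames; the stream buffer carries
        # temporal state from one subclip to the next.
        for start in range(0, n_frames, 8):
            output = model(video[:, :, start:start + 8])
        prediction = output.argmax(dim=1)
        # Reset the buffer before starting the next video.
        model.clean_activation_buffers()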

Usage

Open In Colab
Click on "Open in Colab" to open an example of training on HMDB-51

Installation

pip install git+https://github.com/Atze00/MoViNet-pytorch.git

How to build a model

Use causal = True to build the model with the stream buffer; causal = False uses standard convolutions.

from movinets import MoViNet
from movinets.config import _C

MoViNetA0 = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=True)
MoViNetA1 = MoViNet(_C.MODEL.MoViNetA1, causal=True, pretrained=True)
...
Load weights

Set pretrained = True to load the model with pretrained weights.

    """
    If pretrained is True:
        num_classes is set to 600,
        conv_type is set to "3d" if causal is False, "2plus1d" if causal is True
        tf_like is set to True
    """
model = MoViNet(_C.MODEL.MoViNetA0, causal = True, pretrained = True )
model = MoViNet(_C.MODEL.MoViNetA0, causal = False, pretrained = True )
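
For fine-tuning on a different dataset, the constructor also accepts a num_classes argument when pretrained = False; a minimal sketch (51 classes here is just an example, matching HMDB-51):

# num_classes can be set directly when training from scratch
# (pretrained weights force num_classes=600, as noted above).
model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=False, num_classes=51)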

Training loop examples

Training loop with stream buffer

import torch.nn.functional as F

def train_iter(model, optimz, data_load, n_clips=5, n_clip_frames=8):
    """
    In causal mode with the stream buffer, a single video is fed to the
    network as subclips of length n_clip_frames.
    n_clips * n_clip_frames should equal the total number of frames present
    in the video.

    n_clips : number of clips that are used
    n_clip_frames : number of frames contained in each clip
    """
    # clean the buffer of activations
    model.clean_activation_buffers()
    optimz.zero_grad()
    for i, (data, _, target) in enumerate(data_load):
        # backward pass for each clip
        for j in range(n_clips):
            out = F.log_softmax(model(data[:, :, n_clip_frames * j:n_clip_frames * (j + 1)]), dim=1)
            loss = F.nll_loss(out, target) / n_clips
            loss.backward()
        optimz.step()
        optimz.zero_grad()

        # clean the buffer of activations
        model.clean_activation_buffers()
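
A hedged usage sketch of this loop; train_loader and num_epochs are hypothetical, and the loader is assumed to yield (video, audio, label) batches, as torchvision's HMDB51 dataset does, with videos of n_clips * n_clip_frames = 40 frames:

import torch
from movinets import MoViNet
from movinets.config import _C

model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=True)
optimz = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(num_epochs):  # 'num_epochs' is a placeholder
    train_iter(model, optimz, train_loader, n_clips=5, n_clip_frames=8)
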

Training loop with standard convolutions

def train_iter(model, optimz, data_load):
    optimz.zero_grad()
    for i, (data, _, target) in enumerate(data_load):
        out = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(out, target)
        loss.backward()
        optimz.step()
        optimz.zero_grad()

Pretrained models

Weights

The weights are loaded from the TensorFlow models released by the authors, trained on Kinetics 600.

Base Models

Base models implement standard 3D convolutions without stream buffers.

| Model Name | Top-1 Accuracy* | Top-5 Accuracy* | Input Shape |
| --- | --- | --- | --- |
| MoViNet-A0-Base | 72.28 | 90.92 | 50 x 172 x 172 |
| MoViNet-A1-Base | 76.69 | 93.40 | 50 x 172 x 172 |
| MoViNet-A2-Base | 78.62 | 94.17 | 50 x 224 x 224 |
| MoViNet-A3-Base | 81.79 | 95.67 | 120 x 256 x 256 |
| MoViNet-A4-Base | 83.48 | 96.16 | 80 x 290 x 290 |
| MoViNet-A5-Base | 84.27 | 96.39 | 120 x 320 x 320 |
Streaming Models

Streaming models use causal convolutions with the stream buffer.

| Model Name | Top-1 Accuracy* | Top-5 Accuracy* | Input Shape** |
| --- | --- | --- | --- |
| MoViNet-A0-Stream | 72.05 | 90.63 | 50 x 172 x 172 |
| MoViNet-A1-Stream | 76.45 | 93.25 | 50 x 172 x 172 |
| MoViNet-A2-Stream | 78.40 | 94.05 | 50 x 224 x 224 |

**In streaming mode, the number of frames corresponds to the total accumulated duration of the 10-second clip.

*Accuracy reported in the official repository for the Kinetics 600 dataset; I have not tested it myself. It should match, since the TF models and the reimplemented PyTorch models produce the same outputs [Test].

I haven't tested the speed of the streaming models yet; feel free to test them and contribute.

Status

Pretrained models are currently available for the following architectures:

  • MoViNetA0-BASE
  • MoViNetA0-STREAM
  • MoViNetA1-BASE
  • MoViNetA1-STREAM
  • MoViNetA2-BASE
  • MoViNetA2-STREAM
  • MoViNetA3-BASE
  • MoViNetA4-BASE
  • MoViNetA5-BASE

I currently have no plans to include streaming versions of A3, A4 and A5: those models are too slow for most mobile applications.

Testing

I recommend creating a new environment for testing and running the following command to install the required packages:
pip install -r tests/test_requirements.txt

Citations

@article{kondratyuk2021movinets,
  title={MoViNets: Mobile Video Networks for Efficient Video Recognition},
  author={Kondratyuk, Dan and Yuan, Liangzhe and Li, Yandong and Zhang, Li and Tan, Mingxing and Brown, Matthew and Gong, Boqing},
  journal={arXiv preprint arXiv:2103.11511},
  year={2021}
}
Comments
  • Neural network arch displayed by Netron is wrong

    Hi @Atze00, I saved the MoViNet-A0 model to .pth and looked through it with Netron, but the structure is a little strange. Maybe there is something wrong in _forward_impl of class MoViNet(nn.Module).

    My code is as follows:

    model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=False, num_classes=num_class)
    ...
    torch.save(model, '/path/to/*.pth')
    path = '/path/to/*.pth'
    model = torch.load(path, map_location='cpu')
    

    Please let me know if I did something wrong.

    opened by erwangccc 7
  • When will the pretrained stream version be available?

    It looks like there is a bug in the current causal padding of the official TF models; I have filed an issue and a pull request to fix it. I am currently waiting for feedback on the PR. It would be possible to obtain the same behavior even without the fix, but I am not considering that at the moment. Reference to the issue: https://github.com/tensorflow/models/issues/10062

    opened by Atze00 5
  • Training code

    Thank you for your good implementation! Could you provide more detailed training code or an example?

    Originally posted by @zhangyuan1994511 in https://github.com/Atze00/MoViNet-pytorch/issues/2#issuecomment-842952447

    opened by Atze00 5
  • Error in running movinet-pytorch to onnx converted model

    Hi @Atze00,

    I successfully transfer-trained this repo's MoViNets, but I want to deploy them elsewhere in TensorRT format, so I first converted the PyTorch model to ONNX using torch.onnx.export.

    I wanted to verify whether the model was correctly ported, so I ran it in onnxruntime, and it throws an error like the one in the screenshot.

    Please suggest any solutions, and let me know if you need more screenshots or anything else.

    opened by papasanimohansrinivas 4
  • About frames number of dataset

    According to the paper, "For all datasets, we train with 64 frames (except when the inference frames are fewer) at various frame-rates, and run inference with the same frame-rate". So training videos have 64 frames, but a video from my dataset may have fewer than 32 frames, maybe 8, 10, 13, 17, etc. Is that OK?

    Thanks in advance.

    opened by erwangccc 4
  • Test model based on 'evaluate_stream' is ok, but inference frame by frame is very different?

    I trained the 'a0' model with clip=1, Tclip=8, and the accuracy on my custom test dataset is good, but when the model runs inference on an online video stream, recognition is not as good as on the test dataset. In fact, the test dataset comes from the same video stream I use, so I believe the domains are the same.

    The online demo reads one frame from the video stream and concatenates it with the previous 7 frames, then feeds them to the model, so the model also runs inference with 8 frames.

    The code for the online video stream is as follows:

    if isinstance(new_frame, torch.Tensor):
        torch_inputs = torch.cat((tensor_fifo[:, :, 1:, :, :], new_frame), 2)
    else:
        cvt_img = cv2.cvtColor(new_frame, cv2.COLOR_BGR2RGB)
        cvt_img = np.transpose(cvt_img, (2, 0, 1))
        cvt_img = cvt_img[:, np.newaxis, :, :][np.newaxis, :, :, :, :]
        torch_inputs = np.concatenate((tensor_fifo[:, :, 1:, :, :], cvt_img), 2)
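
    For comparison, a minimal sketch of buffer-based streaming, assuming a model built with causal = True: in causal mode the stream buffer itself carries temporal state, so each newly arrived frame can be fed on its own instead of re-feeding a sliding FIFO window, with clean_activation_buffers() called when the stream restarts. 'frame_source' is hypothetical:

    import torch

    model.eval()
    model.clean_activation_buffers()  # reset state at the start of the stream
    with torch.no_grad():
        for frame in frame_source:  # hypothetical iterator of (1, C, 1, H, W) float tensors
            logits = model(frame)
            prediction = logits.argmax(dim=1)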
    
    opened by erwangccc 2
  • MoViNet-A0 block 2 should have 40 out channels

    Nit: per the paper, block 2 of MoViNet-A0 should have 40 output channels, while this implementation currently has 24. https://github.com/Atze00/MoViNet-pytorch/blob/2ad697facd370d01b0f3a6093d38211166e594de/movinets/config.py#L62

    https://arxiv.org/pdf/2103.11511.pdf

    opened by SMHendryx 2
  • Accuracy drop when using causalConv

    When I use the model without causalConv, I get the results I expect. However, when I set causal to True, the training/validation loss drops very suddenly and eventually turns into NaN. This leads to very undesirable results and accuracy, without any insight into the loss. Is there any way to solve this?

    opened by yuridekim 2
  • F.ToFloatTensorInZeroOne not exist

    Did you define your own transforms? I can't import ToFloatTensorInZeroOne() from torchvision.transforms. I also find that the video returned by torchvision.datasets.HMDB51() is a tensor with dtype=torch.uint8, so should I just divide by 255.0?
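
    A plausible definition, assuming it is a custom transform rather than part of torchvision (a sketch, not necessarily the repo's exact code):

    import torch

    # Assumed behavior: convert a THWC uint8 video (as returned by
    # torchvision.datasets.HMDB51) to a CTHW float tensor in [0, 1].
    class ToFloatTensorInZeroOne:
        def __call__(self, video: torch.Tensor) -> torch.Tensor:
            return video.permute(3, 0, 1, 2).to(torch.float32) / 255.0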

    opened by haowei2020 1
  • Very low validation accuracy with pretrained models!

    Hello, thanks for the PyTorch implementation! I noticed that the pretrained models have very low accuracy (0.36%) on the validation set. Have the weights changed? I'm just trying the Colab tutorial.

    opened by ghazalehtrb 1
  • Using MoViNet in a dataset with variable-length videos

    Hi!

    Thanks for the work to bring this paper to PyTorch.

    I was wondering how we can train MoViNet with the stream buffer when our videos don't all have the same number of frames.

    The authors of the paper state, in Section 4, that they used this method on the Charades dataset: "[...] Charades [53], which has variable-length videos with 157 action classes where a video can contain multiple class annotations." However, they don't specify which policy they used when training with variable-length videos. They also report results on the EPIC Kitchens dataset, which also has variable-length videos.

    Do you have any insights into how they may have trained these models? The main issue here is how to build a batch when temporal dimensions are not the same...

    Thank you!

    opened by alexlopezcifuentes 1
  • got wrong results during test

    Hi, thanks for your great work on MoViNet. I ran into a problem when testing HMDB51 videos. For example, the inference result for "brush hair" seems weird: in some frames the result shows "brush hair", while in other frames it shows "kick ball". Did you meet this problem before? In the code, 16 frames are divided into 2 clips, each with 8 frames, but during the test phase the first clip's prediction differs from the second's, and the final prediction uses the second. Is there anything wrong with this?

    opened by poincarelee 0
  • How can we access the stream buffer?

    Hi, thanks for this repo, it's amazing. I am interested in analyzing the stream buffer but couldn't find it. Can someone point me in the right direction?

    Thanks

    opened by bf2harven 1
  • Kinetics 400 models

    Thanks for your PyTorch implementation of MoViNet. Could you please provide a model trained on Kinetics 400, as in Table 9 of [1]? It is quite important for our future work. Looking forward to your reply.

    [1] MoViNets: Mobile Video Networks for Efficient Video Recognition

    opened by lovelyczli 1
  • Modifying for binary classification

    Hi, I'm trying to solve a binary classification problem with these MoViNets and was wondering how I would go about creating a MoViNet with some augmentation, with the aim of reducing the TFLOPS required and decreasing the inference time.
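
    A minimal sketch of one possible starting point, an assumption based on the num_classes constructor argument used elsewhere in these issues:

    from movinets import MoViNet
    from movinets.config import _C

    # Hypothetical two-class MoViNet-A0: the smaller variants like A0 already
    # target low FLOPs, and causal=True enables the stream buffer for
    # low-latency frame-by-frame inference.
    model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=False, num_classes=2)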

    opened by ekuluke 2