MoViNets PyTorch implementation: Mobile Video Networks for Efficient Video Recognition

Last update: Dec 20, 2022

Overview

MoViNet-pytorch

Unofficial PyTorch implementation of MoViNets: Mobile Video Networks for Efficient Video Recognition.
Authors: Dan Kondratyuk, Liangzhe Yuan, Yandong Li, Li Zhang, Mingxing Tan, Matthew Brown, Boqing Gong (Google Research)
[Authors' Implementation]

Stream Buffer

Clean stream buffer

The buffer must be cleaned after all the clips of the same video have been processed, so that activations from one video do not carry over into the next.

model.clean_activation_buffers()
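To illustrate why the reset matters, here is a minimal sketch using a toy stand-in for a causal model. ToyStreamModel and predict_video are hypothetical names invented for this example and are not part of this repository; only the method name clean_activation_buffers is taken from the text above. The toy model keeps a running mean of every value it has seen, which mimics state carried across clips.

```python
class ToyStreamModel:
    """Hypothetical stand-in for a causal model; NOT the MoViNet API.

    It keeps a running mean of every frame value seen so far, which mimics
    how a stream buffer carries state from one clip to the next.
    """

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def __call__(self, clip):
        # Accumulate state across calls, as a stream buffer would.
        self.total += sum(clip)
        self.count += len(clip)
        return self.total / self.count

    def clean_activation_buffers(self):
        # Same method name as in this repository; here it only resets toy state.
        self.total = 0.0
        self.count = 0


def predict_video(model, clips, clean=True):
    """Feed the clips of one video in order and return the last output."""
    out = None
    for clip in clips:
        out = model(clip)
    if clean:
        # Reset once all clips of this video have been processed.
        model.clean_activation_buffers()
    return out


video_a = [[1.0, 1.0], [1.0, 1.0]]  # two clips of two "frames" each
video_b = [[3.0, 3.0], [3.0, 3.0]]

model = ToyStreamModel()
predict_video(model, video_a, clean=True)
clean_result = predict_video(model, video_b, clean=True)

model = ToyStreamModel()
predict_video(model, video_a, clean=False)
leaky_result = predict_video(model, video_b, clean=False)
```

With the reset, the output for video_b depends only on video_b. Without it, the values of video_a are still in the running state and shift the result.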

Usage

Click on "Open in Colab" to open an example of training on HMDB-51

Installation

pip install git+https://github.com/Atze00/MoViNet-pytorch.git

How to build a model

Use causal = True to build the model with a stream buffer; causal = False builds it with standard convolutions.

from movinets import MoViNet
from movinets.config import _C

MoViNetA0 = MoViNet(_C.MODEL.MoViNetA0, causal = True, pretrained = True )
MoViNetA1 = MoViNet(_C.MODEL.MoViNetA1, causal = True, pretrained = True )
...

Load weights

Use pretrained = True to load the pretrained weights.

# If pretrained is True:
#     num_classes is set to 600,
#     conv_type is set to "3d" if causal is False, "2plus1d" if causal is True,
#     tf_like is set to True.
model = MoViNet(_C.MODEL.MoViNetA0, causal = True, pretrained = True )
model = MoViNet(_C.MODEL.MoViNetA0, causal = False, pretrained = True )
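The rules listed above can be restated as a small lookup. This is only a sketch: pretrained_overrides is a hypothetical helper written for illustration, not a function of the movinets package, and it simply encodes the three settings that the text says are forced when pretrained is True.

```python
def pretrained_overrides(causal):
    """Hypothetical helper mirroring the documented pretrained settings.

    Returns the values that, according to the text above, are fixed
    whenever pretrained=True is passed.
    """
    return {
        "num_classes": 600,
        # Causal (streaming) models use (2+1)D convolutions, base models use 3D.
        "conv_type": "2plus1d" if causal else "3d",
        "tf_like": True,
    }


stream_settings = pretrained_overrides(causal=True)
base_settings = pretrained_overrides(causal=False)
```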

Training loop examples

Training loop with stream buffer

import torch.nn.functional as F

def train_iter(model, optimz, data_load, n_clips = 5, n_clip_frames=8):
    """
    In causal mode with stream buffer a single video is fed to the network
    using subclips of length n_clip_frames.
    n_clips*n_clip_frames should be equal to the total number of frames present
    in the video.

    n_clips : number of clips that are used
    n_clip_frames : number of frames contained in each clip
    """

    # clean the buffer of activations
    model.clean_activation_buffers()
    optimz.zero_grad()
    # each batch yields (data, _, target); the middle element is unused
    for i, (data, _, target) in enumerate(data_load):
        # backward pass for each clip; gradients accumulate across clips
        for j in range(n_clips):
            clip = data[:, :, n_clip_frames*j : n_clip_frames*(j+1)]
            out = F.log_softmax(model(clip), dim=1)
            # divide by n_clips so the accumulated gradient is the clip average
            loss = F.nll_loss(out, target)/n_clips
            loss.backward()
        optimz.step()
        optimz.zero_grad()

        # clean the buffer of activations
        model.clean_activation_buffers()
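The clip slicing in the loop above can be checked in isolation. The following sketch uses a hypothetical helper, clip_slices, that is not part of the repository. It computes the temporal slices for a video and enforces the constraint that n_clips*n_clip_frames equals the total number of frames, here demonstrated on a plain list of frame indices instead of a tensor.

```python
def clip_slices(total_frames, n_clip_frames):
    """Return one slice per clip, covering the video without overlap.

    Hypothetical helper: raises ValueError when the video length is not
    a multiple of n_clip_frames, since frames would otherwise be dropped.
    """
    if total_frames % n_clip_frames != 0:
        raise ValueError(
            f"{total_frames} frames cannot be split into clips of {n_clip_frames}"
        )
    n_clips = total_frames // n_clip_frames
    return [
        slice(j * n_clip_frames, (j + 1) * n_clip_frames) for j in range(n_clips)
    ]


# Stand-in for the temporal axis of a 16-frame video.
frames = list(range(16))
slices = clip_slices(len(frames), n_clip_frames=8)
clips = [frames[s] for s in slices]
```

Each slice could be applied to the temporal dimension of a batch in the same way the training loop indexes data[:, :, start:end].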

Training loop with standard convolutions

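import torch.nn.functional as F

def train_iter(model, optimz, data_load):

    optimz.zero_grad()
    # each batch yields (data, _, target); the middle element is unused
    for i, (data, _, target) in enumerate(data_load):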
        out = F.log_softmax(model(data), dim=1)
        loss = F.nll_loss(out, target)
        loss.backward()
        optimz.step()
        optimz.zero_grad()

Pretrained models

Weights

The weights are loaded from the TensorFlow models released by the authors, trained on Kinetics 600.

Base Models

Base models implement standard 3D convolutions without stream buffers.

Model Name	Top-1 Accuracy*	Top-5 Accuracy*	Input Shape
MoViNet-A0-Base	72.28	90.92	50 x 172 x 172
MoViNet-A1-Base	76.69	93.40	50 x 172 x 172
MoViNet-A2-Base	78.62	94.17	50 x 224 x 224
MoViNet-A3-Base	81.79	95.67	120 x 256 x 256
MoViNet-A4-Base	83.48	96.16	80 x 290 x 290
MoViNet-A5-Base	84.27	96.39	120 x 320 x 320

Model Name	Top-1 Accuracy*	Top-5 Accuracy*	Input Shape**
MoViNet-A0-Stream	72.05	90.63	50 x 172 x 172
MoViNet-A1-Stream	76.45	93.25	50 x 172 x 172
MoViNet-A2-Stream	78.40	94.05	50 x 224 x 224

**In streaming mode, the number of frames corresponds to the total accumulated duration of the 10-second clip.

*Accuracy as reported in the official repository for the Kinetics 600 dataset. I have not verified it myself, but it should be the same, since the TF models and the reimplemented PyTorch models output the same results [Test].
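As a small worked example of the streaming note, the frame counts in the stream table can be turned into an implied sampling rate. This sketch assumes the frames are evenly spaced over the 10-second clip, which the text does not state explicitly; implied_fps is a hypothetical helper.

```python
def implied_fps(n_frames, clip_seconds=10.0):
    """Frames per second, assuming frames are evenly spaced over the clip."""
    return n_frames / clip_seconds


# Frame counts taken from the first dimension of the stream table's input shapes.
stream_frames = {
    "MoViNet-A0-Stream": 50,
    "MoViNet-A1-Stream": 50,
    "MoViNet-A2-Stream": 50,
}

fps = {name: implied_fps(n) for name, n in stream_frames.items()}
```

Under that assumption, each streaming model in the table would see 5 frames per second.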

I have not yet tested the speed of the streaming models; feel free to test and contribute.

Status

Pretrained models are currently available for the architectures listed in the tables above: A0 to A5 in the base version and A0 to A2 in the streaming version.

I currently have no plans to include streaming versions of A3, A4, and A5. Those models are too slow for most mobile applications.

Testing

I recommend creating a new environment for testing and running the following command to install all the required packages:
pip install -r tests/test_requirements.txt

Citations

@article{kondratyuk2021movinets,
  title={MoViNets: Mobile Video Networks for Efficient Video Recognition},
  author={Dan Kondratyuk and Liangzhe Yuan and Yandong Li and Li Zhang and Mingxing Tan and Matthew Brown and Boqing Gong},
  journal={arXiv preprint arXiv:2103.11511},
  year={2021}
}

Comments

Neural network arch displayed by Netron is wrong
Hi @Atze00, I saved the MoViNet-A0 model to a 'pth' file and inspected it with Netron, but the structure looks a little strange. Maybe there is something wrong in '_forward_impl' of 'class MoViNet(nn.Module)'.

My code is as follows:

model = MoViNet(_C.MODEL.MoViNetA0, causal=True, pretrained=False, num_classes=num_class)
...
torch.save(model, '/path/to/*.pth')
path = '/path/to/*.pth'
model = torch.load(path, map_location='cpu')

Please let me know if I did something wrong.
opened by erwangccc 7
When pretrained stream version will be available?

It looks like there is a bug in the current causal padding of the official TF models. I have filed an issue and a pull request to fix it, and I am currently waiting for feedback on the PR. It is possible to obtain the same behavior even without the fix, but I am not considering that at the moment. Reference to the issue: https://github.com/tensorflow/models/issues/10062

opened by Atze00 5
Training code

Thank you for your good implementation! Could you provide more detailed training code or an example?

Originally posted by @zhangyuan1994511 in https://github.com/Atze00/MoViNet-pytorch/issues/2#issuecomment-842952447

opened by Atze00 5
Error in running movinet-pytorch to onnx converted model

Hi @Atze00,

I successfully trained this repo's MoViNets with transfer learning, and I want to deploy the result elsewhere in TensorRT format. I first converted the PyTorch model to ONNX using torch.onnx.export. To verify that the model was ported correctly, I ran it in onnxruntime, and it throws an error like the one in the screenshot.

Can you suggest any solutions? Let me know if you need more screenshots or anything else.

opened by papasanimohansrinivas 4
About frames number of dataset

According to the paper, "For all datasets, we train with 64 frames (except when the inference frames are fewer) at various frame-rates, and run inference with the same frame-rate". The training clips have 64 frames, but a video from my dataset may have fewer than 32 frames, for example 8, 10, 13, or 17. Is that OK?

Thanks in advance.

opened by erwangccc 4
Testing with 'evaluate_stream' is OK, but frame-by-frame inference gives very different results?
I trained the 'a0' model with clip=1, Tclip=8, and the accuracy on my custom test dataset is good, but when the model runs inference on an online video stream, recognition is not as good as on the test dataset. In fact, the test dataset comes from the same video stream, so I believe the domain is the same.

The online demo reads one frame from the video stream and concatenates it with the previous 7 frames before feeding the model, so the model also runs inference on 8 frames.

The code for the online video stream is as follows:

if isinstance(new_frame, torch.Tensor):
    torch_inputs = torch.cat((tensor_fifo[:, :, 1:, :, :], new_frame), 2)
else:
    cvt_img = cv2.cvtColor(new_frame, cv2.COLOR_BGR2RGB)
    cvt_img = np.transpose(cvt_img, (2, 0, 1))
    cvt_img = cvt_img[:, np.newaxis, :, :][np.newaxis, :, :, :, :]
    torch_inputs = np.concatenate((tensor_fifo[:, :, 1:, :, :], cvt_img), 2)
opened by erwangccc 2
MoViNet-A0 block 2 should have 40 out channels

Nit: per the paper, there should be 40 output channels in block 2 for MoViNet-A0, while this implementation currently has 24. https://github.com/Atze00/MoViNet-pytorch/blob/2ad697facd370d01b0f3a6093d38211166e594de/movinets/config.py#L62

https://arxiv.org/pdf/2103.11511.pdf

opened by SMHendryx 2
Accuracy drop when using causalConv

When I use the model without causalConv, I get the results that I expected. However when I set causal to True, the training/validation loss drops very suddenly and eventually turns into NaN.

This leads to very undesirable results and accuracy, whilst not having any insight into the loss. Are there any ways to solve this?

opened by yuridekim 2
F.ToFloatTensorInZeroOne not exist

Did you define your own transforms? I can't import ToFloatTensorInZeroOne() from torchvision.transforms. I also find that the video returned by torchvision.datasets.HMDB51() is a tensor with dtype=torch.uint8, so should I just divide by 255.0?

opened by haowei2020 1
Very low validation accuracy with pretrained models!

Hello, thanks for the PyTorch implementation! I noticed that the pre-trained models have very low accuracy (0.36%) on the validation set. Have the weights changed? I'm just trying the colab tutorial.

opened by ghazalehtrb 1
Using MoViNet in a dataset with variable-length videos

Hi!

Thanks for the work to bring this paper to PyTorch.

I was wondering how can we use MoViNet training with the stream buffer when we don't have videos with the same number of frames.

Authors of the paper claimed, in Section 4, that they used the method in Charades Dataset: "[...] Charades [53], which has variable-length videos with 157 action classes where a video can contain multiple class annotations." However, they don't specify which policy they used when training with variable-length videos. They also report results on the EPIC Kitchens Dataset which is also variable-length.

Do you have any insights into how they may have trained these models? The main issue here is how to build a batch when temporal dimensions are not the same...

Thank you!

opened by alexlopezcifuentes 1
got wrong results during test

Hi, thanks for your great work on MoViNet. I ran into a problem when testing HMDB51 videos. For example, the inference result for "brush hair" seems weird: in some frames the result shows "brush hair", while in other frames it shows "kick ball". Have you seen this problem before? In the code, 16 frames are divided into 2 clips of 8 frames each, but during the test phase the first clip's prediction differs from the second one's, and the final prediction uses the second. Is there anything wrong with this?

opened by poincarelee 0
How can we access the stream buffer?

Hi, thanks for this repo, it's amazing. I am interested in analyzing the stream buffer but couldn't find it. Can someone point me in the right direction?

Thanks

opened by bf2harven 1
Kinetics 400 models

Thanks for your Pytorch implementation for MoViNet. Could you please provide a model trained on Kinetics 400 as Table 9 in [1]. It is quite important for our future works. Looking forward to your reply.

[1] MoViNets: Mobile Video Networks for Efficient Video Recognition

opened by lovelyczli 1
Modifying for binary classification

Hi, trying to solve a binary classification problem with these movinets and was wondering how I would go about creating a MoVinet with some augmentation with the aim of reducing the TFLOPS required and decreasing the inference time

opened by ekuluke 2
