Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Overview

by Shusheng Yang1,3, Xinggang Wang1 📧 , Yu Li4, Yuxin Fang1, Jiemin Fang1,2, Wenyu Liu1, Xun Zhao3, Ying Shan3.

1 School of EIC, HUST, 2 AIA, HUST, 3 ARC Lab, Tencent PCG, 4 IDEA.

( 📧 ) corresponding author.


  • This repo provides code, models, and training/inference recipes for TeViT (Temporally Efficient Vision Transformer for Video Instance Segmentation).
  • TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework on query-based instance segmentation methods, i.e., QueryInst.
  • We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction head in the instance heads. Together, these two designs fully utilize both frame-level and instance-level temporal context information and obtain strong temporal modeling capacity with negligible extra computational cost.
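
To make the messenger shift idea more concrete, here is a toy sketch in plain Python. It assumes that each frame carries a few extra "messenger" tokens and that shifting means moving those tokens to neighbouring frames along the time axis, so that a later layer of a frame can read information that originated in other frames. The half-forward, half-backward split, the cyclic wrap-around, and the name `messenger_shift` are illustrative assumptions, not the repository's implementation.

```python
from typing import List, Sequence


def messenger_shift(messengers: Sequence[Sequence[str]]) -> List[List[str]]:
    """Roll messenger tokens across frames (toy version).

    messengers[t][m] is the m-th messenger token of frame t.
    Assumption: the first half of the tokens moves one frame forward in time,
    the second half moves one frame backward, both with cyclic wrap-around.
    """
    num_frames = len(messengers)
    half = len(messengers[0]) // 2
    shifted = []
    for t in range(num_frames):
        prev_frame = messengers[(t - 1) % num_frames]
        next_frame = messengers[(t + 1) % num_frames]
        # Frame t receives the first-half tokens of frame t-1
        # and the second-half tokens of frame t+1.
        shifted.append(list(prev_frame[:half]) + list(next_frame[half:]))
    return shifted


# Three frames with two messenger tokens each; the letters mark the source frame.
frames = [["a0", "a1"], ["b0", "b1"], ["c0", "c1"]]
shifted_frames = messenger_shift(frames)
for t, tokens in enumerate(shifted_frames):
    print(t, tokens)
```

In this sketch no token is created or dropped; only its frame index changes, which is one way to see why a shift of this kind can add temporal context at very little extra cost.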

Overall Arch

Models and Main Results

  • We provide both checkpoints and CodaLab server submissions on the YouTube-VIS-2019 dataset.
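| Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | model | submission |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TeViT_MsgShifT | 46.3 | 70.6 | 50.9 | 45.2 | 54.3 | link | link |
| TeViT_MsgShifT_MST | 46.9 | 70.1 | 52.9 | 45.0 | 53.4 | link | link |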
  • Because training is unstable, we conducted multiple runs; each checkpoint above is the best one among its runs. The average performance is reported in our paper.
  • Besides the basic models, we also provide TeViT with ResNet-50 and Swin-L backbones. These models are also trained on the YouTube-VIS-2019 dataset.
  • MST denotes multi-scale training.
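| Name | AP | AP@50 | AP@75 | AR@1 | AR@10 | model | submission |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TeViT_R50 | 42.1 | 67.8 | 44.8 | 41.3 | 49.9 | link | link |
| TeViT_Swin-L_MST | 56.8 | 80.6 | 63.1 | 52.0 | 63.3 | link | link |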
  • Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbones use the STQI Head only (i.e., without our proposed messenger shift mechanism).
  • With Swin-L as the backbone network, we use more instance queries (increased from 100 to 300) and stronger data augmentation strategies. Both further boost the final performance.

Installation

Prerequisites

  • Linux
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+

Prepare

  • Clone the repository locally:
git clone https://github.com/hustvl/TeViT.git
  • Create a conda virtual environment and activate it:
conda create --name tevit python=3.7.7
conda activate tevit
pip install "git+https://github.com/youtubevos/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
  • Install the Python requirements:
torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8
pip install -r requirements.txt
  • Please follow the MMDetection docs to install MMDetection:
python setup.py develop
  • Download the YouTube-VIS 2019 dataset from here, and organize the dataset as follows:
TeViT
├── data
│   ├── youtubevis
│   │   ├── train
│   │   │   ├── 003234408d
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── ...
│   │   ├── annotations
│   │   │   ├── train.json
│   │   │   ├── valid.json
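
Before launching training or inference, it can help to confirm that the dataset really follows the tree above. The following is a minimal sketch that only checks the directories and annotation files named in that tree; the helper name `missing_entries` and the root path `data/youtubevis` (relative to the repository root) are assumptions made for illustration.

```python
from pathlib import Path
from typing import List

# Entries taken from the directory tree above, relative to the dataset root.
EXPECTED_DIRS = ["train", "val", "annotations"]
EXPECTED_FILES = ["annotations/train.json", "annotations/valid.json"]


def missing_entries(root: str) -> List[str]:
    """Return the expected directories and files that are absent under `root`."""
    base = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing


# Assumes the script is run from the TeViT repository root.
problems = missing_entries("data/youtubevis")
print("layout looks complete" if not problems else f"missing: {problems}")
```

The check does not look inside the per-video folders or validate the JSON contents; it only catches a misplaced or misnamed top-level entry early.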

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference, the predicted results are stored in results.json. Submit this file to the evaluation server to get the final performance.
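
As a rough sketch of the hand-off to the evaluation server, the helper below checks that the results file parses as JSON and then wraps it in a zip archive. Two things here are assumptions rather than facts from this README: that the server expects a zip archive containing results.json at its root, and that the top-level JSON value is a list of predictions. The function name `package_results` and the archive name are hypothetical.

```python
import json
import zipfile
from pathlib import Path


def package_results(results_path: str, zip_path: str) -> int:
    """Validate the results file, zip it, and return the number of predictions."""
    # json.loads raises an error here if the file is truncated or malformed.
    predictions = json.loads(Path(results_path).read_text())
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        # Store the file at the archive root under its plain file name.
        archive.write(results_path, arcname=Path(results_path).name)
    # Assumption: the top-level JSON value is a list with one entry per prediction.
    return len(predictions)
```

A call such as `package_results("results.json", "submission.zip")` from the directory that holds the results file would produce the archive; check the server's own instructions for the exact upload format.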

Training

  • Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
  • Train TeViT with 8 GPUs:
./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • Train TeViT with multi-scale data augmentation:
./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT
  • The whole training process takes about three hours on 8 Tesla V100 GPUs.
  • To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.
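
If you launch these runs from a Python script rather than typing the commands by hand, the argument list can be assembled as sketched below. The script path, flags, and config name come from the commands above; the function name `build_train_command` and the checkpoint path are hypothetical.

```python
import shlex
from typing import List


def build_train_command(config: str, num_gpus: int, pretrained: str) -> List[str]:
    """Assemble the argument list for tools/dist_train.sh as used in the commands above."""
    return [
        "./tools/dist_train.sh",
        config,
        str(num_gpus),
        "--no-validate",
        "--cfg-options",
        f"load_from={pretrained}",
    ]


command = build_train_command(
    "configs/tevit/tevit_msgshift.py",
    8,
    "checkpoints/pretrained.pth",  # hypothetical path to the pretrained weights
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root; switching to the multi-scale recipe only requires passing `configs/tevit/tevit_msgshift_mstrain.py` as the config.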

Acknowledgement ❤️

This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star and citation 📝 :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation},
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}
Comments
  • I can't find pycocotools.ytvos

    Traceback (most recent call last):
      File "./tools/train.py", line 17, in <module>
        from mmdet.apis import init_random_seed, set_random_seed, train_detector
      File "/home/hss/TeViT-main/mmdet/apis/__init__.py", line 2, in <module>
        from .inference import (async_inference_detector, inference_detector,
      File "/home/hss/TeViT-main/mmdet/apis/inference.py", line 12, in <module>
        from mmdet.datasets import replace_ImageToTensor
      File "/home/hss/TeViT-main/mmdet/datasets/__init__.py", line 18, in <module>
        from .youtubevis import YoutubeVISDataset
      File "/home/hss/TeViT-main/mmdet/datasets/youtubevis.py", line 9, in <module>
        from pycocotools.ytvos import YTVOS
    ModuleNotFoundError: No module named 'pycocotools.ytvos'

    documentation 
    opened by czj942650673 5
  • Ran out of input

    Thanks for your work. I have been struggling with this error for days. Could you please suggest a way to overcome it?

    (base) root@78845bc2a82a:/mmdetection/Tevit# python tools/test_vis.py configs/tevit/tevit_msgshift.py checkpoint/tevit_r50.pth
    load checkpoint from local path: checkpoint/tevit_r50.pth
    Traceback (most recent call last):
      File "tools/test_vis.py", line 137, in <module>
        main(args)
      File "tools/test_vis.py", line 54, in main
        cfg_options=args.cfg_options)
      File "/opt/conda/lib/python3.7/site-packages/mmdet/apis/inference.py", line 45, in init_detector
        checkpoint = load_checkpoint(model, checkpoint, map_location='cpu')
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 581, in load_checkpoint
        checkpoint = _load_checkpoint(filename, map_location, logger)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 520, in _load_checkpoint
        return CheckpointLoader.load_checkpoint(filename, map_location, logger)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 285, in load_checkpoint
        return checkpoint_loader(filename, map_location)
      File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 302, in load_from_local
        checkpoint = torch.load(filename, map_location=map_location)
      File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 585, in load
        return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
      File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 755, in _legacy_load
        magic_number = pickle_module.load(f, **pickle_load_args)
    EOFError: Ran out of input

    opened by assia855 2
  • Evaluation code for bbox

    Thanks for your great work. I am not familiar with the YouTube-VIS API for bbox evaluation and want to learn about the evaluation procedure based on the provided training set annotations. Besides the evaluation code for segmentation, could you provide code that stores the bbox prediction results in the standard format for evaluation? Thanks.

    opened by alexzeng1206 0
  • About demo test

    Thanks for your excellent work. When I test image_demo.py (or video_demo.py) in /demo with demo.jpg (or demo.mp4) as input, I run into the problem shown below. Could you please provide some advice? Looking forward to your reply.

    /TeViT/mmdet/apis/inference.py:50: UserWarning: Class names are not saved in the checkpoint's meta data, use COCO classes by default.
      warnings.warn('Class names are not saved in the checkpoint's '

    Traceback (most recent call last):
      File "image_demo.py", line 65, in <module>
        main(args)
      File "image_demo.py", line 35, in main
        result = inference_detector(model, args.img)
      File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector
        data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']]
    TypeError: 'DataContainer' object is not iterable

      File "video_demo.py", line 61, in <module>
        main()
      File "video_demo.py", line 47, in main
        result = inference_detector(model, frame)
      File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector
        data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']]
    TypeError: 'DataContainer' object is not iterable

    opened by LuoYingzhao 1
  • Are you interested in creating a PR about this work for mmtracking?

    Wonderful work! We're very interested in your work. VIS is a future key development direction in mmtracking. We'll appreciate it if you can create a PR about this work in mmtracking.

    opened by JingweiZhang12 1
  •  'TeViT is not in the models registry'

    Hi, I am trying to run this repo and I have followed all the steps mentioned in the description.

    While running the inference code, I am getting the following error.

    python tools/test_vis.py configs/tevit/tevit_msgshift.py checkpoints/tevit_r50.pth
    Traceback (most recent call last):
      File "tools/test_vis.py", line 130, in <module>
        main(args)
      File "tools/test_vis.py", line 53, in main
        cfg_options=args.cfg_options)
      File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/apis/inference.py", line 43, in init_detector
        model = build_detector(config.model, test_cfg=config.get('test_cfg'))
      File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/models/builder.py", line 59, in build_detector
        cfg, default_args=dict(train_cfg=train_cfg, test_cfg=test_cfg))
      File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 212, in build
        return self.build_func(*args, **kwargs, registry=self)
      File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
        return build_from_cfg(cfg, registry, default_args)
      File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 45, in build_from_cfg
        f'{obj_type} is not in the {registry.name} registry')
    KeyError: 'TeViT is not in the models registry'

    Also, could you guide me on how to create a script that runs this algorithm directly on a video or a set of images and saves the output as images and JSON in a folder?

    opened by PoojanPanchal 2
Owner
Hust Visual Learning Team
Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC at HUST, led by @xinggangw