Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Hust Visual Learning Team

Last update: Dec 31, 2022

Overview

Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR 2022, Oral)

by Shusheng Yang¹,³, Xinggang Wang¹ 📧, Yu Li⁴, Yuxin Fang¹, Jiemin Fang¹,², Wenyu Liu¹, Xun Zhao³, Ying Shan³.

¹ School of EIC, HUST, ² AIA, HUST, ³ ARC Lab, Tencent PCG, ⁴ IDEA.

(📧) corresponding author.

This repo provides code, models and training/inference recipes for TeViT (Temporally Efficient Vision Transformer for Video Instance Segmentation).
TeViT is a transformer-based end-to-end video instance segmentation framework. We build our framework upon query-based instance segmentation methods, i.e., QueryInst.
We propose a messenger shift mechanism in the transformer backbone, as well as a spatiotemporal query interaction (STQI) head in the instance heads. These two designs fully utilize both frame-level and instance-level temporal context information and obtain strong temporal modeling capacity with negligible extra computational cost.
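
As a rough intuition for the messenger shift idea, the sketch below rolls a few per-frame tokens along the time axis so that each frame ends up holding tokens that originated in neighbouring frames. Everything in it is an illustrative assumption (the function name `shift_messengers`, the per-token shift pattern, the wrap-around at the clip boundary); it is not the implementation shipped in this repo.

```python
from typing import List, Sequence


def shift_messengers(tokens: List[List[str]], shifts: Sequence[int]) -> List[List[str]]:
    """Roll messenger token m along the time axis by shifts[m] frames.

    tokens[t][m] is messenger m of frame t. A shift of 0 keeps the token in
    place, +1 hands it to the next frame, -1 hands it to the previous frame.
    Wrap-around at the clip boundary is an assumption of this sketch.
    """
    num_frames = len(tokens)
    return [
        [tokens[(t - shifts[m]) % num_frames][m] for m in range(len(shifts))]
        for t in range(num_frames)
    ]


# Three frames with three messenger tokens each, labelled "f<frame>m<index>".
clip = [[f"f{t}m{m}" for m in range(3)] for t in range(3)]
shifted = shift_messengers(clip, shifts=[0, 1, -1])
for t, frame in enumerate(shifted):
    print(t, frame)
```

In this toy setup every frame ends up with one token from each of the three frames, and the exchange itself is only a reindexing of tokens.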

Models and Main Results

We provide both checkpoints and CodaLab server submissions on the YouTube-VIS-2019 dataset.

Name	AP	AP@50	AP@75	AR@1	AR@10	model	submission
TeViT_MsgShifT	46.3	70.6	50.9	45.2	54.3	link	link
TeViT_MsgShifT_MST	46.9	70.1	52.9	45.0	53.4	link	link

Because training is unstable, we conducted multiple runs; each checkpoint above is the best of its runs. The average performance is reported in our paper.
Besides the basic models, we also provide TeViT with ResNet-50 and Swin-L backbones. These models are also trained on the YouTube-VIS-2019 dataset.
MST denotes multi-scale training.

Name	AP	AP@50	AP@75	AR@1	AR@10	model	submission
TeViT_R50	42.1	67.8	44.8	41.3	49.9	link	link
TeViT_Swin-L_MST	56.8	80.6	63.1	52.0	63.3	link	link

Due to backbone limitations, TeViT models with ResNet-50 and Swin-L backbones use the STQI Head only (i.e., without our proposed messenger shift mechanism).
With Swin-L as the backbone network, we use more instance queries (300 instead of 100) and stronger data augmentation strategies. Both further boost the final performance.

Installation

Prerequisites

Linux
Python 3.7+
CUDA 10.2+
GCC 5+

Prepare

Clone the repository locally:

git clone https://github.com/hustvl/TeViT.git

Create a conda virtual environment and activate it:

conda create --name tevit python=3.7.7
conda activate tevit

Install YTVOS Version API from youtubevos/cocoapi:

pip install "git+https://github.com/youtubevos/cocoapi.git#egg=pycocotools&subdirectory=PythonAPI"
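
The dataset code in this repo imports `YTVOS` from `pycocotools.ytvos`, a module that the stock `pycocotools` package does not provide. A quick way to confirm that the YTVOS fork was actually installed is sketched below; the helper name `has_module` is only for illustration.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `pycocotools`) is missing or broken.
        return False


for module in ("pycocotools", "pycocotools.ytvos"):
    print(f"{module}: {'found' if has_module(module) else 'MISSING'}")
```

If `pycocotools` is found but `pycocotools.ytvos` is missing, the standard package is probably shadowing the fork; uninstalling it and rerunning the install command above is a reasonable first step.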

Install Python requirements. The versions below are the ones listed for this repo:

torch==1.9.0
torchvision==0.10.0
mmcv==1.4.8

Then run:

pip install -r requirements.txt

Please follow the MMDetection docs to install MMDetection:

python setup.py develop

Download the YouTube-VIS 2019 dataset from here and organize it as follows:

TeViT
├── data
│   ├── youtubevis
│   │   ├── train
│   │   │   ├── 003234408d
│   │   │   ├── ...
│   │   ├── val
│   │   │   ├── ...
│   │   ├── annotations
│   │   │   ├── train.json
│   │   │   ├── valid.json
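
Before launching training or inference, it can help to check that the layout matches the tree above. The following is a minimal sketch; the helper name `missing_entries` and the choice of which entries to check are assumptions made for illustration.

```python
from pathlib import Path
from typing import List

# Relative paths taken from the directory tree above.
EXPECTED = [
    "train",
    "val",
    "annotations/train.json",
    "annotations/valid.json",
]


def missing_entries(root: Path) -> List[str]:
    """Return the expected entries that do not exist under `root`."""
    return [rel for rel in EXPECTED if not (root / rel).exists()]


root = Path("data/youtubevis")
missing = missing_entries(root)
print("layout OK" if not missing else f"missing under {root}: {missing}")
```

Run it from the TeViT repository root so that the relative path `data/youtubevis` resolves correctly.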

Inference

python tools/test_vis.py configs/tevit/tevit_msgshift.py $PATH_TO_CHECKPOINT

After inference, the predicted results are stored in results.json. Submit it to the evaluation server to get the final performance.
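
Evaluation servers commonly expect the results file inside a zip archive; whether this particular server does is an assumption here, so check its submission instructions first. Under that assumption, a small helper (the name `pack_submission` and the archive name `submission.zip` are hypothetical) could look like this:

```python
import zipfile
from pathlib import Path


def pack_submission(results_path: str = "results.json",
                    archive_path: str = "submission.zip") -> Path:
    """Zip the results file, stored at the archive root under its own name."""
    results = Path(results_path)
    if not results.is_file():
        raise FileNotFoundError(f"{results} not found; run inference first")
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.write(results, arcname=results.name)
    return Path(archive_path)
```

Calling `pack_submission()` from the directory that contains results.json writes submission.zip next to it.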

Training

Download the COCO pretrained QueryInst with PVT-B1 backbone from here.
Train TeViT with 8 GPUs:

./tools/dist_train.sh configs/tevit/tevit_msgshift.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT

Train TeViT with multi-scale data augmentation:

./tools/dist_train.sh configs/tevit/tevit_msgshift_mstrain.py 8 --no-validate --cfg-options load_from=$PATH_TO_PRETRAINED_WEIGHT

The whole training process takes about three hours on 8 Tesla V100 GPUs.
To train TeViT with ResNet-50 or Swin-L backbone, please download the COCO pretrained weights from QueryInst.

Acknowledgement ❤️

This code is mainly based on mmdetection and QueryInst. Thanks for their awesome work and great contributions to the computer vision community!

Citation

If you find our paper and code useful in your research, please consider giving a star ⭐ and citation 📝 :

@inproceedings{yang2022tevit,
  title={Temporally Efficient Vision Transformer for Video Instance Segmentation},
  author={Yang, Shusheng and Wang, Xinggang and Li, Yu and Fang, Yuxin and Fang, Jiemin and Liu, Wenyu and Zhao, Xun and Shan, Ying},
  booktitle =   {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year      =   {2022}
}

Comments

I can't find pycocotools.ytvos

Traceback (most recent call last):
  File "./tools/train.py", line 17, in <module>
    from mmdet.apis import init_random_seed, set_random_seed, train_detector
  File "/home/hss/TeViT-main/mmdet/apis/__init__.py", line 2, in <module>
    from .inference import (async_inference_detector, inference_detector,
  File "/home/hss/TeViT-main/mmdet/apis/inference.py", line 12, in <module>
    from mmdet.datasets import replace_ImageToTensor
  File "/home/hss/TeViT-main/mmdet/datasets/__init__.py", line 18, in <module>
    from .youtubevis import YoutubeVISDataset
  File "/home/hss/TeViT-main/mmdet/datasets/youtubevis.py", line 9, in <module>
    from pycocotools.ytvos import YTVOS
ModuleNotFoundError: No module named 'pycocotools.ytvos'

opened by czj942650673 5
Ran out of input

Thanks for your work. I have been struggling with this error for days. Can you please suggest a way to fix it?

(base) root@78845bc2a82a:/mmdetection/Tevit# python tools/test_vis.py configs/tevit/tevit_msgshift.py checkpoint/tevit_r50.pth
load checkpoint from local path: checkpoint/tevit_r50.pth
Traceback (most recent call last):
  File "tools/test_vis.py", line 137, in <module>
    main(args)
  File "tools/test_vis.py", line 54, in main
    cfg_options=args.cfg_options)
  File "/opt/conda/lib/python3.7/site-packages/mmdet/apis/inference.py", line 45, in init_detector
    checkpoint = load_checkpoint(model, checkpoint, map_location='cpu')
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 581, in load_checkpoint
    checkpoint = _load_checkpoint(filename, map_location, logger)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 520, in _load_checkpoint
    return CheckpointLoader.load_checkpoint(filename, map_location, logger)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 285, in load_checkpoint
    return checkpoint_loader(filename, map_location)
  File "/opt/conda/lib/python3.7/site-packages/mmcv/runner/checkpoint.py", line 302, in load_from_local
    checkpoint = torch.load(filename, map_location=map_location)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 585, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/serialization.py", line 755, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
EOFError: Ran out of input

opened by assia855 2
Evaluation code for bbox

Thanks for your great work. I am not familiar with the youtubevis API for bbox evaluation and want to learn about the evaluation procedure based on the provided train-set annotations. Besides the evaluation code for segmentation, can you provide code that stores the bbox prediction results in the standard format for evaluation? Thanks.

opened by alexzeng1206 0
About demo test
Thanks for your excellent work. When I run image_demo.py (or video_demo.py) in /demo with demo.jpg (or demo.mp4) as input, I hit the problem below. Could you please provide some advice? Looking forward to your reply.

/TeViT/mmdet/apis/inference.py:50: UserWarning: Class names are not saved in the checkpoint's meta data, use COCO classes by default.
  warnings.warn('Class names are not saved in the checkpoint's '

Traceback (most recent call last):
  File "image_demo.py", line 65, in <module>
    main(args)
  File "image_demo.py", line 35, in main
    result = inference_detector(model, args.img)
  File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector
    data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']]
TypeError: 'DataContainer' object is not iterable

  File "video_demo.py", line 61, in <module>
    main()
  File "video_demo.py", line 47, in main
    result = inference_detector(model, frame)
  File "/TeViT/mmdet/apis/inference.py", line 137, in inference_detector
    data['img_metas'] = [img_metas.data[0] for img_metas in data['img_metas']]
TypeError: 'DataContainer' object is not iterable
opened by LuoYingzhao 1
Are you interested in creating a PR about this work for mmtracking?

Wonderful work! We're very interested in your work. VIS is a future key development direction in mmtracking. We'll appreciate it if you can create a PR about this work in mmtracking.

opened by JingweiZhang12 1
'TeViT is not in the models registry'

Hi, I am trying to run this repo and I have followed all the steps mentioned in the description.

While running the inference code, I am getting the following error.

python tools/test_vis.py configs/tevit/tevit_msgshift.py checkpoints/tevit_r50.pth
Traceback (most recent call last):
  File "tools/test_vis.py", line 130, in <module>
    main(args)
  File "tools/test_vis.py", line 53, in main
    cfg_options=args.cfg_options)
  File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/apis/inference.py", line 43, in init_detector
    model = build_detector(config.model, test_cfg=config.get('test_cfg'))
  File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmdet/models/builder.py", line 59, in build_detector
    cfg, default_args=dict(train_cfg=train_cfg, test_cfg=test_cfg))
  File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 212, in build
    return self.build_func(*args, **kwargs, registry=self)
  File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/cnn/builder.py", line 27, in build_model_from_cfg
    return build_from_cfg(cfg, registry, default_args)
  File "/home/quidich/.virtualenvs/tevit/lib/python3.6/site-packages/mmcv/utils/registry.py", line 45, in build_from_cfg
    f'{obj_type} is not in the {registry.name} registry')
KeyError: 'TeViT is not in the models registry'

Could you also guide me on creating a script that runs this algorithm directly on a video or a set of images and saves the output, as images and JSON, to a folder?

opened by PoojanPanchal 2

Owner

Hust Visual Learning Team

Hust Visual Learning Team belongs to the Artificial Intelligence Research Institute in the School of EIC at HUST, led by @xinggangw

Paper https://arxiv.org/abs/2204.08412
