[CVPR2021 Oral] End-to-End Video Instance Segmentation with Transformers

Yuqing Wang

Last update: Jan 7, 2023

Overview

VisTR: End-to-End Video Instance Segmentation with Transformers

This is the official implementation of the VisTR paper.

Installation

We provide instructions on how to install the dependencies via conda. First, clone the repository locally:

git clone https://github.com/Epiphqny/vistr.git

Then, install PyTorch 1.6 and torchvision 0.7:

conda install pytorch==1.6.0 torchvision==0.7.0

Install pycocotools

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"
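
After running the commands above, it may help to confirm that the `pycocotools.ytvos` submodule is actually importable before starting inference. The snippet below is a minimal sketch; `has_module` is a hypothetical helper, not part of this repository, and it assumes the `ytvos` submodule is the one supplied by the youtubevos fork installed in the last command.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be located (parent packages get imported)."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (e.g. `pycocotools`) is itself missing.
        return False


if __name__ == "__main__":
    for mod in ("pycocotools", "pycocotools.ytvos"):
        print(f"{mod}: {'found' if has_module(mod) else 'MISSING'}")
```

If `pycocotools` is found but `pycocotools.ytvos` is missing, one possible cause is that a different `pycocotools` build is shadowing the youtubevos one in the active environment.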

Compile the DCN module (requires GCC>=5.3, CUDA>=10.0):

cd models/dcn
python setup.py build_ext --inplace

Preparation

Download and extract the 2019 version of the YouTubeVIS train and val images with annotations from CodaLab or YouTubeVIS. We expect the directory structure to be the following:

VisTR
├── data
│   ├── train
│   ├── val
│   ├── annotations
│   │   ├── instances_train_sub.json
│   │   ├── instances_val_sub.json
├── models
...
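
A quick way to check the layout before training is to test for each expected entry. This is only a sketch: `missing_paths` and `EXPECTED` are hypothetical names introduced here, and the list covers only the entries shown in the tree above.

```python
from pathlib import Path

# Relative paths taken from the directory tree above.
EXPECTED = (
    "data/train",
    "data/val",
    "data/annotations/instances_train_sub.json",
    "data/annotations/instances_val_sub.json",
    "models",
)


def missing_paths(root):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    import sys

    root = sys.argv[1] if len(sys.argv) > 1 else "."
    missing = missing_paths(root)
    print("layout OK" if not missing else "missing: " + ", ".join(missing))
```

Run it from the repository root, or pass the root as the first argument.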

Download the DETR models pretrained on COCO and save them to the pretrained path.

Training

Training requires a GPU with at least 32 GB of memory; we ran our experiments on 32 GB V100 cards.

To train the baseline VisTR on a single node with 8 GPUs for 16 epochs, run the following (set --backbone to resnet101 or resnet50):

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --backbone resnet101 --ytvos_path /path/to/ytvos --masks --pretrained_weights /path/to/pretrained_path

Inference

python inference.py --masks --model_path /path/to/model_weights --save_path /path/to/results.json

Models

We provide baseline VisTR models and plan to include more in the future. AP is computed on the YouTubeVIS dataset by submitting the result JSON file to the CodaLab system, and inference time is measured as pure model inference time (without data loading and post-processing).

name    backbone    FPS     mask AP    model                      md5
VisTR   R50         69.9    34.4       vistr_r50 (please wait)    -
VisTR   R101        57.7    36.5       vistr_r101                 2b8d412225121fb1694427ab69a40656
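
The md5 column can be used to verify a downloaded checkpoint. The following is a minimal sketch using only the standard library; `md5sum` is a hypothetical helper, and the checkpoint path is whatever file you downloaded.

```python
import hashlib


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# MD5 listed for vistr_r101 in the table above.
EXPECTED_R101_MD5 = "2b8d412225121fb1694427ab69a40656"

if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        actual = md5sum(sys.argv[1])
        print("match" if actual == EXPECTED_R101_MD5 else f"mismatch: {actual}")
```

Reading in chunks keeps memory use flat even for large checkpoint files.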

License

VisTR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Acknowledgement

We would like to thank the DETR open-source project for its awesome work; part of the code is modified from that project.

Citation

Please consider citing our paper in your publications if the project helps your research. The BibTeX reference is as follows.

@inproceedings{wang2020end,
  title={End-to-End Video Instance Segmentation with Transformers},
  author={Wang, Yuqing and Xu, Zhaoliang and Wang, Xinlong and Shen, Chunhua and Cheng, Baoshan and Shen, Hao and Xia, Huaxia},
  booktitle =  {Proc. IEEE Conf. Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

Comments

ModuleNotFoundError: No module named 'pycocotools.ytvos'

Hi! I successfully executed the instructions in the "Install pycocotools" part, but when I perform inference, the above problem occurs. I tried to solve it, but it didn't work. I hope to get your help!

opened by CuberrChen 8
python inference.py --masks --model_path vistr_r50.pth

  File "inference.py", line 236, in <module>
    main(args)
  File "inference.py", line 172, in main
    model.load_state_dict(state_dict)
  File "/root/anaconda3/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1044, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VisTRsegm:
    Missing key(s) in state_dict: "vistr.backbone.0.body.layer3.6.conv1.weight",
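
One thing worth checking with this kind of error: in torchvision's ResNets, `layer3` has 6 blocks (indices 0 to 5) in ResNet-50 and 23 blocks in ResNet-101, so a key containing `layer3.6` exists only in the deeper backbone. That suggests, though the thread does not confirm it, that an R50 checkpoint was loaded into a model built with a ResNet-101 backbone. The sketch below guesses the backbone from a checkpoint's key names; `guess_backbone` is a hypothetical helper, not part of this repository.

```python
import re

# Number of residual blocks in `layer3` for the two ResNets mentioned in
# this README (ResNet-50: 6 blocks, ResNet-101: 23 blocks).
LAYER3_BLOCKS = {6: "resnet50", 23: "resnet101"}

LAYER3_PATTERN = re.compile(r"\.layer3\.(\d+)\.")


def guess_backbone(keys):
    """Guess the ResNet depth from parameter names such as
    'vistr.backbone.0.body.layer3.6.conv1.weight'."""
    indices = []
    for key in keys:
        match = LAYER3_PATTERN.search(key)
        if match:
            indices.append(int(match.group(1)))
    if not indices:
        return None  # no layer3 parameters found
    # Block indices start at 0, so the block count is max index + 1.
    return LAYER3_BLOCKS.get(max(indices) + 1)
```

To use it, pass the keys of the loaded state dict (for example, the keys of the dictionary returned by `torch.load`, or of a nested entry if the checkpoint wraps the weights) and compare the result with the backbone the model is built with.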

opened by qslia 6
ModuleNotFoundError: No module named 'pycocotools.ytvos'

I have done this:

git clone https://github.com/youtubevos/cocoapi.git
cd PythonAPI
python setup.py build_ext --inplace
python setup.py build_ext install

but I still get: No module named 'pycocotools.ytvos'

pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"

Looking in indexes: https://repo.huaweicloud.com/repository/pypi/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
  Cloning https://github.com/cocodataset/cocoapi.git to /tmp/pip-req-build-x2a4p4jp
  Running command git clone -q https://github.com/cocodataset/cocoapi.git /tmp/pip-req-build-x2a4p4jp
  fatal: unable to access 'https://github.com/cocodataset/cocoapi.git/': gnutls_handshake() failed: The TLS connection was non-properly terminated.
WARNING: Discarding git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI. Command errored out with exit status 128: git clone -q https://github.com/cocodataset/cocoapi.git /tmp/pip-req-build-x2a4p4jp Check the logs for full command output.
ERROR: Command errored out with exit status 128: git clone -q https://github.com/cocodataset/cocoapi.git /tmp/pip-req-build-x2a4p4jp Check the logs for full command output.

opened by qslia 6
Question about Classes Index
Thank you for sharing this great work! I have a question about your class indices. YouTubeVIS has 40 classes, and VisTR adds an empty class, so there are 41 classes in total, right? I noticed that in your implementation, loss_labels() is computed by

target_classes = torch.full(src_logits.shape[:2], self.num_classes, dtype=torch.int64, device=src_logits.device)
target_classes[idx] = target_classes_o
loss_ce = F.cross_entropy(src_logits.transpose(1, 2), target_classes, self.empty_weight)

Since the YTVIS class indices run from 1 to 40 and I didn't find any re-indexing operation in your code, I am confused by the empty class here. Do you use 40 to represent the empty class (you set self.num_classes=40)? Thank you.
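
To make the concern concrete, here is a plain-Python sketch of the target construction quoted above, assuming a DETR-style convention in which unmatched queries are filled with `num_classes` as the empty index. `build_target_classes` is a hypothetical helper, and the actual value of `num_classes` in this repository is not confirmed here.

```python
def build_target_classes(num_queries, num_classes, matched):
    """Mimic the target construction quoted above for one sample.

    `matched` maps a query index to its ground-truth label. Unmatched
    queries receive `num_classes`, which acts as the "empty" index.
    """
    targets = [num_classes] * num_queries      # like torch.full(..., num_classes)
    for query_idx, label in matched.items():   # like target_classes[idx] = ...
        if not 0 <= label < num_classes:
            raise ValueError(
                f"label {label} collides with or exceeds the empty index {num_classes}"
            )
        targets[query_idx] = label
    return targets
```

Under this convention, labels 1 to 40 only stay distinct from the empty class if `num_classes` is at least 41 (leaving index 0 unused); with `num_classes=40`, label 40 would coincide with the empty index unless the labels are shifted to 0 to 39.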
opened by lxa9867 6
How to process the video that has more than 36 frames?

Thanks for your excellent work. I have one question about the inference process. If a video is longer than 36 frames, how do you link the tracks from different clips? Looking forward to your reply.
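
For reference, the first half of the problem, cutting a long video into fixed-length clips, can be sketched as below. This is not the authors' method: `split_into_clips` is a hypothetical helper, padding the last clip by repeating its final frame is just one possible choice, and the sketch does not address linking instances across clips, which is what the question asks about.

```python
def split_into_clips(num_frames_total, clip_len=36):
    """Split frame indices 0..num_frames_total-1 into clips of `clip_len`.

    The last clip is padded by repeating the final frame index so that
    every clip has exactly `clip_len` entries.
    """
    if num_frames_total <= 0:
        return []
    clips = []
    for start in range(0, num_frames_total, clip_len):
        clip = list(range(start, min(start + clip_len, num_frames_total)))
        clip += [clip[-1]] * (clip_len - len(clip))  # pad the final clip
        clips.append(clip)
    return clips
```

An 80-frame video, for example, yields three clips, the last of which holds frames 72 to 79 followed by repeated copies of frame 79.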

opened by jxiangli 6
The default training params are different between paper and code.

For example, lr_backbone is 1e-4 in the paper but 1e-5 in the code, and the number of epochs is 10 in the paper but 18 in the code. Should we adjust these to match the paper, or just use the default parameters to reproduce the results?

opened by zzzzzz0407 4
Why input size is d × (T · H · W ) ?
Some questions:

In Section 3.1, you use d × (T · H · W ) as input size. Why not T × (d · H · W )?

CNN feature is reshaped to B×C×THW. However, the input token size is HW × B × C for transformer encoder in code. It seems confusing.
opened by WinstonDeng 4
Can VisTR be applied to the task of semantic segmentation of video objects?

Thank you very much for your great work. I have followed you. Can VisTR be applied to the task of video objects semantic segmentation? What changes do we need to make when migrating your network to the task of video objects semantic segmentation? looking forward to your reply.

opened by longmalongma 4
I couldn't find the instances_val_train_sun.json file on the CodaLab page

Thanks for your open-source code. It is very helpful for my research. But when I tried to train the model, I couldn't find the corresponding files 'instance_train_sub.json' or 'instance__val_sub.json' on the CodaLab page (https://competitions.codalab.org/competitions/20128#participate-get_data). I could only download the image data. For the annotations, I only got the test.json, val.json, and train.json files, and I couldn't find the annotation information in those JSON files. I would appreciate it if you could help me solve this problem.

opened by StiphyJay 3
Too many iterations in ONE EPOCH?

Dear authors,

Thanks for your great open-source work. I have a question regarding the training:

In each epoch, the number of iterations equals the number of images. However, in each iteration, the input is a whole video clip that contains 36 images. That means that, on average, one image is trained on 36 times in the same epoch. In general, I think a common approach is to set the number of iterations equal to the number of videos, so that each image is seen only once per epoch. I am wondering why there are so many iterations in one epoch. Is this set specifically for the method, or is it just for convenience of implementation?
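
The difference between the two indexing schemes can be illustrated with simple arithmetic. The dataset sizes below are hypothetical, and `iterations_per_epoch` is an illustrative helper that assumes the dataset is sharded evenly across processes, as a distributed sampler typically does.

```python
import math


def iterations_per_epoch(num_samples, batch_size=1, world_size=1):
    """Iterations each process runs per epoch when the dataset is sharded
    evenly across `world_size` processes."""
    per_process = math.ceil(num_samples / world_size)
    return math.ceil(per_process / batch_size)


# Hypothetical dataset sizes, for illustration only.
num_videos, frames_per_video = 100, 30
num_frames = num_videos * frames_per_video

print(iterations_per_epoch(num_frames, world_size=8))  # one sample per frame -> 375
print(iterations_per_epoch(num_videos, world_size=8))  # one sample per video -> 13
```

With one sample per frame, each iteration still loads a full clip around that frame, which is why every image ends up being visited many times per epoch.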

Thanks a lot! I look forward to hearing from you.

opened by JialianW 3
Precomputed results?

Thanks for the work! Besides the pretrained model, can the precomputed results (jsons) also be made available?

I don't have access to a V100, and I would really appreciate having the precomputed results. Thanks again!

opened by hkchengrex 3
Annotations and masks in YouTubeVIS2021 dataset

Hello! The definition of the YouTubeVIS2021 dataset on CodaLab (https://competitions.codalab.org/competitions/28988#participate-get_data) differs from the annotation files available through its download links. Where is the annotation block

annotation{
    "id" : int,
    "video_id" : int,
    "category_id" : int,
    "segmentations" : [RLE or [polygon] or None],
    "areas" : [float or None],
    "bboxes" : [[x,y,width,height] or None],
    "iscrowd" : 0 or 1,
}

in these JSON files? How can a model be trained on this data without any information about masks, boxes, etc.? Can you advise how to train the model with my own classes and masks?
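
A quick way to see what a downloaded JSON file actually contains is to count its entries. The sketch below is hedged: `summarize_annotations` is a hypothetical helper, the top-level "annotations" key and the "segmentations" field follow the block quoted above, and the top-level "videos" key is an assumption about the file layout.

```python
import json


def summarize_annotations(data):
    """Count videos and annotations, and how many annotations carry masks."""
    annotations = data.get("annotations") or []
    with_masks = sum(
        1
        for ann in annotations
        if any(seg is not None for seg in ann.get("segmentations") or [])
    )
    return {
        "videos": len(data.get("videos") or []),  # "videos" key is assumed
        "annotations": len(annotations),
        "with_masks": with_masks,
    }


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        with open(sys.argv[1]) as fh:
            print(summarize_annotations(json.load(fh)))
```

A file that reports zero annotations cannot be used for training on its own; a split released without labels for server-side evaluation would look like this.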

opened by illyyyaaaa 0
frame_id-inds[i], why do this?

for j in range(self.num_frames):
    img_path = os.path.join(str(self.img_folder), self.vid_infos[vid]['file_names'][frame_id-inds[j]])

Why do you subtract in frame_id-inds[j] here, and why do you reverse the order of inds beforehand?

opened by hrz2000 0
string indices must be integers

when I run the code on the youtube vis 2019 dataset, the issue is as the picture shows, how do I solve this problem? https://drive.google.com/file/d/1k3ikymcD2-3pBlM8SV_236Xgk3KdM3UD/view?usp=sharing

opened by Jess1989 0
from . import _C run error?
I have run the setup.py in /dcn, but still can't run /dcn/deform_conv.py.

from . import _C

ImportError: attempted relative import with no known parent package

How can I solve this problem?
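
This particular ImportError is what Python raises when a file that uses a relative import is executed directly as a script rather than imported as part of its package. The self-contained sketch below reproduces the behavior with a throwaway package; the names `pkg`, `helper`, and `mod` are hypothetical and unrelated to this repository.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Build a throwaway package and run one module as a file and with -m."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "helper.py").write_text("VALUE = 42\n")
        (pkg / "mod.py").write_text("from . import helper\nprint(helper.VALUE)\n")

        # Running the file directly: no parent package is known, so it fails.
        as_file = subprocess.run(
            [sys.executable, str(pkg / "mod.py")],
            capture_output=True, text=True, cwd=tmp,
        )
        # Running it as a module from the parent directory: the import works.
        as_module = subprocess.run(
            [sys.executable, "-m", "pkg.mod"],
            capture_output=True, text=True, cwd=tmp,
        )
    return as_file, as_module


if __name__ == "__main__":
    as_file, as_module = run_both_ways()
    print("file  :", as_file.returncode, as_file.stderr.strip().splitlines()[-1])
    print("module:", as_module.returncode, as_module.stdout.strip())
```

Applied here, that would mean importing the DCN code through its package from the repository root (for example via the training or inference entry points) rather than running deform_conv.py directly, assuming the models directory is importable as a package.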
opened by linglingling0001215 0
youtube vis dataset problems

Hello,

I downloaded the YouTube-VIS 2019 dataset. Its format differs from the one you require.

Therefore, I am wondering why the dataset from your linked website comes in a different format.

Are there any tools we should run to get the format you laid out?

opened by lywang76 0
Sequential processing of video frames in backbone

Hi @Epiphqny,

I have a question regarding your implementation, specifically the way you pass raw data to the backbone. If the input is a video with 36 frames, how are they being processed by a ResNet-50/101?

In the BackboneBase class, forward method, you are passing the tensor list to a backbone, but the backbone is expecting an input of size [64, 3, 7, 7]. As I understand from the paper you are reshaping the videos to [36x300x540] .. so there should be one more preprocessing step from the videos to the backbone.

Can you shed some light on this extra step?

Thanks!

opened by xserban 0

Owner

Yuqing Wang

Computer vision. Instance segmentation.

GitHub https://arxiv.org/abs/2011.14503

Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation)

Official PyTorch Implementation for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'2021, Oral Presentation) HOTR: End-to-

114 Nov 28, 2022

Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Temporally Efficient Vision Transformer for Video Instance Segmentation Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR

203 Dec 31, 2022

[CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

UP-DETR: Unsupervised Pre-training for Object Detection with Transformers This is the official PyTorch implementation and models for UP-DETR paper: @a

430 Dec 23, 2022

"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

SOLQ: Segmenting Objects by Learning Queries This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

179 Jan 2, 2023

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation E2EC: An End-to-End Contour-based Method for High-Quality H

146 Dec 29, 2022

🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

🐤 Nix-TTS An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

156 Jan 9, 2023

Sparse R-CNN: End-to-End Object Detection with Learnable Proposals, CVPR2021

End-to-End Object Detection with Learnable Proposal, CVPR2021

1.2k Dec 27, 2022

Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation

Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation This paper has been accepted and early accessed

39 Sep 20, 2022

Official code for "End-to-End Optimization of Scene Layout" -- including VAE, Diff Render, SPADE for colorization (CVPR 2020 Oral)

End-to-End Optimization of Scene Layout Code release for: End-to-End Optimization of Scene Layout CVPR 2020 (Oral) Project site, Bibtex For help conta

41 Dec 9, 2022

[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral] By Zhicheng Huang*, Zhaoyang Zeng*, Yupan H

196 Dec 13, 2022

[CVPR 2022 Oral] EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation

EPro-PnP EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation In CVPR 2022 (Oral). [paper] Hanshen

同济大学智能汽车研究所综合感知研究组 ( Comprehensive Perception Research Group under Institute of Intelligent Vehicles, School of Automotive Studies, Tongji University)

842 Jan 4, 2023

[CVPR 2022 Oral] MixFormer: End-to-End Tracking with Iterative Mixed Attention

MixFormer The official implementation of the CVPR 2022 paper MixFormer: End-to-End Tracking with Iterative Mixed Attention [Models and Raw results] (G

Multimedia Computing Group, Nanjing University

235 Jan 3, 2023

[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transformers

TubeDETR: Spatio-Temporal Video Grounding with Transformers Website • STVG Demo • Paper This repository provides the code for our paper. This includes

108 Dec 27, 2022

[CVPR2021 Oral] FFB6D: A Full Flow Bidirectional Fusion Network for 6D Pose Estimation.

FFB6D This is the official source code for the CVPR2021 Oral work, FFB6D: A Full Flow Biderectional Fusion Network for 6D Pose Estimation. (Arxiv) Tab

201 Dec 28, 2022

The official repo of the CVPR2021 oral paper: Representative Batch Normalization with Feature Calibration

Representative Batch Normalization (RBN) with Feature Calibration The official implementation of the CVPR2021 oral paper: Representative Batch Normali

76 Nov 9, 2022

PyTorch implementation of our Adam-NSCL algorithm from our CVPR2021 (oral) paper "Training Networks in Null Space for Continual Learning"

Adam-NSCL This is a PyTorch implementation of Adam-NSCL algorithm for continual learning from our CVPR2021 (oral) paper: Title: Training Networks in N

34 Dec 21, 2022

Research code for CVPR 2021 paper "End-to-End Human Pose and Mesh Reconstruction with Transformers"

MeshTransformer ✨ This is our research code of End-to-End Human Pose and Mesh Reconstruction with Transformers. MEsh TRansfOrmer is a simple yet effec

473 Dec 31, 2022

[Preprint] "Chasing Sparsity in Vision Transformers: An End-to-End Exploration" by Tianlong Chen, Yu Cheng, Zhe Gan, Lu Yuan, Lei Zhang, Zhangyang Wang

Chasing Sparsity in Vision Transformers: An End-to-End Exploration Codes for [Preprint] Chasing Sparsity in Vision Transformers: An End-to-End Explora

64 Dec 8, 2022

Source code for "Progressive Transformers for End-to-End Sign Language Production" (ECCV 2020)

Progressive Transformers for End-to-End Sign Language Production Source code for "Progressive Transformers for End-to-End Sign Language Production" (B

58 Dec 21, 2022