SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation

Junfeng Wu, Yi Jiang, Wenqing Zhang, Xiang Bai, Song Bai

arXiv 2112.08275

Abstract

In this work, we present SeqFormer, a frustratingly simple model for video instance segmentation. SeqFormer follows the principle of vision transformers, modeling instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms should be performed on each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally, without tracking branches or post-processing. On the YouTube-VIS dataset, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone, without bells and whistles. This exceeds the previous state-of-the-art performance by 4.6 and 4.4 AP, respectively. In addition, integrated with the recently proposed Swin transformer, SeqFormer achieves a much higher AP of 59.3. We hope SeqFormer could be a strong baseline that fosters future research in video instance segmentation and, in the meantime, advances this field with a more robust, accurate, and neat model.
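To make the idea above concrete, below is a minimal PyTorch sketch of the shared-query design described in the abstract: one set of video-level instance queries attends to every frame independently, the per-frame outputs are aggregated over time, and the aggregated embedding predicts dynamic mask weights. All names, shapes, and the simple mean aggregation are illustrative assumptions for exposition and do not mirror the actual SeqFormer implementation.

import torch
import torch.nn as nn

class SharedQueryDecoderSketch(nn.Module):
    """Illustrative sketch only; not the real SeqFormer decoder."""
    def __init__(self, num_queries=10, dim=256):
        super().__init__()
        # One set of instance queries shared by every frame of the video.
        self.instance_queries = nn.Embedding(num_queries, dim)
        # Attention is applied to each frame independently (frames act as the batch dim).
        self.frame_attn = nn.MultiheadAttention(dim, num_heads=8)
        # Dynamic mask weights are predicted from the video-level instance embedding.
        self.mask_head = nn.Linear(dim, dim)

    def forward(self, frame_features):
        # frame_features: (HW, T, dim) -- flattened features of T frames.
        T = frame_features.shape[1]
        q = self.instance_queries.weight.unsqueeze(1).expand(-1, T, -1)    # (N, T, dim)
        per_frame, _ = self.frame_attn(q, frame_features, frame_features)  # (N, T, dim)
        video_level = per_frame.mean(dim=1)                                # (N, dim), temporal aggregation
        mask_weights = self.mask_head(video_level)                         # (N, dim)
        # Dot-product the video-level mask weights with every frame's features.
        masks = torch.einsum("nd,htd->tnh", mask_weights, frame_features)  # (T, N, HW)
        return masks

decoder = SharedQueryDecoderSketch()
feats = torch.randn(49, 5, 256)   # 49 spatial locations, 5 frames
print(decoder(feats).shape)       # torch.Size([5, 10, 49])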

Visualization results on YouTube-VIS 2019 valid set

Installation

First, clone the repository locally:

git clone https://github.com/wjf5203/SeqFormer.git

Then, install PyTorch 1.7 and torchvision 0.8.

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 -c pytorch
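Optionally, a quick sanity check (not part of the original instructions) confirms the expected versions are active and that CUDA is visible:

# Environment sanity check for the install above.
import torch
import torchvision

print("torch:", torch.__version__)               # expected 1.7.1
print("torchvision:", torchvision.__version__)   # expected 0.8.2
print("CUDA available:", torch.cuda.is_available())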

Install dependencies and pycocotools for VIS:

pip install -r requirements.txt
pip install git+https://github.com/youtubevos/cocoapi.git#"egg=pycocotools&subdirectory=PythonAPI"

Compiling CUDA operators:

cd ./models/ops
sh ./make.sh
# unit test (should see all checking is True)
python test.py
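If the unit test reports all checks as True, the deformable attention CUDA extension built correctly. As an extra optional check, the compiled extension can be imported directly; the module name below is the one used by the upstream Deformable DETR ops this repo builds on, so treat it as an assumption and adjust it if your build names it differently.

# Optional: verify that the compiled CUDA extension is importable.
# Assumes the extension name from Deformable DETR's ops ("MultiScaleDeformableAttention").
import torch
import MultiScaleDeformableAttention as MSDA

print("CUDA available:", torch.cuda.is_available())
print("Extension loaded from:", MSDA.__file__)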

Data Preparation

Download and extract the 2019 version of the YouTube-VIS train and val images with annotations from CodaLab or YouTubeVIS, and download the COCO 2017 dataset. We expect the directory structure to be the following:

SeqFormer
├── datasets
│   ├── coco_keepfor_ytvis19.json
...
ytvis
├── train
├── val
├── annotations
│   ├── instances_train_sub.json
│   ├── instances_val_sub.json
coco
├── train2017
├── val2017
├── annotations
│   ├── instances_train2017.json
│   ├── instances_val2017.json

The modified COCO annotation file 'coco_keepfor_ytvis19.json' for joint training can be downloaded from [google].
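A quick way to confirm that the annotation files are in place and readable is to load them with the standard json module. The paths below follow the directory tree above and are assumptions; adjust them if your ytvis and coco folders live elsewhere.

# Minimal sanity check for the dataset layout sketched above.
import json
import os

paths = [
    "ytvis/annotations/instances_train_sub.json",
    "coco/annotations/instances_train2017.json",
]

for path in paths:
    if not os.path.isfile(path):
        print(f"missing: {path}")
        continue
    with open(path) as f:
        data = json.load(f)
    # YouTube-VIS annotations list "videos"; COCO annotations list "images".
    items = data.get("videos", data.get("images", []))
    print(f"{path}: {len(items)} videos/images, "
          f"{len(data.get('annotations', []))} annotations, "
          f"{len(data.get('categories', []))} categories")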

Model zoo

Ablation model

Train on YouTube-VIS 2019, evaluate on YouTube-VIS 2019.

Model                         AP    AP50  AP75  AR1   AR10
SeqFormer_ablation [google]   45.1  66.9  50.5  45.6  54.6

YouTube-VIS model

Train on YouTube-VIS 2019 and COCO, evaluate on YouTube-VIS 2019 val set.

Model                         AP    AP50  AP75  AR1   AR10  Pretrain
SeqFormer_r50 [google]        47.4  69.8  51.8  45.5  54.8  weight
SeqFormer_r101 [google]       49.0  71.1  55.7  46.8  56.9  weight
SeqFormer_x101 [google]       51.2  75.3  58.0  46.5  57.3  weight
SeqFormer_swin_L [google]     59.3  82.1  66.4  51.7  64.4  weight

Training

We performed all experiments on NVIDIA Tesla V100 GPUs. All SeqFormer models are trained with a total batch size of 32.

To train SeqFormer on YouTube-VIS 2019 with 8 GPUs, run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer_ablation.sh

To train SeqFormer on YouTube-VIS 2019 and COCO 2017 jointly, run:

GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 8 ./configs/r50_seqformer.sh

To train SeqFormer_swin_L on multiple nodes, run:

On node 1:

MASTER_ADDR=<IP address of node 1> NODE_RANK=0 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

On node 2:

MASTER_ADDR=<IP address of node 1> NODE_RANK=1 GPUS_PER_NODE=8 ./tools/run_dist_launch.sh 16 ./configs/swin_seqformer.sh

Inference & Evaluation

Evaluating on YouTube-VIS 2019:

python3 inference.py  --masks --backbone [backbone] --model_path /path/to/model_weights --save_path results.json 

To get quantitative results, please zip the JSON file and upload it to the CodaLab server, e.g. as in the snippet below.
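For reference, a minimal way to package the output for submission, assuming the server expects a zip archive that contains the results.json file:

# Package results.json into results.zip for the CodaLab submission.
import zipfile

with zipfile.ZipFile("results.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("results.json")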

Citation

@article{wu2021seqformer,
      title={SeqFormer: a Frustratingly Simple Model for Video Instance Segmentation}, 
      author={Junfeng Wu and Yi Jiang and Wenqing Zhang and Xiang Bai and Song Bai},
      journal={arXiv preprint arXiv:2112.08275},
      year={2021},
}

Acknowledgement

This repo is based on Deformable DETR and VisTR. Thanks for their wonderful work.

Comments
  • Why 42 classes?

    First of all, congratulations on the nice work! I wanted to ask why the number of classes is 42 if YouTube-VIS only has 40 classes. One extra class is used for the background, but what about the other one?

    I also don't understand why you include the background class if you use focal loss. The original Deformable DETR focal-loss implementation ignores background because it is basically given by the sigmoid probabilities for all the classes being < 0.5.

    Thanks a lot for your help!

    opened by acaelles97 3
  • How to get the pretrained weight --pretrain_weights weights/r50_weight.pth

    I guess that "--pretrain_weights weights/r50_weight.pth" means the weight pretrained on COCO, but I cannot find it in your repo. Could you upload your weights? Thanks.

    opened by wangbo-zhao 2
  • OOM when training SeqFormer_swin_L on YouTube-VIS 2019 and COCO

    Hi, thank you for your interesting work! I was trying to run your code, but I meet OOM when training SeqFormer_swin_L on YouTube-VIS 2019 and COCO with your given script and command. I use 2 nodes and each node contains 8 V100 cards. Did I do something wrong?

    opened by JiaDingCN 2
  • Why frustratingly?

    For example: Frustratingly Simple Few-Shot Object Detection; Frustratingly Simple Domain Generalization via Image Stylization.

    I'm just wondering what's the meaning of 'Frustratingly'...

    opened by ykk648 2
  • Hi, can you please release the remaining pretrained weights of Swin Transformer?

    It seems the pretrained weights of the Swin variants ['swin_t_p4w7', 'swin_s_p4w7', 'swin_b_p4w7', 'swin_l_p4w7', 'swin_l_p4w12'] are not provided. Can you kindly release these pretrained weights? Thanks. : )

    opened by xiaocc612 1
  • The test score is 0.0

    I am using the SeqFormer you provided, and thank you. When I execute the test code and upload the results.zip to the CodaLab server, the score is 0.0. The following is the command I used: python3 inference.py --masks --backbone resnet50 --model_path weights/r50_weight.pth --save_path results.json. I used SeqFormer_ablation's .pth file downloaded from the model zoo. If I'm doing something wrong, please let me know. Thank you!

    opened by jangeunha1119 1
  • About GPU RAM requirements

    Not an issue, just asking about hardware requirements.

    I am following your work to do VIS research, but I ran into GPU memory limitations. First, I ran SeqFormer/models/ops/test.py, but after a few seconds GPU memory was exhausted. Then I ran inference.py; everything went well at the beginning, but when processing the 295th video the GPU memory was exhausted again. My machine is an NVIDIA TITAN Xp. Can you tell me how much GPU RAM is required during inference and when running SeqFormer/models/ops/test.py?

    opened by wenhe-jia 1
  • format issue of the released r50 model weight

    Impressive work on VIS. I met problems in evaluating phase. Any ideas are welcome.

    It seems the released r50 pretrained model cannot be directly used to evaluate the YTVIS dataset, since its class head may have been trained on COCO.

    After aligning the class head output dimensionality of the model to the released one, it seems inference still has one issue. I am not sure how to configure the code to address it.

    opened by shepnerd 1
  • An error occurs when training SeqFormer on YouTube-VIS 2019 and COCO 2017 jointly

    Hi Junfeng, thanks for your excellent work! I meet a problem when I train SeqFormer on YouTube-VIS 2019 and COCO 2017 jointly. Here is the error information:

        Traceback (most recent call last):
          File "main.py", line 331, in <module>
            main(args)
          File "main.py", line 278, in main
            model, criterion, data_loader_train, optimizer, device, epoch, args.clip_max_norm)
          File "/data/liangzhiyuan/projects/SeqFormer/engine.py", line 48, in train_one_epoch
            outputs, loss_dict = model(samples, targets, criterion, train=True)
          File "/home/liangzhiyuan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
            result = self.forward(*input, **kwargs)
          File "/home/liangzhiyuan/.local/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 619, in forward
            output = self.module(*inputs[0], **kwargs[0])
          File "/home/liangzhiyuan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
            result = self.forward(*input, **kwargs)
          File "/data/liangzhiyuan/projects/SeqFormer/models/segmentation.py", line 166, in forward
            indices = criterion.matcher(outputs_layer, gt_targets, self.detr.num_frames, valid_ratios)
          File "/home/liangzhiyuan/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
            result = self.forward(*input, **kwargs)
          File "/data/liangzhiyuan/projects/SeqFormer/models/matcher.py", line 113, in forward
            indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
          File "/data/liangzhiyuan/projects/SeqFormer/models/matcher.py", line 113, in <listcomp>
            indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
          File "/usr/local/lib/python3.6/dist-packages/scipy/optimize/_lsap.py", line 93, in linear_sum_assignment
            raise ValueError("matrix contains invalid numeric entries")
        ValueError: matrix contains invalid numeric entries

    It seems that some values of C are nan or inf. Do you meet this problem during training? BTW, the training process using just the YouTube-VIS 2019 dataset works well in my setting.

    opened by liangzhiyuanCV 1
  • Remove find_unused_parameters

    This PR removes the self.output_proj_box from MSDeformAttn if the mode is decode, which allows running torch.nn.parallel.DistributedDataParallel without find_unused_parameters=True. In theory, this should improve training time as the torch backend avoids a forward pass for finding these parameters.

    Edit: Just realised that my editor also removed several white spaces in the respective files. I hope this is okay.

    opened by timmeinhardt 0
  • Default inference paths wrong

    Following the README.md, the ytvis dataset folder will be in the root directory of this repository. Executing the inference.py script in the same directory will cause an error, as it expects ytvis to be in a parent directory:

    https://github.com/wjf5203/SeqFormer/blob/edbfba4503d69b351b111906336498ef9dbce70c/inference.py#L108

    opened by timmeinhardt 0
  • How about the performance when replacing Deformable DETR with the original DETR?

    Hi, thanks for your good work. I want to know the performance when only using the original DETR instead of the improved Deformable DETR, for a fair comparison with the IFC paper.

    opened by jiangzhengkai 0
  • Not able to reproduce the results

    Congrats on the awesome work.

    I am trying to reproduce the results for the ResNet-50 backbone. I tried the following:

    1. Train SeqFormer on the COCO dataset (with num_frames=1) for 24 epochs
    2. Train SeqFormer on coco+ytvis and on ytvis using the COCO pretrained weights

    Still, I am not able to reproduce the reported numbers.

    Can you please help me out with this?

    Thanks,

    opened by OmkarThawakar 2