Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021)


Paper

Video Instance Segmentation using Inter-Frame Communication Transformers


Steps

  1. Installation.

Install the YouTube-VIS API following the link.
Install the repository with the following commands. Follow Detectron2 for details.

git clone https://github.com/sukjunhwang/IFC.git
cd IFC
pip install -e .
  2. Link datasets

COCO

mkdir -p datasets/coco
ln -s /path_to_coco_dataset/annotations datasets/coco/annotations
ln -s /path_to_coco_dataset/train2017 datasets/coco/train2017
ln -s /path_to_coco_dataset/val2017 datasets/coco/val2017

YTVIS 2019

mkdir -p datasets/ytvis_2019
ln -s /path_to_ytvis2019_dataset datasets/ytvis_2019

We expect the ytvis_2019 folder to be structured as follows:

└── ytvis_2019
    ├── train
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── valid
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── test
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── train.json
    ├── valid.json
    └── test.json
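
Before training, it can help to confirm that the symlinked folder matches the tree above. The short Python sketch below is only an illustrative check written for this README (it is not part of the repository); the paths simply mirror the expected layout.

import os

# Illustrative check of the YTVIS 2019 layout shown above (not part of IFC).
root = "datasets/ytvis_2019"
missing = []
for split in ("train", "valid", "test"):
    for sub in ("Annotations", "JPEGImages", "meta.json"):
        path = os.path.join(root, split, sub)
        if not os.path.exists(path):
            missing.append(path)
    json_file = os.path.join(root, split + ".json")
    if not os.path.exists(json_file):
        missing.append(json_file)
print("Missing entries:", missing if missing else "none")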

  3. Training w/ 8 GPUs (if using AdamW and trying to change the batch size, please refer to https://arxiv.org/abs/1711.00489)

  • Our suggestion is to use 8 GPUs.
  • Pretraining on COCO requires >= 16G GPU memory, while finetuning on YTVIS requires less.
python projects/IFC/train_net.py --num-gpus 8 \
    --config-file projects/IFC/configs/base_ytvis.yaml \
    MODEL.WEIGHTS path/to/model.pth
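
If you do change the batch size, the learning rate usually has to change with it. The snippet below is only a hedged illustration of the common linear-scaling heuristic; the reference values are hypothetical and are not the ones shipped in base_ytvis.yaml, and the paper linked above discusses the batch-size/learning-rate trade-off in more depth.

# Illustration only: linear learning-rate scaling when changing the batch size.
# The reference values below are hypothetical, not those in base_ytvis.yaml.
reference_batch_size = 16
reference_lr = 1e-4

def scaled_lr(new_batch_size):
    # Scale the learning rate proportionally to the batch-size change.
    return reference_lr * new_batch_size / reference_batch_size

print(scaled_lr(8))  # 5e-05 when halving the batch size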

  4. Evaluating on YTVIS 2019.
We support multi-GPU evaluation; $F_NUM denotes the window size (the value passed to INPUT.SAMPLING_FRAME_NUM).

python projects/IFC/train_net.py --num-gpus 8 --eval-only \
    --config-file projects/IFC/configs/base_ytvis.yaml \
    MODEL.WEIGHTS path/to/model.pth \
    INPUT.SAMPLING_FRAME_NUM $F_NUM
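
For intuition about what the window size means, the sketch below splits a video into clips of a given number of frames with a hypothetical stride parameter. It is a generic illustration only, not the repository's actual clip-splitting or merging logic.

# Illustration only: enumerate frame windows of size `window_size` over a video.
# The real IFC evaluator may split and merge clips differently.
def clip_windows(num_frames, window_size, stride):
    windows, start = [], 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start += stride
    return windows

print(clip_windows(num_frames=12, window_size=5, stride=5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]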

Model Checkpoints (YTVIS 2019)

Due to the small size of the YTVIS dataset, scores may fluctuate even when retraining with the same configuration.

Note: The provided checkpoints are the ones with the highest accuracies from multiple training attempts. If you plan to cite IFC and its scores, we suggest referring to the average scores reported in the camera-ready version of the NeurIPS paper.

backbone     stride   FPS     AP     AP50   AP75   AR1    AR10   download
ResNet-50    T=5      46.5    41.6   63.2   45.6   43.6   53.0   model | results
             T=36     107.1   42.8   65.8   46.8   43.8   51.2
ResNet-101   T=36     89.4    44.6   69.2   49.5   44.0   52.1   model | results

License

IFC is released under the Apache 2.0 license.

Citing

If our work is useful in your project, please consider citing us.

@article{hwang2021video,
  title   = {Video Instance Segmentation using Inter-Frame Communication Transformers},
  author  = {Hwang, Sukjun and Heo, Miran and Oh, Seoung Wug and Kim, Seon Joo},
  journal = {arXiv preprint arXiv:2106.03299},
  year    = {2021}
}

Acknowledgement

We highly appreciate all previous works that influenced our project.
Special thanks to facebookresearch for their wonderful, publicly released code (detectron2, DETR).

Comments
  • Use model at inference

    Hey, first things first: great paper! I am currently trying to run your model at inference and therefore used the script demo/demo.py and passed the arguments --config-file ifc_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml -output <path_to_output_file> --video-input <path_to_input_file> --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x/138205316/model_final_a3ec72.pkl

    Everything works fine, but I think that's not using your model, right? Putting r101.pth for the WEIGHTS and R101_ytvis.yaml for the config-file does not work ("KeyError: 'Non-existent config key: MODEL.IFC'"). So how can I use your pretrained model at inference just to visualize and test the results?

    documentation 
    opened by oconnor127 7
  • Question about batch size vs num frames

    Hello again,

    I have one last question that I'm still unclear about. In this implementation, the size of the input being fed into the network is (B x C x H x W), with B being the number of frames, correct? Or is it actually (B x F x C x H x W), with F being the number of frames?

    documentation 
    opened by cyrilzakka 4
  • FPS measurement

    Hi, thanks for the amazing work! I wanted to ask how you compute the FPS in the semi-online setup and how it depends on the stride S and the clip_size T. Taking the T=5 & S=1 scenario (the one reported in the main results table), the model takes 5 frames as input at a time, 4 of which overlap from window to window (is this correct?). This means the number of effectively new frame predictions from step to step is just 1, as the other 4 frames are part of the overlap used to compute the matching. With this in mind, how do you compute the FPS? I guess it is not computed using just that 1 effective frame, as then the FPS would be directly proportional to the stride for a fixed clip_size T.

    Thanks a lot for your clarifications!!

    documentation 
    opened by acaelles97 4
  • Questions about Memory

    Thanks for your great work.

    I have two questions about memory_bus and memory_pos.

    The first one: in the paper, the memory tokens help features in different frames communicate with each other. However, in the code, it seems the communication is designed among layers instead. https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/transformer.py#L1

            for layer_idx in range(self.num_layers):
                output = torch.cat((output, memory_bus))
    
                output = self.enc_layers[layer_idx](output, src_mask=mask,
                               src_key_padding_mask=src_key_padding_mask, pos=pos)
                output, memory_bus = output[:hw, :, :], output[hw:, :, :]
    
                memory_bus = memory_bus.view(M, bs, t, c).permute(2,1,0,3).flatten(1,2) # TxBMxC
                memory_bus = self.bus_layers[layer_idx](memory_bus)
                memory_bus = memory_bus.view(t, bs, M, c).permute(2,1,0,3).flatten(1,2) # MxBTxC
    

    The second one: it seems self.memory_bus and self.memory_pos are not updated. Intuitively, I guess it would be helpful if they were updated along with the frames. https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/transformer.py#L66

            self.memory_bus = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
            self.memory_pos = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
            if num_memory_bus:
                nn.init.kaiming_normal_(self.memory_bus, mode="fan_out", nonlinearity="relu")
                nn.init.kaiming_normal_(self.memory_pos, mode="fan_out", nonlinearity="relu")
    
            self.return_intermediate_dec = return_intermediate_dec
    
            self.d_model = d_model
            self.nhead = nhead
    
        def _reset_parameters(self):
            for p in self.parameters():
                if p.dim() > 1:
                    nn.init.xavier_uniform_(p)
    
        def pad_zero(self, x, pad, dim=0):
            if x is None:
                return None
            pad_shape = list(x.shape)
            pad_shape[dim] = pad
            return torch.cat((x, x.new_zeros(pad_shape)), dim=dim)
    
        def forward(self, src, mask, query_embed, pos_embed, is_train):
            # prepare for enc-dec
            bs = src.shape[0] // self.num_frames if is_train else 1
            t = src.shape[0] // bs
            _, c, h, w = src.shape
    
            memory_bus = self.memory_bus
            memory_pos = self.memory_pos
    
            # encoder
            src = src.view(bs*t, c, h*w).permute(2, 0, 1)               # HW, BT, C
            frame_pos = pos_embed.view(bs*t, c, h*w).permute(2, 0, 1)   # HW, BT, C
            frame_mask = mask.view(bs*t, h*w)                           # BT, HW
    
            src, memory_bus = self.encoder(src, memory_bus, memory_pos, src_key_padding_mask=frame_mask, pos=frame_pos, is_train=is_train)
    
            # decoder
            dec_src = src.view(h*w, bs, t, c).permute(2, 0, 1, 3).flatten(0,1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)     # Q, B, C
            tgt = torch.zeros_like(query_embed)
    
            dec_pos = pos_embed.view(bs, t, c, h*w).permute(1, 3, 0, 2).flatten(0,1)
            dec_mask = mask.view(bs, t*h*w)                             # B, THW
    
            clip_hs = self.clip_decoder(tgt, dec_src, memory_bus, memory_pos, memory_key_padding_mask=dec_mask,
                                        pos=dec_pos, query_pos=query_embed, is_train=is_train)
    
            ret_memory = src.permute(1,2,0).reshape(bs*t, c, h, w)
    
            return clip_hs, ret_memory
    

    Do I misunderstand something?

    opened by 9p15p 2
  • Code explanation

    Hello,

    First of all, great paper! I just have one question. Would you mind helping me understand why only the last feature map is used in the transformer? Aren't you losing information by discarding the others?

    https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/ifc.py#L65

    documentation 
    opened by cyrilzakka 2
  • How does the pre-training process affect the final performance?

    Thanks for your wonderful work! I noticed that in your paper, before training IFC on the VIS dataset, you first add an extra pretraining stage on the COCO dataset by setting T to 1. This implies the memory tokens and all bus layers are also pretrained during this stage. So I'm wondering how this stage influences the final performance on VIS. If we do not pretrain the memory tokens and bus layers on COCO, what happens to the final performance on the YouTube-VIS dataset? Hoping for your reply, and thank you again.

    documentation 
    opened by DYNreB51Cx 1
  • evaluation error

    After running the following command to evaluate:

    python projects/IFC/train_net.py --num-gpus 8 --eval-only --config-file projects/IFC/configs/base_ytvis.yaml MODEL.WEIGHTS pretrained_weights/coco_r50.pth INPUT.SAMPLING_FRAME_NUM 5
    

    the following error occurred:

      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
        video_output.update(clip_results)
      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
        input_clip.frame_idx] = input_clip.mask_logits[left_idx]
    RuntimeError: shape mismatch: value tensor of shape [100, 5, 45, 80] cannot be broadcast to indexing result of shape [50, 5, 45, 80]
    

    I changed https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/structures/clip_output.py#L24 to

    num_max_inst = 100
    

    but the error still occurred when updating the second clip of the video:

      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
        video_output.update(clip_results)
      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
        input_clip.frame_idx] = input_clip.mask_logits[left_idx]
    RuntimeError: shape mismatch: value tensor of shape [5, 5, 45, 80] cannot be broadcast to indexing result of shape [0, 5, 45, 80]
    

    Could you help me solve it?

    opened by hoyeYang 1