Video Instance Segmentation using Inter-Frame Communication Transformers (NeurIPS 2021)


Paper

Video Instance Segmentation using Inter-Frame Communication Transformers


Steps

  1. Installation.

Install the YouTube-VIS API following the link.
Install the repository with the following commands. Follow Detectron2 for details.

git clone https://github.com/sukjunhwang/IFC.git
cd IFC
pip install -e .
  2. Link datasets

COCO

mkdir -p datasets/coco
ln -s /path_to_coco_dataset/annotations datasets/coco/annotations
ln -s /path_to_coco_dataset/train2017 datasets/coco/train2017
ln -s /path_to_coco_dataset/val2017 datasets/coco/val2017

YTVIS 2019

mkdir -p datasets/ytvis_2019
ln -s /path_to_ytvis2019_dataset datasets/ytvis_2019

We expect the ytvis_2019 folder to be structured as follows:

└── ytvis_2019
    ├── train
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── valid
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── test
    │   ├── Annotations
    │   ├── JPEGImages
    │   └── meta.json
    ├── train.json
    ├── valid.json
    └── test.json
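
Before training, it can help to confirm that the symlinked folder matches the tree above. The short Python sketch below is only an illustrative check written for this README (it is not part of the repository); the paths simply mirror the expected layout.

import os

# Illustrative check of the YTVIS 2019 layout shown above (not part of IFC).
root = "datasets/ytvis_2019"
missing = []
for split in ("train", "valid", "test"):
    for sub in ("Annotations", "JPEGImages", "meta.json"):
        path = os.path.join(root, split, sub)
        if not os.path.exists(path):
            missing.append(path)
    json_file = os.path.join(root, split + ".json")
    if not os.path.exists(json_file):
        missing.append(json_file)
print("Missing entries:", missing if missing else "none")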

  3. Training w/ 8 GPUs (if using AdamW and trying to change the batch size, please refer to https://arxiv.org/abs/1711.00489)

  • Our suggestion is to use 8 GPUs.
  • Pretraining on COCO requires >= 16G GPU memory, while finetuning on YTVIS requires less.
python projects/IFC/train_net.py --num-gpus 8 \
    --config-file projects/IFC/configs/base_ytvis.yaml \
    MODEL.WEIGHTS path/to/model.pth
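
If you do change the batch size, the learning rate usually has to change with it. The snippet below is only a hedged illustration of the common linear-scaling heuristic; the reference values are hypothetical and are not the ones shipped in base_ytvis.yaml, and the paper linked above discusses the batch-size/learning-rate trade-off in more depth.

# Illustration only: linear learning-rate scaling when changing the batch size.
# The reference values below are hypothetical, not those in base_ytvis.yaml.
reference_batch_size = 16
reference_lr = 1e-4

def scaled_lr(new_batch_size):
    # Scale the learning rate proportionally to the batch-size change.
    return reference_lr * new_batch_size / reference_batch_size

print(scaled_lr(8))  # 5e-05 when halving the batch size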

  4. Evaluating on YTVIS 2019.
We support multi-GPU evaluation; $F_NUM denotes the window size (the value passed to INPUT.SAMPLING_FRAME_NUM).

python projects/IFC/train_net.py --num-gpus 8 --eval-only \
    --config-file projects/IFC/configs/base_ytvis.yaml \
    MODEL.WEIGHTS path/to/model.pth \
    INPUT.SAMPLING_FRAME_NUM $F_NUM
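
For intuition about what the window size means, the sketch below splits a video into clips of a given number of frames with a hypothetical stride parameter. It is a generic illustration only, not the repository's actual clip-splitting or merging logic.

# Illustration only: enumerate frame windows of size `window_size` over a video.
# The real IFC evaluator may split and merge clips differently.
def clip_windows(num_frames, window_size, stride):
    windows, start = [], 0
    while start < num_frames:
        end = min(start + window_size, num_frames)
        windows.append(list(range(start, end)))
        if end == num_frames:
            break
        start += stride
    return windows

print(clip_windows(num_frames=12, window_size=5, stride=5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]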

Model Checkpoints (YTVIS 2019)

Due to the small size of the YTVIS dataset, scores may fluctuate even when retraining with the same configuration.

Note: The provided checkpoints are the ones with the highest accuracies from multiple training attempts. If you plan to cite IFC and its scores, we suggest referring to the average scores reported in the camera-ready version of the NeurIPS paper.

backbone     stride   FPS     AP     AP50   AP75   AR1    AR10   download
ResNet-50    T=5      46.5    41.6   63.2   45.6   43.6   53.0   model | results
             T=36     107.1   42.8   65.8   46.8   43.8   51.2
ResNet-101   T=36     89.4    44.6   69.2   49.5   44.0   52.1   model | results

License

IFC is released under the Apache 2.0 license.

Citing

If our work is useful in your project, please consider citing us.

@article{hwang2021video,
  title   = {Video Instance Segmentation using Inter-Frame Communication Transformers},
  author  = {Hwang, Sukjun and Heo, Miran and Oh, Seoung Wug and Kim, Seon Joo},
  journal = {arXiv preprint arXiv:2106.03299},
  year    = {2021}
}

Acknowledgement

We highly appreciate all previous works that influenced our project.
Special thanks to facebookresearch for their wonderful, publicly released code (detectron2, DETR).

Comments
  • Use model at inference

    Hey, first things first: great paper! I am currently trying to run your model at inference and therefore used the script demo/demo.py and passed the arguments --config-file ifc_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml -output <path_to_output_file> --video-input <path_to_input_file> --opts MODEL.WEIGHTS detectron2://COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x/138205316/model_final_a3ec72.pkl

    Everything works fine, but I think that's not using your model, right? Putting r101.pth for the WEIGHTS and R101_ytvis.yaml for the config-file does not work ("KeyError: 'Non-existent config key: MODEL.IFC'"). So how can I use your pretrained model at inference just to visualize and test the results?

    documentation 
    opened by oconnor127 7
  • Question about batch size vs num frames

    Hello again,

    I have one last question that I'm still unclear about. In this implementation, the size of the input being fed into the network is (B x C x H x W), with B being the number of frames, correct? Or is it actually (B x F x C x H x W), with F being the number of frames?

    documentation 
    opened by cyrilzakka 4
  • FPS measurement

    Hi, thanks for the amazing work! I wanted to ask how you compute the FPS in the semi-online setup and how it depends on the stride S and the clip_size T. Taking the T=5 & S=1 scenario (the one reported in the main results table), the model takes 5 frames as input at a time, 4 of which overlap from window to window (is this correct?). This means the number of effectively new frame predictions from step to step is just 1, as the other 4 frames are part of the overlap used to compute the matching. With this in mind, how do you compute the FPS? I guess it is not computed using just that 1 effective frame, as then the FPS would be directly proportional to the stride for a fixed clip_size T.

    Thanks a lot for your clarifications!!

    documentation 
    opened by acaelles97 4
  • Questions about Memory

    Thanks for your great work.

    I have two questions about memory_bus and memory_pos.

    The first one: in the paper, the memory tokens help features in different frames communicate with each other. However, in the code, it seems the communication is designed among layers instead. https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/transformer.py#L1

            for layer_idx in range(self.num_layers):
                output = torch.cat((output, memory_bus))
    
                output = self.enc_layers[layer_idx](output, src_mask=mask,
                               src_key_padding_mask=src_key_padding_mask, pos=pos)
                output, memory_bus = output[:hw, :, :], output[hw:, :, :]
    
                memory_bus = memory_bus.view(M, bs, t, c).permute(2,1,0,3).flatten(1,2) # TxBMxC
                memory_bus = self.bus_layers[layer_idx](memory_bus)
                memory_bus = memory_bus.view(t, bs, M, c).permute(2,1,0,3).flatten(1,2) # MxBTxC
    

    The second one: it seems self.memory_bus and self.memory_pos are not updated. Intuitively, I guess it would be helpful if they were updated along with the frames. https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/transformer.py#L66

            self.memory_bus = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
            self.memory_pos = torch.nn.Parameter(torch.randn(num_memory_bus, d_model))
            if num_memory_bus:
                nn.init.kaiming_normal_(self.memory_bus, mode="fan_out", nonlinearity="relu")
                nn.init.kaiming_normal_(self.memory_pos, mode="fan_out", nonlinearity="relu")
    
            self.return_intermediate_dec = return_intermediate_dec
    
            self.d_model = d_model
            self.nhead = nhead
    
        def _reset_parameters(self):
            for p in self.parameters():
                if p.dim() > 1:
                    nn.init.xavier_uniform_(p)
    
        def pad_zero(self, x, pad, dim=0):
            if x is None:
                return None
            pad_shape = list(x.shape)
            pad_shape[dim] = pad
            return torch.cat((x, x.new_zeros(pad_shape)), dim=dim)
    
        def forward(self, src, mask, query_embed, pos_embed, is_train):
            # prepare for enc-dec
            bs = src.shape[0] // self.num_frames if is_train else 1
            t = src.shape[0] // bs
            _, c, h, w = src.shape
    
            memory_bus = self.memory_bus
            memory_pos = self.memory_pos
    
            # encoder
            src = src.view(bs*t, c, h*w).permute(2, 0, 1)               # HW, BT, C
            frame_pos = pos_embed.view(bs*t, c, h*w).permute(2, 0, 1)   # HW, BT, C
            frame_mask = mask.view(bs*t, h*w)                           # BT, HW
    
            src, memory_bus = self.encoder(src, memory_bus, memory_pos, src_key_padding_mask=frame_mask, pos=frame_pos, is_train=is_train)
    
            # decoder
            dec_src = src.view(h*w, bs, t, c).permute(2, 0, 1, 3).flatten(0,1)
            query_embed = query_embed.unsqueeze(1).repeat(1, bs, 1)     # Q, B, C
            tgt = torch.zeros_like(query_embed)
    
            dec_pos = pos_embed.view(bs, t, c, h*w).permute(1, 3, 0, 2).flatten(0,1)
            dec_mask = mask.view(bs, t*h*w)                             # B, THW
    
            clip_hs = self.clip_decoder(tgt, dec_src, memory_bus, memory_pos, memory_key_padding_mask=dec_mask,
                                        pos=dec_pos, query_pos=query_embed, is_train=is_train)
    
            ret_memory = src.permute(1,2,0).reshape(bs*t, c, h, w)
    
            return clip_hs, ret_memory
    

    Do I misunderstand something?

    opened by 9p15p 2
  • Code explanation

    Hello,

    First of all, great paper! I just have one question. Would you mind helping me understand why only the last feature map is used in the transformer? Aren't you losing information by discarding the others?

    https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/models/ifc.py#L65

    documentation 
    opened by cyrilzakka 2
  • How does the pre-training process affect the final performance?

    Thanks for your wonderful work! I noticed that in your paper, before training IFC on the VIS dataset, you first add an extra pretraining stage on the COCO dataset by setting T to 1. This implies the memory tokens and all bus layers are also pretrained during this stage. So I'm wondering how this stage influences the final performance on VIS. If we do not pretrain the memory tokens and bus layers on COCO, what happens to the final performance on the YouTube-VIS dataset? Hoping for your reply, and thank you again.

    documentation 
    opened by DYNreB51Cx 1
  • evaluation error

    After running the following command to evaluate:

    python projects/IFC/train_net.py --num-gpus 8 --eval-only --config-file projects/IFC/configs/base_ytvis.yaml MODEL.WEIGHTS pretrained_weights/coco_r50.pth INPUT.SAMPLING_FRAME_NUM 5
    

    the following error occurred:

      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
        video_output.update(clip_results)
      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
        input_clip.frame_idx] = input_clip.mask_logits[left_idx]
    RuntimeError: shape mismatch: value tensor of shape [100, 5, 45, 80] cannot be broadcast to indexing result of shape [50, 5, 45, 80]
    

    I changed https://github.com/sukjunhwang/IFC/blob/fb2ee4571dba4700eab3b52f10e147225b763e2a/projects/IFC/ifc/structures/clip_output.py#L24 to

    num_max_inst = 100
    

    but the error still occurred when updating the second clip of the video:

      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/ifc.py", line 221, in forward
        video_output.update(clip_results)
      File "/SSD_DISK/users/yanghongye/projects/rvos/IFC/projects/IFC/ifc/structures/clip_output.py", line 103, in update
        input_clip.frame_idx] = input_clip.mask_logits[left_idx]
    RuntimeError: shape mismatch: value tensor of shape [5, 5, 45, 80] cannot be broadcast to indexing result of shape [0, 5, 45, 80]
    

    Could you help me solve it?

    opened by hoyeYang 1