The official implementation of the paper:
Language as Queries for Referring Video Object Segmentation
Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo
Abstract
In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
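For readers new to the dynamic-kernel formulation described above, here is a minimal, illustrative sketch of the idea: object queries are conditioned on a pooled sentence feature, and each resulting per-object embedding is mapped to the weights of a convolution filter that produces a mask from the frame features. All shapes, layer choices, and the 1x1 kernel are assumptions for illustration, not the repository's actual modules.

```python
import torch
import torch.nn.functional as F

num_queries, d_model = 5, 256
sentence_feat = torch.randn(1, d_model)          # pooled language feature (assumed)
query_embed = torch.randn(num_queries, d_model)  # learnable object queries

# Condition every query on the sentence so all queries search for the referred object.
conditioned_queries = query_embed + sentence_feat

# Pretend these are the Transformer decoder outputs for one frame (decoder omitted).
decoder_out = conditioned_queries                # (num_queries, d_model)

# Dynamic kernel head: each query predicts the weights of a 1x1 conv filter.
kernel_head = torch.nn.Linear(d_model, d_model)
dynamic_kernels = kernel_head(decoder_out)       # (num_queries, d_model)

# Frame feature map from the visual backbone / pixel decoder (shape assumed).
feat_map = torch.randn(1, d_model, 72, 128)      # (B, C, H, W)

# Use the predicted kernels as convolution filters to get one mask per query.
masks = F.conv2d(feat_map, dynamic_kernels.view(num_queries, d_model, 1, 1))
print(masks.shape)  # torch.Size([1, 5, 72, 128])
```

Tracking then reduces to linking the query that localizes the referred object across frames, which is what removes the need for a separate association step.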
Requirements
We tested the code in the following environment; other versions may also be compatible:
- CUDA 11.1
- Python 3.7
- PyTorch 1.8.1
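Before installing, a quick sanity check like the one below (not part of the repository) can confirm the environment roughly matches the versions listed above.

```python
# Hypothetical environment check; the expected values mirror the list above.
import sys
import torch

print("Python:", sys.version.split()[0])             # expected around 3.7
print("PyTorch:", torch.__version__)                 # expected around 1.8.1
print("CUDA available:", torch.cuda.is_available())
print("CUDA version (build):", torch.version.cuda)   # expected around 11.1
```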
Installation
Please refer to install.md for installation.
Data Preparation
Please refer to data.md for data preparation.
We provide pretrained models for the different visual backbones. You may download them here and put them in the pretrained_weights directory.
After organizing the files, we expect the directory structure to be the following:
ReferFormer/
├── data/
│ ├── ref-youtube-vos/
│ ├── ref-davis/
│ ├── a2d_sentences/
│ ├── jhmdb_sentences/
├── davis2017/
├── datasets/
├── models/
├── scripts/
├── tools/
├── util/
├── pretrained_weights/
├── eval_davis.py
├── main.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...
Model Zoo
All the models are trained using 8 NVIDIA Tesla V100 GPUs. You may change the --backbone parameter to use different backbones (see here).
Note: If you encounter an OOM error, please add the --use_checkpoint argument (we use this argument for the Swin-L, Video-Swin-S and Video-Swin-B models).
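For context, activation (gradient) checkpointing is the usual memory-saving mechanism behind a flag like --use_checkpoint: intermediate activations are recomputed during the backward pass instead of being stored. Whether ReferFormer wraps exactly these modules is an assumption; the sketch below only demonstrates the mechanism with a toy block.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class HeavyBlock(nn.Module):
    """Stand-in for a memory-hungry backbone/transformer block (hypothetical)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return self.net(x)

block = HeavyBlock()
x = torch.randn(8, 196, 256, requires_grad=True)

# Regular forward: stores intermediate activations for the backward pass.
y_regular = block(x)

# Checkpointed forward: discards activations and recomputes them during backward,
# trading extra compute for lower memory (useful for Swin-L / Video-Swin-B under OOM).
y_checkpointed = checkpoint(block, x)

(y_regular.sum() + y_checkpointed.sum()).backward()
```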
Ref-Youtube-VOS
To evaluate the results, please upload the zip file to the competition server; a packaging sketch follows the table below.
Backbone | J&F | CFBI J&F | Pretrain | Model | Submission | CFBI Submission |
---|---|---|---|---|---|---|
ResNet-50 | 55.6 | 59.4 | weight | model | link | link |
ResNet-101 | 57.3 | 60.3 | weight | model | link | link |
Swin-T | 58.7 | 61.2 | weight | model | link | link |
Swin-L | 62.4 | 63.3 | weight | model | link | link |
Video-Swin-T* | 55.8 | - | - | model | link | - |
Video-Swin-T | 59.4 | - | weight | model | link | - |
Video-Swin-S | 60.1 | - | weight | model | link | - |
Video-Swin-B | 62.9 | - | weight | model | link | - |
* indicates the model is trained from scratch.
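The server expects a zip archive of the predicted annotation masks. The sketch below shows one way to package them; the "Annotations/" layout and file names are assumptions for illustration, so follow the challenge page for the exact required format.

```python
# Illustrative sketch (not repository code) of zipping predicted masks for the
# Ref-Youtube-VOS competition server. The results directory is hypothetical.
import zipfile
from pathlib import Path

results_dir = Path("ytvos_results/Annotations")  # hypothetical inference output folder

with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for png in sorted(results_dir.rglob("*.png")):
        # Store paths relative to the results root, e.g. Annotations/<video>/<exp_id>/<frame>.png
        zf.write(png, png.relative_to(results_dir.parent))
```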
Ref-DAVIS17
As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without finetuning.
Backbone | J&F | J | F | Model |
---|---|---|---|---|
ResNet-50 | 58.5 | 55.8 | 61.3 | model |
Swin-L | 60.5 | 57.6 | 63.4 | model |
Video-Swin-B | 61.1 | 58.1 | 64.1 | model |
A2D-Sentences
The pretrained models are the same as those provided for Ref-Youtube-VOS.
Backbone | Overall IoU | Mean IoU | mAP | Pretrain | Model | Log |
---|---|---|---|---|---|---|
Video-Swin-T | 77.6 | 69.6 | 52.8 | weight | model | log |
Video-Swin-S | 77.7 | 69.8 | 53.9 | weight | model | log |
Video-Swin-B | 78.6 | 70.3 | 55.0 | weight | model | log |
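The two IoU columns are computed differently: Overall IoU pools intersection and union over the whole test set, while Mean IoU averages the per-sample IoU. This reflects the commonly used A2D-Sentences protocol; treat the evaluation code in this repository as authoritative. A small sketch with hypothetical masks:

```python
import numpy as np

# Hypothetical binary prediction / ground-truth masks for 4 samples.
rng = np.random.default_rng(0)
preds = rng.integers(0, 2, size=(4, 64, 64)).astype(bool)
gts = rng.integers(0, 2, size=(4, 64, 64)).astype(bool)

inter = np.logical_and(preds, gts).sum(axis=(1, 2))
union = np.logical_or(preds, gts).sum(axis=(1, 2))

overall_iou = inter.sum() / union.sum()            # pooled over all samples
mean_iou = (inter / np.maximum(union, 1)).mean()   # averaged per sample

print(f"Overall IoU: {overall_iou:.3f}, Mean IoU: {mean_iou:.3f}")
```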
JHMDB-Sentences
As described in the paper, we report the results using the model trained on A2D-Sentences without finetuning.
Backbone | Overall IoU | Mean IoU | mAP | Model |
---|---|---|---|---|
Video-Swin-T | 71.9 | 71.0 | 42.2 | model |
Video-Swin-S | 72.8 | 71.5 | 42.4 | model |
Video-Swin-B | 73.0 | 71.8 | 43.7 | model |
Get Started
Please see Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences for details.
Acknowledgement
This repo is based on Deformable DETR and VisTR. We also refer to the repositories MDETR and MTTR. Thanks for their wonderful work.
Citation
@article{wu2022referformer,
title={Language as Queries for Referring Video Object Segmentation},
author={Jiannan Wu and Yi Jiang and Peize Sun and Zehuan Yuan and Ping Luo},
journal={arXiv preprint arXiv:2201.00487},
year={2022},
}