ReferFormer - Official Implementation

Overview

The official implementation of the paper:

Language as Queries for Referring Video Object Segmentation

Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, Ping Luo

Abstract

In this work, we propose a simple and unified framework built upon Transformer, termed ReferFormer. It views the language as queries and directly attends to the most relevant regions in the video frames. Concretely, we introduce a small set of object queries conditioned on the language as the input to the Transformer. In this manner, all the queries are obligated to find the referred objects only. They are eventually transformed into dynamic kernels which capture the crucial object-level information, and play the role of convolution filters to generate the segmentation masks from feature maps. The object tracking is achieved naturally by linking the corresponding queries across frames. This mechanism greatly simplifies the pipeline and the end-to-end framework is significantly different from the previous methods. Extensive experiments on Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences show the effectiveness of ReferFormer.
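
As a rough illustration of this mechanism, the sketch below turns per-frame, language-conditioned queries into dynamic 1x1 kernels and applies them to frame features to produce mask logits. All tensor names, shapes and the linear kernel head are assumptions for exposition, not the repository's actual modules.

import torch
import torch.nn as nn

# Illustrative sketch of the "queries -> dynamic kernels -> masks" idea.
# Shapes and module names are assumptions, not ReferFormer's exact code.
B, T, N, C, H, W = 2, 5, 5, 256, 64, 64      # batch, frames, queries, channels, feature size

feat = torch.randn(B, T, C, H, W)            # per-frame feature maps
queries = torch.randn(B, T, N, C)            # language-conditioned object queries per frame

# A small head predicts one 1x1 conv kernel per query from its embedding.
kernel_head = nn.Linear(C, C)
kernels = kernel_head(queries)               # (B, T, N, C)

# Each kernel acts as a convolution filter over its frame's feature map,
# giving one mask per query; tracking is implicit because query i refers
# to the same instance in every frame.
mask_logits = torch.einsum('btchw,btnc->btnhw', feat, kernels)
masks = mask_logits.sigmoid()                # (B, T, N, H, W)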

Requirements

We tested the code in the following environment; other versions may also be compatible (a quick sanity check is sketched after this list):

  • CUDA 11.1
  • Python 3.7
  • Pytorch 1.8.1
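
A quick way to confirm that an environment roughly matches the tested one is the small check below; the version comments mirror the list above and are not strict requirements.

import sys
import torch

# Print the versions relevant to the tested environment (other versions may work too).
print("Python :", sys.version.split()[0])    # tested: 3.7
print("PyTorch:", torch.__version__)         # tested: 1.8.1
print("CUDA   :", torch.version.cuda)        # tested: 11.1
print("GPU ok :", torch.cuda.is_available())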

Installation

Please refer to install.md for installation.

Data Preparation

Please refer to data.md for data preparation.

We provide pretrained models for different visual backbones. You may download them here and put them in the directory pretrained_weights.

After organizing the data, we expect the directory structure to be the following:

ReferFormer/
├── data/
│   ├── ref-youtube-vos/
│   ├── ref-davis/
│   ├── a2d_sentences/
│   ├── jhmdb_sentences/
├── davis2017/
├── datasets/
├── models/
├── scripts/
├── tools/
├── util/
├── pretrained_weights/
├── eval_davis.py
├── main.py
├── engine.py
├── inference_ytvos.py
├── inference_davis.py
├── opts.py
...

Model Zoo

All the models are trained using 8 NVIDIA Tesla V100 GPUs. You may change the --backbone parameter to use different backbones (see here).

Note: If you encounter an out-of-memory (OOM) error, please add the --use_checkpoint argument (we use this option for the Swin-L, Video-Swin-S and Video-Swin-B models).
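
For background, --use_checkpoint enables activation (gradient) checkpointing, which recomputes intermediate activations in the backward pass instead of storing them, trading extra compute for lower memory. The snippet below is a generic PyTorch illustration of that mechanism with a toy Block module; it is not ReferFormer's implementation.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Toy block used only to demonstrate activation checkpointing."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, use_checkpoint=False):
        if use_checkpoint and self.training:
            # Activations inside self.net are recomputed during backward.
            return checkpoint(self.net, x)
        return self.net(x)

x = torch.randn(8, 256, requires_grad=True)
y = Block()(x, use_checkpoint=True)          # lower memory, one extra forward pass
y.sum().backward()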

Ref-Youtube-VOS

To evaluate the results, please upload the zip file to the competition server.

| Backbone | J&F | CFBI J&F | Pretrain | Model | Submission | CFBI Submission |
| ResNet-50 | 55.6 | 59.4 | weight | model | link | link |
| ResNet-101 | 57.3 | 60.3 | weight | model | link | link |
| Swin-T | 58.7 | 61.2 | weight | model | link | link |
| Swin-L | 62.4 | 63.3 | weight | model | link | link |
| Video-Swin-T* | 55.8 | - | - | model | link | - |
| Video-Swin-T | 59.4 | - | weight | model | link | - |
| Video-Swin-S | 60.1 | - | weight | model | link | - |
| Video-Swin-B | 62.9 | - | weight | model | link | - |

* indicates the model is trained from scratch.
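
For the submission mentioned above, the required archive layout is defined by the competition server. As a rough sketch only, assuming the inference script has written per-video, per-expression PNG masks under a hypothetical Annotations/ directory, the zip file can be built with the standard library:

import zipfile
from pathlib import Path

# Hypothetical layout: Annotations/<video_id>/<expression_id>/<frame>.png
# (check the competition page for the exact structure it expects).
root = Path("Annotations")
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for png in sorted(root.rglob("*.png")):
        zf.write(png, arcname=png.as_posix())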

Ref-DAVIS17

As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without finetuning.

| Backbone | J&F | J | F | Model |
| ResNet-50 | 58.5 | 55.8 | 61.3 | model |
| Swin-L | 60.5 | 57.6 | 63.4 | model |
| Video-Swin-B | 61.1 | 58.1 | 64.1 | model |

A2D-Sentences

The pretrained models are the same as those provided for Ref-Youtube-VOS.

| Backbone | Overall IoU | Mean IoU | mAP | Pretrain | Model |
| Video-Swin-T | 77.6 | 69.6 | 52.8 | weight | model / log |
| Video-Swin-S | 77.7 | 69.8 | 53.9 | weight | model / log |
| Video-Swin-B | 78.6 | 70.3 | 55.0 | weight | model / log |

JHMDB-Sentences

As described in the paper, we report the results using the model trained on A2D-Sentences without finetuning.

| Backbone | Overall IoU | Mean IoU | mAP | Model |
| Video-Swin-T | 71.9 | 71.0 | 42.2 | model |
| Video-Swin-S | 72.8 | 71.5 | 42.4 | model |
| Video-Swin-B | 73.0 | 71.8 | 43.7 | model |

Get Started

Please see Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences for details.

Acknowledgement

This repo is based on Deformable DETR and VisTR. We also refer to the repositories MDETR and MTTR. Thanks for their wonderful work.

Citation

@article{wu2022referformer,
      title={Language as Queries for Referring Video Object Segmentation}, 
      author={Jiannan Wu and Yi Jiang and Peize Sun and Zehuan Yuan and Ping Luo},
      journal={arXiv preprint arXiv:2201.00487},
      year={2022},
}
Comments
  • joint training hyper parameters

    Hi,

Thank you for sharing your work. I am writing to inquire about the hyperparameters used for joint training. The arXiv paper mentions that joint training uses 32 V100 GPUs and 2 video clips per GPU.

I take this to mean 32 GB V100 GPUs, but I don't think it is possible to fit 2 video clips within 32 GB of memory. I cannot reproduce the result using 8 V100 32 GB GPUs with 1 clip per GPU; could you give me some advice? Thank you!

    opened by lxa9867 9
  • No meta.json in ref-youtube-vos dataset

Hi, I downloaded and unzipped the youtube_vos datasets as guided, but there is no meta.json in the data/ref-youtube-vos/train folder and the code needs it. What should I do? Thanks.

    opened by zhuyan129 4
  • Issue w.r.t pretraining models

    Thank you for releasing the codes of ReferFormer and the following update of the pretraining code.

Can you please also release the scripts for the pre-training process? I have tried to use the hyperparameters mentioned in the paper (such as the multi-step LR scheduler). However, the released code uses a StepLR scheduler rather than a MultiStep one (the two are contrasted in the sketch after this comment), and the run got stuck and failed. As such, I'm wondering whether the released pretraining code needs a special setup. The pretraining process consumes a lot of computational resources and I don't want to waste any GPU time. It would be appreciated if you could help with this.

    Thanks in advance.

    opened by youthHan 3
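
    For reference, the two schedulers contrasted in this comment behave as follows in PyTorch; this is a generic illustration, not the released training configuration.

    import torch
    from torch.optim.lr_scheduler import StepLR, MultiStepLR

    # StepLR decays the LR by gamma every fixed step_size epochs;
    # MultiStepLR decays only at the explicitly listed milestones.
    p1 = [torch.nn.Parameter(torch.zeros(1))]
    p2 = [torch.nn.Parameter(torch.zeros(1))]
    opt1 = torch.optim.SGD(p1, lr=1e-4)
    opt2 = torch.optim.SGD(p2, lr=1e-4)
    sched1 = StepLR(opt1, step_size=3, gamma=0.1)             # drops at epochs 3, 6, 9, ...
    sched2 = MultiStepLR(opt2, milestones=[6, 8], gamma=0.1)  # drops at epochs 6 and 8

    for epoch in range(10):
        opt1.step(); opt2.step()
        sched1.step(); sched2.step()
        print(epoch, sched1.get_last_lr()[0], sched2.get_last_lr()[0])
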
  • Joint Training Settings

Thank you for your great work! I ran into a problem when reproducing your joint-training results. Do any backbone weights need to be loaded during joint training, e.g. video-swin-b-kinetics400-22k? I noticed that there is no option to load any weights in the published joint-training script. Thanks!

    opened by lszw11 2
  • Cannot compile cuda version of MSDeformAttnFunction.

Hi! I tried to compile MSDeformAttnFunction with CUDA 11.5.0, and every time I compile it, the build breaks and doesn't recognize the CUDA module. Is there any other way I can do it?

    opened by subramanya1997 2
  • No train.zip in ref-youtube-vos dataset

Hi, there is no train.zip on the competition website (https://competitions.codalab.org/competitions/29139#participate-get_data), so I can't get the training data. Could you share a copy? Thanks.

    opened by zhuyan129 2
  • Question of Pretraining on RefCOCO/+/g train/val.json

Hi! I'm glad that you released the pretraining code for the RefCOCO datasets. In datasets/refexp.py, Line 168/169, you divide the original json files into train/val splits. However, I downloaded the json files from refer, and they use several different splits such as 'unc', 'berkeley' and 'google'. Could you please share your RefCOCO/+/g train/val json files?

    Thanks a lot!

    opened by YRlin-12 2
  • Reproducing training, recommended hardware setup

    Hi,

    Thanks for releasing the great code! I'm trying to reproduce the training, for now I start with the pre-trained model and just do the fine-tuning on YouTube-VOS. What is the recommended number of GPUs and what run-time should I expect? Is the currently released code able to support multi-node training? So far, I was able to run the YouTube-VOS training on a single machine with 4xA100 GPU and it took ~2 hours per epoch, so 12 hours in total. Please let me know about the recommended hardware setup for the YouTube-VOS fine-tuning and also for the pre-training step (I think for this maybe not all code is released yet?). And if I use a different number of GPUs, can I expect the same result quality, just longer run-time, or will this maybe lead to problems/worse results?

    Thank you!

    Best,

    Paul

    opened by pvoigtlaender 2
  • Where to get the test meta file?

    Hi,

thanks for releasing the great code! I understand that there have been some changes in the split of the validation and test set, and it seems that the meta file for the test set is also no longer available for download on CodaLab. Hence, I get an error in this line: https://github.com/wjn922/ReferFormer/blob/8024da12cc84cf18a34094c51a156f940fe224b4/inference_ytvos.py#L79 What is the best way to deal with this? Maybe you could please share the json meta file?

    Best,

    Paul

    opened by pvoigtlaender 2
  • Pretraining on RefCOCO Dataset

    Hi Jonas,

    Thanks for your contribution! I noticed that the work supports pretraining on the refcoco dataset. From the code, it seems that it uses a coco-format refcoco dataset for data loading. Is it possible to provide some information about the pretraining process and the dataset conversion tools for refcoco dataset? Thanks in advance!

    opened by ntuLC 2
  • About sentence feature

Hello, the paper states that the sentence feature is obtained by pooling the text features. However, when reading your code, I saw that the sentence feature actually comes from the pooler_output of the RoBERTa model. According to https://huggingface.co/transformers/v2.9.1/model_doc/roberta.html#robertamodel, the pooler_output has a different meaning than pooling over the text features (both options are sketched after this comment). Have you tried actually pooling the text features to get the sentence-level feature? Is it worse than the current approach?

    Thank you

    opened by npmhung 1
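
    For context on the distinction raised in this comment, the two sentence-feature choices differ as sketched below with HuggingFace Transformers; this is a generic illustration, not a statement about which option ReferFormer uses or should use.

    import torch
    from transformers import RobertaModel, RobertaTokenizerFast

    tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
    model = RobertaModel.from_pretrained("roberta-base").eval()

    inputs = tokenizer(["a person riding a skateboard"], return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)

    # Option A: pooler_output = a dense layer + tanh applied to the first (<s>) token.
    sent_a = out.pooler_output                                    # (1, 768)

    # Option B: mask-aware mean pooling over all token hidden states.
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    sent_b = (out.last_hidden_state * mask).sum(1) / mask.sum(1)  # (1, 768)
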
  • Finetuning on ref-davis17?

Nice work. The paper says 'most of our experiments follow the pretrain-then-finetune process.' However, this repository says 'As described in the paper, we report the results using the model trained on Ref-Youtube-VOS without finetuning.'

Did you finetune the pre-trained model on ref-davis17?

    opened by Jay-IPL 5
  • question about ref-davis evaluation

Hi, thanks for sharing the great work. I have a question about the ref-davis evaluation. After running the evaluation script, I see that a global_results-val.csv file is generated for each annotator. How do you get the metrics for the whole dataset as reported in the paper? Do you average the numbers over the four annotators? (A rough sketch of such averaging follows this comment.) Thank you!

    opened by joellliu 1
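
    As a rough sketch of the per-annotator averaging this comment asks about, assuming a hypothetical results layout and the "J&F-Mean" column name written by the DAVIS evaluation toolkit:

    import csv
    from pathlib import Path

    # Hypothetical layout: one global_results-val.csv per annotator directory.
    scores = []
    for f in sorted(Path("davis_results").glob("anno_*/global_results-val.csv")):
        with open(f) as fh:
            row = next(csv.DictReader(fh))            # single summary row
            scores.append(float(row["J&F-Mean"]))     # assumed column name

    print("mean J&F over annotators:", sum(scores) / len(scores))
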
  • The model with Video-Swin-S backbone trained on Ref-YouTube-VOS dataset is missing

For non-commercial research purposes I need to compare my R-VOS model with ReferFormer based on the Video-Swin-S backbone and trained on the Ref-YouTube-VOS dataset, but unfortunately the link attached in the README refers to another model. Could you please provide the correct model?

    opened by levon-khachatryan 2
  • ImageNet pre-trained checkpoint for Swin-L

    Hi,

which pre-trained checkpoint did you use for Swin-L? I mean the first pre-training step, i.e. ImageNet or something similar, not Ref-COCO, like the Kinetics checkpoints for Video Swin explained here: https://github.com/wjn922/ReferFormer/issues/16 Is it one of the checkpoints from https://github.com/microsoft/Swin-Transformer? If yes, which one? Thank you!

    opened by pvoigtlaender 0
  • Mismatch between implementation and conceptual explanation

    Hi All,

I have read your paper and it is quite interesting. However, I have a couple of questions about ReferFormer for better understanding. In referformer.py, the lines from 235 to 280 compute cross-modal attention before the features are fed to the deformable transformer, but this module is missing in Figure 2. What is it used for? (A generic sketch of this kind of cross-modal attention follows this comment.) Also, the deformable transformer contains a transformer encoder network. Is this the same transformer encoder block (blue) specified in Figure 2? From the code, it looks like there are two transformer encoders. Please clarify.

    Thank You, Raj

    opened by basavaraj-hampiholi 1
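
    As a generic sketch of the early cross-modal attention idea this comment refers to (visual tokens attending to text tokens before the transformer encoder), for illustration only and not taken from referformer.py:

    import torch
    import torch.nn as nn

    d_model = 256
    cross_attn = nn.MultiheadAttention(d_model, num_heads=8)

    vis = torch.randn(64 * 64, 2, d_model)   # (HW, batch, C) flattened visual tokens
    txt = torch.randn(12, 2, d_model)        # (L, batch, C) projected text tokens

    # Each visual token attends to the text tokens, so the features entering the
    # (deformable) encoder are already language-aware; a residual keeps the visuals.
    fused, _ = cross_attn(query=vis, key=txt, value=txt)
    vis = vis + fused                        # (4096, 2, 256)
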
Owner

Jonas Wu. The University of Hong Kong. PhD Candidate. Computer Vision.