"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

Overview

SOLQ: Segmenting Objects by Learning Queries

This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

Introduction

TL;DR. SOLQ is an end-to-end instance segmentation framework built on Transformers. It directly outputs instance masks without any box dependency.

Abstract. In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR, our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location and mask. The learned object queries perform classification, box regression and mask encoding simultaneously in a unified vector form. During training, the encoded mask vectors are supervised by the compression coding of raw spatial masks. At inference time, the produced mask vectors can be directly transformed back to spatial masks by the inverse of the compression coding. Experimental results show that SOLQ achieves state-of-the-art performance, surpassing most existing approaches. Moreover, the joint learning of the unified query representation greatly improves the detection performance of the original DETR. We hope SOLQ can serve as a strong baseline for Transformer-based instance segmentation.
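The compression coding here is a 2D discrete cosine transform (DCT) followed by zigzag ordering, keeping only the first low-frequency coefficients (the repo implements this in its ProcessorDCT class, as seen in the comments below). The following is a minimal sketch of the encode/decode round trip, assuming OpenCV and NumPy; the function names and the zigzag helper are illustrative, not the repo's exact API.

import cv2
import numpy as np

def zigzag_indices(n):
    # Flat indices of an n x n grid in zigzag (anti-diagonal) order,
    # so low-frequency DCT coefficients come first. Illustrative helper;
    # the exact traversal direction convention may differ from the repo's.
    order = sorted(((i, j) for i in range(n) for j in range(n)),
                   key=lambda p: (p[0] + p[1], p[1] if (p[0] + p[1]) % 2 else p[0]))
    return np.array([i * n + j for i, j in order])

def encode_mask(mask, gt_mask_len=128, n_keep=256):
    # Resize the binary mask, take the 2D DCT, and keep the first n_keep
    # coefficients in zigzag order -- this is the per-query mask vector.
    m = cv2.resize(mask.astype(np.float32), (gt_mask_len, gt_mask_len))
    coeffs = cv2.dct(m).flatten()
    return coeffs[zigzag_indices(gt_mask_len)][:n_keep]

def decode_mask(vector, gt_mask_len=128):
    # Inverse process: scatter the kept coefficients back in zigzag order,
    # apply the inverse DCT, and threshold to recover a binary mask.
    flat = np.zeros(gt_mask_len * gt_mask_len, dtype=np.float32)
    flat[zigzag_indices(gt_mask_len)[:len(vector)]] = vector
    return (cv2.idct(flat.reshape(gt_mask_len, gt_mask_len)) >= 0.5).astype(np.uint8)

As a quick check, decode_mask(encode_mask(m)) should closely reproduce a smooth binary mask m; the number of kept coefficients (n_keep) trades reconstruction detail against vector length.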

Main Results

Method  Backbone  Dataset        Box AP  Mask AP  Model
SOLQ    R50       COCO test-dev  47.8    39.7     google
SOLQ    R101      COCO test-dev  48.7    40.9     google
SOLQ    Swin-L    COCO test-dev  55.4    45.9     google

Installation

The codebase is built on top of Deformable DETR.

Requirements

  • Linux, CUDA>=9.2, GCC>=5.4

  • Python>=3.7

We recommend using Anaconda to create a conda environment:

    conda create -n deformable_detr python=3.7 pip

    Then, activate the environment:

    conda activate deformable_detr
  • PyTorch>=1.5.1, torchvision>=0.6.1 (following instructions here)

For example, if your CUDA version is 9.2, you could install pytorch and torchvision as follows:

    conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
  • Other requirements

    pip install -r requirements.txt
  • Build MultiScaleDeformableAttention

    cd ./models/ops
    sh ./make.sh
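
    Optionally, you can verify that the CUDA extension built correctly using the unit test shipped with Deformable DETR (this assumes models/ops/test.py was carried over from that codebase; it should print a series of True checks):

    python test.py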

Usage

Dataset preparation

Please download COCO and organize it as follows:

mkdir data && cd data
ln -s /path/to/coco coco
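
The scripts expect the standard COCO 2017 layout inside data/coco. As an assumed reference (following the usual Deformable DETR convention), the symlinked directory should look like:

coco/
├── train2017/
├── val2017/
└── annotations/
    ├── instances_train2017.json
    └── instances_val2017.json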

Training and Evaluation

Training on single node

Train SOLQ on 8 GPUs as follows:

sh configs/r50_solq_train.sh

Evaluation

You can download the pretrained model of SOLQ (the link is in the "Main Results" section), then run the following command to evaluate it on the COCO 2017 val dataset:

sh configs/r50_solq_eval.sh

Evaluation on COCO 2017 test-dev dataset

You can download the pretrained model of SOLQ (the link is in the "Main Results" section), then run the following command to evaluate it on the COCO 2017 test-dev dataset (and submit the results to the server):

sh configs/r50_solq_submit.sh

Visualization on COCO 2017 val dataset

You can visualize predictions on images as follows:

EXP_DIR=/path/to/checkpoint
python visual.py \
       --meta_arch solq \
       --backbone resnet50 \
       --with_vector \
       --with_box_refine \
       --masks \
       --batch_size 2 \
       --vector_hidden_dim 1024 \
       --vector_loss_coef 3 \
       --output_dir ${EXP_DIR} \
       --hidden_dim 384 \
       --resume ${EXP_DIR}/solq_r50_final.pth \
       --eval    

Citing SOLQ

If you find SOLQ useful in your research, please consider citing:

@article{dong2021solq,
  title={SOLQ: Segmenting Objects by Learning Queries},
  author={Bin Dong and Fangao Zeng and Tiancai Wang and Xiangyu Zhang and Yichen Wei},
  journal={arXiv preprint arXiv:2106.02351},
  year={2021}
}
Comments
  • Difference in feature usage between DeformableDETR and DeformableDETRsegm

    Could you please comment on the difference in feature usage between DeformableDETR and DeformableDETRsegm?

    The former uses all feature levels, while the latter uses only features[-1]...

    Thank you!

    https://github.com/megvii-research/SOLQ/blob/5471f58/models/deformable_detr.py#L136-L162 :

        features, pos = self.backbone(samples)

        # Project each backbone feature level and collect its padding mask.
        srcs = []
        masks = []
        for l, feat in enumerate(features):
            src, mask = feat.decompose()
            srcs.append(self.input_proj[l](src))
            masks.append(mask)
            assert mask is not None
        # If the backbone returns fewer levels than num_feature_levels,
        # synthesize the extra levels by strided projection of the last one.
        if self.num_feature_levels > len(srcs):
            _len_srcs = len(srcs)
            for l in range(_len_srcs, self.num_feature_levels):
                if l == _len_srcs:
                    src = self.input_proj[l](features[-1].tensors)
                else:
                    src = self.input_proj[l](srcs[-1])
                m = samples.mask
                mask = F.interpolate(m[None].float(), size=src.shape[-2:]).to(torch.bool)[0]
                pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
                srcs.append(src)
                masks.append(mask)
                pos.append(pos_l)

        # All feature levels are passed to the deformable transformer.
        query_embeds = None
        if not self.two_stage:
            query_embeds = self.query_embed.weight
        hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact, _, _ = self.transformer(srcs, masks, pos, query_embeds)
    
    

    https://github.com/megvii-research/SOLQ/blob/5471f58/models/segmentation.py#L49-L55 :

        # The DETR-style segmentation wrapper uses only the last (coarsest)
        # backbone feature level as the transformer input.
        features, pos = self.detr.backbone(samples)

        bs = features[-1].tensors.shape[0]

        src, mask = features[-1].decompose()
        src_proj = self.detr.input_proj(src)
        hs, memory = self.detr.transformer(src_proj, mask, self.detr.query_embed.weight, pos[-1])
    
    opened by vadimkantorov 4
  • Some questions about potential further experiments.

    Hi, I have been following DETR-related work. This work is really interesting; it provides a good supplementary answer to issue https://github.com/facebookresearch/detr/issues/163.

    Here, I still have some questions:

    1. I am surprised that UQR provides more than 7 points of improvement on AP^seg, as shown in Table 2. May I ask for a detailed comparison of AP^seg_S, AP^seg_M and AP^seg_L? I wonder where the main improvement comes from.
    2. As seen in Table 1, SOLQ performs best on AP^box_L (much better than other methods). I am confused about why its performance is not also best on AP^seg_L. Could I regard this as a drawback of the mask compression coding on large objects? I wonder whether SQR performs well on large objects while UQR performs better on small and medium objects. What do you think?
    3. If I understand correctly, this method can easily be applied to panoptic segmentation. May I ask for some results on panoptic segmentation? With those results, we could better analyze the gap between SOLQ and DETR.
    opened by dddzg 4
  • [bug] util/misc.py needs fixing to support torchvision 0.10.0 as in original DETR

    Version checking is buggy :(

    https://github.com/facebookresearch/detr/commit/b9048ebe86561594f1472139ec42327f00aba699

    Alternatively, all that legacy code can be removed and an assert added:

    import torchvision
    from packaging import version
    assert version.parse(torchvision.__version__) >= version.parse('0.8')

    opened by vadimkantorov 3
  • Performance on DETR

    As SOLQ is built on Deformable DETR, have you run the experiment on the original DETR model? I would be very grateful if you could provide the results.

    opened by Epiphqny 2
  • Swin-L Train and Test Image Resizing

    Hi,

    You mentioned:

    Higher performance (Box AP=56.5, Mask AP=46.7) is reported by training with long side 1536 on Swin-L backbone, instead of long side 1333.

    May I know the image resizing strategies during training and testing? I found some commented-out code for Swin-L: for training https://github.com/megvii-research/SOLQ/blob/main/datasets/coco.py#L135 and for testing https://github.com/megvii-research/SOLQ/blob/main/datasets/coco.py#L158. But for testing the long side is 1333 instead of 1536. Could you please clarify this? Thank you very much!

    opened by ilovecv 2
  • Semantics of with_vector

    1. Do I understand correctly that SOLQ(..., with_vector = False) is equivalent to vanilla DeformableDETR(...)?
    2. Why are postprocessors created only when args.eval is set? https://github.com/megvii-research/SOLQ/blob/b35360390b1d51d375dd9d03c39dbce663e223a7/models/solq.py#L614 The postprocessors should be applied during regular evaluation while training too, right? In the fast_solq code only args.masks is checked: https://github.com/megvii-research/SOLQ/blob/b35360390b1d51d375dd9d03c39dbce663e223a7/models/fast_solq.py#L588. What is the motivation for this difference?

    Thanks!

    opened by vadimkantorov 2
  • Add mask to the matching cost

    Hi authors, have you tried computing the matching cost considering cls, reg, and mask together? Is there any difference in performance? Thanks in advance.

    opened by zhanggang001 2
  • ImportError: cannot import name '_NewEmptyTensorOp' from 'torchvision.ops.misc'

    Error

    ImportError

    Steps to reproduce the behavior:

    1. Git clone the repository of SOLQ
    2. Update the dataset you want to use.
    3. Update the data paths in the file SOLQ/datasets/coco.py
    4. Run the bash file configs/r50_solq_train.sh

    Expected behavior

    It should not show the error and should proceed to run the SOLQ model.

    Environment

    PyTorch version: 1.9.0+cu102
    Is debug build: False
    CUDA used to build PyTorch: 10.2
    ROCM used to build PyTorch: N/A

    OS: Ubuntu 18.04.5 LTS (x86_64)
    GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
    CMake version: version 3.12.0
    Libc version: glibc-2.26

    Python version: 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0] (64-bit runtime)
    Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
    Is CUDA available: True
    CUDA runtime version: 11.0.221
    GPU models and configuration: GPU 0: Tesla T4
    Nvidia driver version: 460.32.03
    cuDNN version: Probably one of the following:
      /usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
      /usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
      /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
    HIP runtime version: N/A
    MIOpen runtime version: N/A

    Versions of relevant libraries:
    [pip3] numpy==1.19.5
    [pip3] torch==1.9.0+cu102
    [pip3] torchsummary==1.5.1
    [pip3] torchtext==0.10.0
    [pip3] torchvision==0.10.0+cu102
    [conda] Could not collect

    Additional context

    When running the script !bash configs/r50_solq_train.sh, it shows an ImportError as below:

    Traceback (most recent call last):
      File "main.py", line 22, in <module>
        import datasets
      File "/content/SOLQ/datasets/__init__.py", line 13, in <module>
        from .coco import build as build_coco
      File "/content/SOLQ/datasets/coco.py", line 23, in <module>
        from util.misc import get_local_rank, get_local_size
      File "/content/SOLQ/util/misc.py", line 36, in <module>
        from torchvision.ops.misc import _NewEmptyTensorOp
    ImportError: cannot import name '_NewEmptyTensorOp' from 'torchvision.ops.misc'
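
    This is the same torchvision 0.10 incompatibility tracked in the util/misc.py issue above. A minimal, hedged workaround (illustrative, not the repo's committed fix) is to guard the import; on torchvision >= 0.7 the legacy code path that needs _NewEmptyTensorOp is never taken, so a placeholder suffices:

    # in util/misc.py -- sketch of a guarded import for newer torchvision
    try:
        from torchvision.ops.misc import _NewEmptyTensorOp  # removed in newer torchvision
    except ImportError:
        _NewEmptyTensorOp = None  # legacy (<0.7) fallback path never runs on new torchvision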
    
    opened by sagnik1511 2
  • How should SOLQ be extended to panoptic segmentation?

    Hi authors,

    Thanks for the excellent work! I wish to extend SOLQ to panoptic segmentation. Shall I simply treat stuff and thing equally? Looking forward to your suggestions!

    opened by encounter1997 2
  • DCT visualize

    Hi there, I'm trying to visualize the mask after idct, the code is:

    import os
    import cv2
    import numpy as np
    import torch
    import matplotlib.pyplot as plt
    from models.solq import ProcessorDCT  # import path inside the SOLQ repo may differ

    save_dir = './vis'
    os.makedirs(save_dir, exist_ok=True)
    name = 'mask.npy'  # used for the output filenames below
    gt_mask_len = 512
    n_keep = 128
    processor_dct = ProcessorDCT(n_keep=n_keep, gt_mask_len=gt_mask_len)
    #mask = cv2.imread(img_path, 0).astype(np.float32)
    mask = np.load('mask.npy')
    new_mask = np.array((mask==1)).astype(np.float32)

    # forward: resize, DCT, reorder coefficients by the zigzag table
    new_mask = cv2.resize(new_mask, (gt_mask_len, gt_mask_len))
    coeffs = cv2.dct(new_mask)
    cv2.imwrite(os.path.join(save_dir, '{}_coeffs.png'.format(name.split('.')[0])), coeffs)

    idct = np.zeros((gt_mask_len**2))
    vectors = torch.from_numpy(coeffs).flatten()
    vectors = vectors[torch.tensor(processor_dct.zigzag_table)]

    # inverse: keep n_keep coefficients, undo the zigzag, inverse DCT, threshold
    idct[:n_keep] = vectors.cpu().numpy()
    idct = processor_dct.inverse_zigzag(idct, gt_mask_len, gt_mask_len)
    cv2.imwrite(os.path.join(save_dir, '{}_i_coeffs.png'.format(name.split('.')[0])), idct)
    re_mask = cv2.idct(idct)
    max_v = np.max(re_mask)
    min_v = np.min(re_mask)
    re_mask = np.where(re_mask>(max_v+min_v) / 2., 255, 0)
    cv2.imwrite(os.path.join(save_dir, '{}_recover.png'.format(name.split('.')[0])), re_mask)

    plt.figure(1)
    plt.imshow(new_mask)
    plt.figure(2)
    plt.imshow(re_mask)
    plt.show()
    

    (Attached images showing new_mask and the recovered re_mask are not reproduced here.)

    The re_mask looks different with the original mask.

    I was wondering why this is and if there is anything I am doing wrong.

    opened by ztjsw 2
  • training time

    “Thanks for your attention on SOLQ! It will take about 1.5 days and 2.0 days to train SOLQ with R50 and R101 backbones, respectively. As for the Swin-Large backbone, it will take nearly four days to train due to the large computation cost.”

    Excuse me, do you use 2 GPUs or 8 GPUs when training for 1.5 days? The README you provided uses 8 GPUs, so I am confused.

    opened by roar-1128 1
  • A question about the implementation of segmentation mask for Deformable Detr

    Hello,

    First of all, thank you for your great work.

    I want to ask about the input projection before MaskHeadSmallConv in segmentation.py. The implementation applies stride 2 to the features, which makes the finest stride 8. However, for segmentation tasks it is possible to get better results when stride 4 is used for mask creation; the original DETR segmentation head works that way. May I ask what the reason is for using that stride in your implementation?

    Thanks in advance

    opened by artest08 1
  • Swin-L pretrained checkpoints used

    Hi @dbofseuofhust, @vaesl!

    I can't find the URLs to the ImageNet-pretrained Swin-L checkpoints in the code. Which checkpoints did you use? https://github.com/microsoft/Swin-Transformer provides many different ones.

    Could you please publish a config for training using Swin-L?

    Are your modifications to swin_transformer.py upstreamed anywhere?

    I also wonder, have you tried other Swin backbones like Swin-S or Swin-B? The ESViT repo publishes some self-supervised trained Swin models, but only for Swin-S/T/B: https://github.com/microsoft/esvit ...

    Thank you!

    opened by vadimkantorov 1
  • Attention visualization code: Fig. 4 from Appendix A.3

    Hi!

    Would you have any guidance or code for reproducing the decoder attention visualization (Fig. 4 from A.3)? I'm worried about making padding-related mistakes while working with sampled_locations.

    Thanks!

    opened by vadimkantorov 0
Owner
MEGVII Research
Power Human with AI. Continuously innovating to expand the boundaries of cognition; extraordinary technology creates product value.