"SOLQ: Segmenting Objects by Learning Queries", SOLQ is an end-to-end instance segmentation framework with Transformer.

MEGVII Research

Last update: Jan 2, 2023

SOLQ: Segmenting Objects by Learning Queries

This repository is an official implementation of the paper SOLQ: Segmenting Objects by Learning Queries.

Introduction

TL;DR. SOLQ is an end-to-end instance segmentation framework built on the Transformer. It directly outputs instance masks without any box dependency.

Abstract. In this paper, we propose an end-to-end framework for instance segmentation. Based on the recently introduced DETR, our method, termed SOLQ, segments objects by learning unified queries. In SOLQ, each query represents one object and has multiple representations: class, location and mask. The learned object queries perform classification, box regression and mask encoding simultaneously in a unified vector form. During the training phase, the encoded mask vectors are supervised by the compression coding of raw spatial masks. At inference time, the produced mask vectors can be directly transformed to spatial masks by the inverse process of compression coding. Experimental results show that SOLQ can achieve state-of-the-art performance, surpassing most existing approaches. Moreover, the joint learning of the unified query representation can greatly improve the detection performance of the original DETR. We hope SOLQ can serve as a strong baseline for Transformer-based instance segmentation.
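
To make the "unified query" idea concrete, here is a minimal NumPy sketch in which one query embedding feeds three heads: class logits, a box, and a mask vector. It is not the repository's model code. The layer sizes, the random weights and the helper names (`linear`, `predict`) are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values read from the repository.
NUM_QUERIES, HIDDEN_DIM = 5, 384      # one embedding per object query
NUM_CLASSES, MASK_VEC_DIM = 91, 256   # class logits and compressed mask vector length

def linear(in_dim, out_dim):
    """Return a random (weight, bias) pair standing in for a learned layer."""
    return rng.normal(0.0, 0.02, (in_dim, out_dim)), np.zeros(out_dim)

class_head = linear(HIDDEN_DIM, NUM_CLASSES)
box_head = linear(HIDDEN_DIM, 4)
mask_head = linear(HIDDEN_DIM, MASK_VEC_DIM)

def predict(queries):
    """Apply the three heads to the same query embeddings."""
    logits = queries @ class_head[0] + class_head[1]
    # Sigmoid keeps the box (cx, cy, w, h) in normalised [0, 1] coordinates.
    boxes = 1.0 / (1.0 + np.exp(-(queries @ box_head[0] + box_head[1])))
    mask_vectors = queries @ mask_head[0] + mask_head[1]
    return logits, boxes, mask_vectors

queries = rng.normal(size=(NUM_QUERIES, HIDDEN_DIM))
logits, boxes, mask_vectors = predict(queries)
print(logits.shape, boxes.shape, mask_vectors.shape)
```

Each row of `mask_vectors` would then be decoded to a spatial mask by inverting the compression coding, so no box-dependent mask branch is needed.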

Main Results

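| Method | Backbone | Dataset | Box AP | Mask AP | Model |
|--------|----------|----------|--------|---------|--------|
| SOLQ | R50 | test-dev | 47.8 | 39.7 | google |
| SOLQ | R101 | test-dev | 48.7 | 40.9 | google |
| SOLQ | Swin-L | test-dev | 55.4 | 45.9 | google |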

Installation

The codebase is built on top of Deformable DETR.

Requirements

Linux, CUDA>=9.2, GCC>=5.4
Python>=3.7

We recommend using Anaconda to create a conda environment:
```
conda create -n deformable_detr python=3.7 pip
```
Then, activate the environment:
```
conda activate deformable_detr
```
PyTorch>=1.5.1, torchvision>=0.6.1 (following the official PyTorch installation instructions)

For example, if your CUDA version is 9.2, you could install PyTorch and torchvision as follows:
```
conda install pytorch=1.5.1 torchvision=0.6.1 cudatoolkit=9.2 -c pytorch
```
Other requirements
```
pip install -r requirements.txt
```
Build MultiScaleDeformableAttention
```
cd ./models/ops
sh ./make.sh
```
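
After the build, a quick check can confirm that the pieces are visible to Python. The sketch below only looks the modules up without importing them. The module name `MultiScaleDeformableAttention` is an assumption based on the extension name used by Deformable DETR's `models/ops` build, and `check_environment` is a hypothetical helper, not part of this repository.

```python
import importlib.util
import sys

# Assumed module names; "MultiScaleDeformableAttention" is the compiled extension
# that the make.sh step is expected to install.
REQUIRED_MODULES = ("torch", "torchvision", "MultiScaleDeformableAttention")

def check_environment(modules=REQUIRED_MODULES):
    """Return {module_name: found} by looking modules up without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}

if __name__ == "__main__":
    print("Python >= 3.7:", sys.version_info >= (3, 7))
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```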

Usage

Dataset preparation

Please download COCO and organize it as follows:

mkdir data && cd data
ln -s /path/to/coco coco
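
Before training, it can help to verify that the symlink resolves to a complete dataset. The sketch below assumes the standard COCO 2017 layout (`train2017`, `val2017` and `annotations/instances_*.json`). The README does not spell that layout out, so adjust `EXPECTED` if your copy differs. `missing_coco_entries` is a hypothetical helper.

```python
from pathlib import Path

# Assumed COCO 2017 layout; not taken from this repository.
EXPECTED = (
    "train2017",
    "val2017",
    "annotations/instances_train2017.json",
    "annotations/instances_val2017.json",
)

def missing_coco_entries(root="data/coco", expected=EXPECTED):
    """Return the expected entries that do not exist under root (symlinks are followed)."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]

if __name__ == "__main__":
    missing = missing_coco_entries()
    print("COCO layout looks complete" if not missing else f"Missing: {missing}")
```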

Training and Evaluation

Training on single node

Train SOLQ on 8 GPUs as follows:

sh configs/r50_solq_train.sh

Evaluation

You can download the pretrained SOLQ model (the link is in the "Main Results" section), then run the following command to evaluate it on the COCO 2017 val dataset:

sh configs/r50_solq_eval.sh

Evaluation on COCO 2017 test-dev dataset

You can download the pretrained SOLQ model (the link is in the "Main Results" section), then run the following command to evaluate it on the COCO 2017 test-dev dataset (the results are submitted to the evaluation server):

sh configs/r50_solq_submit.sh

Visualization on COCO 2017 val dataset

You can visualize the predictions on images as follows:

EXP_DIR=/path/to/checkpoint
python visual.py \
       --meta_arch solq \
       --backbone resnet50 \
       --with_vector \
       --with_box_refine \
       --masks \
       --batch_size 2 \
       --vector_hidden_dim 1024 \
       --vector_loss_coef 3 \
       --output_dir ${EXP_DIR} \
       --hidden_dim 384 \
       --resume ${EXP_DIR}/solq_r50_final.pth \
       --eval
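
For readers who want to draw predicted masks themselves, the following is a minimal sketch of alpha-blending a binary mask onto an image with NumPy. It is not the code in `visual.py`. The function name `overlay_mask`, the colour and the blending weight are all assumptions.

```python
import numpy as np

def overlay_mask(image, mask, color=(255, 0, 0), alpha=0.5):
    """Blend `color` into `image` wherever the boolean `mask` is True.

    image: uint8 array of shape (H, W, 3); mask: bool array of shape (H, W).
    """
    out = image.astype(np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * np.asarray(color, dtype=np.float32)
    return out.round().astype(np.uint8)

# Toy example: a black 4x4 image with a 2x2 mask in the centre.
image = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[1:3, 1:3] = True
blended = overlay_mask(image, mask)
print(blended[1, 1], blended[0, 0])
```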

Citing SOLQ

If you find SOLQ useful in your research, please consider citing:

@article{dong2021solq,
  title={SOLQ: Segmenting Objects by Learning Queries},
  author={Dong, Bin and Zeng, Fangao and Wang, Tiancai and Zhang, Xiangyu and Wei, Yichen},
  journal={arXiv preprint arXiv:2106.02351},
  year={2021}
}

Comments

Difference in feature usage between DeformableDETR and DeformableDETRsegm

Could you please comment on the difference in feature usage between DeformableDETR and DeformableDETRsegm?

The former uses all elements of features, and the latter uses only features[-1]...

Thank you!

https://github.com/megvii-research/SOLQ/blob/5471f58/models/deformable_detr.py#L136-L162 :

        features, pos = self.backbone(samples)

        srcs = []
        masks = []
        for l, feat in enumerate(features):
            src, mask = feat.decompose()
            srcs.append(self.input_proj[l](src))
            masks.append(mask)
            assert mask is not None
        if self.num_feature_levels > len(srcs):
            _len_srcs = len(srcs)
            for l in range(_len_srcs, self.num_feature_levels):
                if l == _len_srcs:
                    src = self.input_proj[l](features[-1].tensors)
                else:
                    src = self.input_proj[l](srcs[-1])
                m = samples.mask
                mask = F.interpolate(m[None].float(), size=src.shape[-2:]).to(torch.bool)[0]
                pos_l = self.backbone[1](NestedTensor(src, mask)).to(src.dtype)
                srcs.append(src)
                masks.append(mask)
                pos.append(pos_l)

        query_embeds = None
        if not self.two_stage:
            query_embeds = self.query_embed.weight
        hs, init_reference, inter_references, enc_outputs_class, enc_outputs_coord_unact, _, _ = self.transformer(srcs, masks, pos, query_embeds)

https://github.com/megvii-research/SOLQ/blob/5471f58/models/segmentation.py#L49-L55 :

        features, pos = self.detr.backbone(samples)

        bs = features[-1].tensors.shape[0]

        src, mask = features[-1].decompose()
        src_proj = self.detr.input_proj(src)
        hs, memory = self.detr.transformer(src_proj, mask, self.detr.query_embed.weight, pos[-1])

opened by vadimkantorov 4

Some questions about potential further experiments.
Hi, I have been following DETR-related work, and this work is really interesting. It can be a good supplementary answer to the issue https://github.com/facebookresearch/detr/issues/163.

I still have some questions:

I am surprised that UQR provides more than 7 points of improvement on AP^seg, as shown in Table 2. May I ask for a detailed comparison of AP^seg_S, AP^seg_M and AP^seg_L? I wonder where the main improvement comes from.

As seen in Table 1, SOLQ performs best on AP^box_L (much better than other methods). I am confused about why its performance is not the best on AP^seg_L. Could I regard this as a drawback of the mask compression coding on large objects? I wonder whether SQR could perform well on large objects while UQR performs better on small and medium objects. What do you think?

If I understand correctly, this method can easily be applied to panoptic segmentation. May I ask for some panoptic segmentation results? With them, we could better analyse the gap between SOLQ and DETR.
opened by dddzg 4
[bug] util/misc.py needs fixing to support torchvision 0.10.0 as in original DETR

Version checking is buggy :(

https://github.com/facebookresearch/detr/commit/b9048ebe86561594f1472139ec42327f00aba699

Alternatively, all that legacy code can be removed and an assert added: from packaging import version; assert version.parse(torchvision.__version__) >= version.parse('0.8')
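
To illustrate why the check misfires, here is a small standard-library sketch. It assumes the legacy check has the form `float(torchvision.__version__[:3]) < 0.7`, which is my reading of the DETR code before the linked commit, not something verified against this repository. `release_tuple` is a hypothetical helper for environments where `packaging` is not installed.

```python
import re

def release_tuple(version_string):
    """Turn '0.10.0+cu102' into (0, 10, 0), ignoring local or pre-release suffixes."""
    match = re.match(r"(\d+(?:\.\d+)*)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version: {version_string!r}")
    return tuple(int(part) for part in match.group(1).split("."))

tv_version = "0.10.0+cu102"  # example torchvision version string

# Assumed form of the legacy check: the slice "0.1" makes 0.10 look older than 0.7.
legacy_by_float = float(tv_version[:3]) < 0.7
# Comparing integer tuples orders 0.10 after 0.7, as intended.
legacy_by_tuple = release_tuple(tv_version) < (0, 7)

print(legacy_by_float, legacy_by_tuple)  # True False
```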

opened by vadimkantorov 3
Performance on DETR

Since SOLQ is built on Deformable DETR (D-DETR), have you also run the experiment on the original DETR model? I would be very grateful if you could provide the results.

opened by Epiphqny 2
Swin-L Train and Test Image Resizing

Hi,

You mentioned:

Higher performance (Box AP=56.5, Mask AP=46.7) is reported by training with long side 1536 on Swin-L backbone, instead of long side 1333.

May I know the image resizing strategies used during training and testing? I found some commented-out code for Swin-L, for training at https://github.com/megvii-research/SOLQ/blob/main/datasets/coco.py#L135 and for testing at https://github.com/megvii-research/SOLQ/blob/main/datasets/coco.py#L158. For testing, however, the long side is 1333 instead of 1536. Could you please clarify this? Thank you very much!
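
To show what the long-side cap changes, here is a small sketch of a DETR-style resize rule: scale the image so the short side reaches a target unless the long side would exceed the cap. The rule itself, the short-side value of 800 and the function name `resized_shape` are assumptions for illustration, not settings read from `datasets/coco.py`.

```python
def resized_shape(width, height, short_side, max_long_side):
    """Scale so the short side reaches `short_side`, unless the long side would exceed the cap."""
    scale = min(short_side / min(width, height), max_long_side / max(width, height))
    return round(width * scale), round(height * scale)

# A wide 1000x400 image hits the long-side cap before the short side reaches 800.
for cap in (1333, 1536):
    print(cap, resized_shape(1000, 400, short_side=800, max_long_side=cap))
```

With a cap of 1333 the image becomes 1333x533, and with 1536 it becomes 1536x614, so a mismatch between training and testing caps changes the object scale the model sees.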

opened by ilovecv 2
Semantics of with_vector
Do I understand correctly that SOLQ(..., with_vector = False) is equivalent to vanilla DeformableDETR(...)?

Why are postprocessors created only in args.eval? https://github.com/megvii-research/SOLQ/blob/b35360390b1d51d375dd9d03c39dbce663e223a7/models/solq.py#L614 The postprocessors should be applied at regular evaluation during training too, right? In fast_solq code only args.masks is checked: https://github.com/megvii-research/SOLQ/blob/b35360390b1d51d375dd9d03c39dbce663e223a7/models/fast_solq.py#L588. What's the motivation for this difference?

Thanks!
opened by vadimkantorov 2
Add mask to the matching cost

Hi authors, have you tried computing the matching cost with the cls, reg and mask terms together? Does it make any difference in performance? Thanks in advance.
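
For concreteness, a matching cost with a mask term could look like the sketch below. It is a toy illustration of the idea in the question, not the repository's matcher. The weights, the L1 distance on mask vectors and the dictionary layout are assumptions, and the exhaustive search stands in for the Hungarian algorithm that a real matcher would use.

```python
from itertools import permutations

def l1(a, b):
    """Sum of absolute differences between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def matching_cost(pred, target, w_cls=2.0, w_box=5.0, w_mask=1.0):
    """Cost of assigning one prediction to one ground-truth object (weights are assumed)."""
    cost_cls = -pred["probs"][target["label"]]
    cost_box = l1(pred["box"], target["box"])
    cost_mask = l1(pred["mask_vec"], target["mask_vec"])
    return w_cls * cost_cls + w_box * cost_box + w_mask * cost_mask

def best_assignment(preds, targets, **weights):
    """Return, for each target, the index of its matched prediction (toy exhaustive search)."""
    candidates = permutations(range(len(preds)), len(targets))
    return min(
        candidates,
        key=lambda perm: sum(matching_cost(preds[p], t, **weights) for p, t in zip(perm, targets)),
    )

preds = [
    {"probs": [0.1, 0.9], "box": [0.5, 0.5, 0.2, 0.2], "mask_vec": [1.0, 0.0]},
    {"probs": [0.8, 0.2], "box": [0.2, 0.2, 0.1, 0.1], "mask_vec": [0.0, 1.0]},
    {"probs": [0.5, 0.5], "box": [0.9, 0.9, 0.1, 0.1], "mask_vec": [0.0, 0.0]},
]
targets = [
    {"label": 0, "box": [0.2, 0.2, 0.1, 0.1], "mask_vec": [0.0, 1.0]},
    {"label": 1, "box": [0.5, 0.5, 0.2, 0.2], "mask_vec": [1.0, 0.0]},
]
print(best_assignment(preds, targets))  # (1, 0)
```

Setting `w_mask=0.0` recovers a class-plus-box cost, which makes it easy to compare the two assignments on the same inputs.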

opened by zhanggang001 2
ImportError: cannot import name '_NewEmptyTensorOp' from 'torchvision.ops.misc'
Error

ImportError

Steps to reproduce the behavior:

Git clone the repository of SOLQ

Update the dataset you want to use.

Update the data paths in the file SOLQ/datasets/coco.py

Run the bash file configs/r50_solq_train.sh

Expected behavior

It should not show the error and should go on to run the SOLQ model.

Environment

PyTorch version: 1.9.0+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 18.04.5 LTS (x86_64)
GCC version: (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Clang version: 6.0.0-1ubuntu2 (tags/RELEASE_600/final)
CMake version: version 3.12.0
Libc version: glibc-2.26

Python version: 3.7.11 (default, Jul 3 2021, 18:01:19) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.104+-x86_64-with-Ubuntu-18.04-bionic
Is CUDA available: True
CUDA runtime version: 11.0.221
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 460.32.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.6.5
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.9.0+cu102
[pip3] torchsummary==1.5.1
[pip3] torchtext==0.10.0
[pip3] torchvision==0.10.0+cu102
[conda] Could not collect

Additional context

Running the script with !bash configs/r50_solq_train.sh shows the following ImportError:

Traceback (most recent call last):
  File "main.py", line 22, in <module>
    import datasets
  File "/content/SOLQ/datasets/__init__.py", line 13, in <module>
    from .coco import build as build_coco
  File "/content/SOLQ/datasets/coco.py", line 23, in <module>
    from util.misc import get_local_rank, get_local_size
  File "/content/SOLQ/util/misc.py", line 36, in <module>
    from torchvision.ops.misc import _NewEmptyTensorOp
opened by sagnik1511 2
How should SOLQ be extended to panoptic segmentation?

Hi authors,

Thanks for the excellent work! I wish to extend SOLQ to panoptic segmentation. Shall I simply treat stuff and thing equally? Looking forward to your suggestions!

opened by encounter1997 2

DCT visualize

Hi there, I'm trying to visualize the mask after idct, the code is:

save_dir = './vis'
gt_mask_len = 512
n_keep = 128
processor_dct = ProcessorDCT(n_keep=n_keep, gt_mask_len=gt_mask_len)
#mask = cv2.imread(img_path, 0).astype(np.float32)
mask = np.load('mask.npy')
new_mask = np.array((mask==1)).astype(np.float32)

new_mask = cv2.resize(new_mask, (gt_mask_len, gt_mask_len))
coeffs = cv2.dct(new_mask)
cv2.imwrite(os.path.join(save_dir, '{}_coeffs.png'.format(name.split('.')[0])), coeffs)

idct = np.zeros((gt_mask_len**2))
vectors = torch.from_numpy(coeffs).flatten()
vectors = vectors[torch.tensor(processor_dct.zigzag_table)]

idct[:n_keep] = vectors.cpu().numpy()
idct = processor_dct.inverse_zigzag(idct, gt_mask_len, gt_mask_len)
cv2.imwrite(os.path.join(save_dir, '{}_i_coeffs.png'.format(name.split('.')[0])), idct)
re_mask = cv2.idct(idct)
max_v = np.max(re_mask)
min_v = np.min(re_mask)
re_mask = np.where(re_mask>(max_v+min_v) / 2., 255, 0)
cv2.imwrite(os.path.join(save_dir, '{}_recover.png'.format(name.split('.')[0])), re_mask)


plt.figure(1)
plt.imshow(new_mask)
plt.figure(2)
plt.imshow(re_mask)

new_mask shows:

re_mask shows:

The re_mask looks different from the original mask.

I was wondering why this is and if there is anything I am doing wrong.
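
For comparison, here is a self-contained NumPy round trip of the same idea: 2D DCT, keep the first `n_keep` coefficients in zigzag order, place them back at their zigzag positions, invert, and threshold. It is a sketch under assumptions and not the repository's `ProcessorDCT`. The zigzag order is one common convention, and the fixed 0.5 threshold assumes a 0/1 input mask. Note that truncating coefficients is lossy, so some difference from the original mask is expected whenever `n_keep` is small.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis: C @ x is the DCT of x, and C.T inverts it."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0] /= np.sqrt(2.0)
    return c

def zigzag_indices(n):
    """Flat indices of an n x n grid ordered from low to high frequency."""
    order = sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )
    return np.array([r * n + c for r, c in order])

def encode(mask, n_keep):
    """Compress a square 0/1 mask into its first n_keep zigzag-ordered DCT coefficients."""
    n = mask.shape[0]
    c = dct_matrix(n)
    coeffs = c @ mask.astype(np.float64) @ c.T
    return coeffs.flatten()[zigzag_indices(n)[:n_keep]]

def decode(vector, n):
    """Scatter the coefficients back to their zigzag positions, invert the DCT, threshold."""
    flat = np.zeros(n * n)
    flat[zigzag_indices(n)[: len(vector)]] = vector
    c = dct_matrix(n)
    recon = c.T @ flat.reshape(n, n) @ c
    return recon > 0.5

mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
approx = decode(encode(mask, n_keep=128), 32)
iou = (approx & mask).sum() / (approx | mask).sum()
print(f"IoU with 128 of 1024 coefficients: {iou:.3f}")
```

The key detail is that `decode` writes each kept coefficient back to the position it was read from. If the coefficients are written into the first `n_keep` slots in row-major order instead, the inverse transform reconstructs a different image.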

opened by ztjsw 2

training time

“Thanks for your attention on SOLQ! It will take about 1.5 days and 2.0 days to train SOLQ with R50 and R101 backbones, respectively. As for the Swin-Large backbone, it will take nearly four days to train due to the large computation cost.”

Excuse me, did you use 2 GPUs or 8 GPUs to train for 1.5 days? The README you provide uses 8 GPUs, so I am confused.

opened by roar-1128 1
A question about the implementation of segmentation mask for Deformable Detr

Hello,

First of all, thank you for your great work.

I want to ask about the input projection applied before MaskHeadSmallConv in segmentation.py. The implementation applies stride 2 to the features, which makes the best available stride 8. For segmentation, however, better results are possible when stride-4 features are used to create the masks, and the original segmentation head in DETR works that way. What is the reason for choosing this stride in your implementation?

Thanks in advance

opened by artest08 1
Swin-L pretrained checkpoints used

Hi @dbofseuofhust, @vaesl!

I can't find the URLs of the ImageNet-pretrained Swin-L checkpoints in the code. Which checkpoints did you use? https://github.com/microsoft/Swin-Transformer provides many different ones.

Could you please publish a config for training using Swin-L?

Are your modifications to swin_transformer.py upstreamed anywhere?

I also wonder, have you tried other Swin backbones like Swin-S or Swin-B? ESViT repo publishes some self-sup trained Swin, but they are only for Swin-S/T/B: https://github.com/microsoft/esvit ...

Thank you!

opened by vadimkantorov 1
Attention visualization code: Fig. 4 from Appendix A.3

Hi!

Would you have any guidance or code for reproducing the decoder attention visualization (Fig. 4 from A.3)? I'm worried about making padding-related mistakes while working with sampled_locations.

Thanks!

opened by vadimkantorov 0

Owner

MEGVII Research

Power Human with AI. Continuous innovation expands the boundaries of cognition; extraordinary technology creates product value.

GitHub

ISTR: End-to-End Instance Segmentation with Transformers (https://arxiv.org/abs/2105.00637)

This is the project page for the paper: ISTR: End-to-End Instance Segmentation via Transformers, Jie Hu, Liujuan Cao, Yao Lu, ShengChuan Zhang, Yan Wa

182 Dec 19, 2022

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation

E2EC: An End-to-End Contour-based Method for High-Quality High-Speed Instance Segmentation E2EC: An End-to-End Contour-based Method for High-Quality H

146 Dec 29, 2022

🐤 Nix-TTS: An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation

🐤 Nix-TTS An Incredibly Lightweight End-to-End Text-to-Speech Model via Non End-to-End Distillation Rendi Chevi, Radityo Eko Prasojo, Alham Fikri Aji

156 Jan 9, 2023

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Swin Transformer for Object Detection This repo contains the supported code and configuration files to reproduce object detection results of Swin Tran

1.4k Dec 30, 2022

Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation

Leveraging Instance-, Image- and Dataset-Level Information for Weakly Supervised Instance Segmentation This paper has been accepted and early accessed

39 Sep 20, 2022

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation "

nnFormer: Interleaved Transformer for Volumetric Segmentation Code for paper "nnFormer: Interleaved Transformer for Volumetric Segmentation ". Please

610 Dec 28, 2022

Temporally Efficient Vision Transformer for Video Instance Segmentation, CVPR 2022, Oral

Temporally Efficient Vision Transformer for Video Instance Segmentation Temporally Efficient Vision Transformer for Video Instance Segmentation (CVPR

203 Dec 31, 2022

TorchDistiller - a collection of the open source pytorch code for knowledge distillation, especially for the perception tasks, including semantic segmentation, depth estimation, object detection and instance segmentation.

This project is a collection of the open source pytorch code for knowledge distillation, especially for the perception tasks, including semantic segmentation, depth estimation, object detection and instance segmentation.

147 Dec 3, 2022

[ArXiv 2021] Data-Efficient Instance Generation from Instance Discrimination

InsGen - Data-Efficient Instance Generation from Instance Discrimination Data-Efficient Instance Generation from Instance Discrimination Ceyuan Yang,

GenForce: May Generative Force Be with You

93 Dec 25, 2022

[CVPR'21] Multi-Modal Fusion Transformer for End-to-End Autonomous Driving

TransFuser This repository contains the code for the CVPR 2021 paper Multi-Modal Fusion Transformer for End-to-End Autonomous Driving. If you find our

695 Jan 5, 2023

This repository is an official implementation of the paper MOTR: End-to-End Multiple-Object Tracking with TRansformer.

MOTR: End-to-End Multiple-Object Tracking with TRansformer This repository is an official implementation of the paper MOTR: End-to-End Multiple-Object

348 Jan 7, 2023

Code & Models for 3DETR - an End-to-end transformer model for 3D object detection

3DETR: An End-to-End Transformer Model for 3D Object Detection PyTorch implementation and models for 3DETR. 3DETR (3D DEtection TRansformer) is a simp

487 Dec 31, 2022

METER: Multimodal End-to-end TransformER

METER Code and pre-trained models will be publicized soon. Citation @article{dou2021meter, title={An Empirical Study of Training End-to-End Vision-a

257 Jan 6, 2023

Pytorch library for end-to-end transformer models training and serving

768 Jan 1, 2023

Learning recognition/segmentation models without end-to-end training. 40%-60% less GPU memory footprint. Same training time. Better performance.

InfoPro-Pytorch The Information Propagation algorithm for training deep networks with local supervision. (ICLR 2021) Revisiting Locally Supervised Lea

78 Dec 27, 2022

Demo for the paper "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation"

End-to-end image segmentation kit based on PaddlePaddle.

VSR-Transformer - This paper proposes a new Transformer for video super-resolution (called VSR-Transformer).

An end-to-end PyTorch framework for image and video classification