Fast Convergence of DETR with Spatially Modulated Co-Attention

Overview

This repository contains the code for SMCA-DETR (Fast Convergence of DETR with Spatially Modulated Co-Attention). SMCA replaces DETR's decoder co-attention with a spatially modulated co-attention that concentrates attention around each query's estimated object location, which speeds up convergence.

Usage

There are no extra compiled components in SMCA-DETR and package dependencies are minimal, so the code is simple to use. We provide instructions for installing dependencies via conda. First, clone the repository locally:

git clone https://github.com/facebookresearch/detr.git

Then, install PyTorch 1.5+ and torchvision 0.6+:

conda install -c pytorch pytorch torchvision

Install pycocotools (for evaluation on COCO) and scipy (for training):

conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

That's it; you should now be able to train and evaluate detection models.

(Optional) To work with panoptic segmentation, install panopticapi:

pip install git+https://github.com/cocodataset/panopticapi.git

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images
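
To sanity-check the layout before training, you can load the annotation files with the COCO API installed above (the path below is a placeholder; adjust it to your setup):

from pycocotools.coco import COCO

# COCO 2017 val should contain 5000 images and 80 object categories.
coco_val = COCO('path/to/coco/annotations/instances_val2017.json')
print(len(coco_val.getImgIds()), 'images,', len(coco_val.getCatIds()), 'categories')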

Training

To train single-scale SMCA on a single node with 8 GPUs for 50 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py --coco_path /path/to/coco --batch_size 2 --lr_drop 40 --num_queries 300 --epochs 50 --dynamic_scale type3 --output_dir smca_single_scale


A single epoch takes about 30 minutes, so 50-epoch training takes around 25 hours on a single machine with 8 V100 cards.
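
Evaluation is not spelled out here, but since the training entry point follows DETR's main.py, a command along the following lines (with the same model flags as training and a placeholder checkpoint path) should run validation on COCO val:

python main.py --coco_path /path/to/coco --batch_size 2 --num_queries 300 --dynamic_scale type3 --eval --resume smca_single_scale/checkpoint.pth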

Object Detection

Model Zoo

| name | dataset | backbone | schedule (epochs) | box AP |
| --- | --- | --- | --- | --- |
| SMCA (single scale) | MS COCO | R50 | 50 | 41.0 |
| SMCA-Container (single scale) | MS COCO | Container-S-Light | 50 | 44.2 |
| SMCA-Container (single scale) | MS COCO | Container-M | 50 | 47.3 |
| SMCA (single scale) | MS COCO | R50 | 108 | 42.7 |
| SMCA (single scale) | MS COCO | R50 | 250 | 43.5 |
| SMCA (multi scale) | MS COCO | R50 | 50 | 43.7 |
| SMCA (new multi scale) | MS COCO | R50 | 50 | 44.4 |
| SMCA | Visual Genome | R50 | 50 | coming soon |

Panoptic Segmentation

Model Zoo

| name | dataset | backbone | schedule (epochs) | PQ | SQ | RQ |
| --- | --- | --- | --- | --- | --- | --- |
| Mask-Former (single scale) | MS COCO | R50 | 500 | 46.5 | 80.4 | 56.8 |
| SMCA-Mask-Former (single scale) | MS COCO | R50 | 50 | 46.0 | 80.4 | 56.0 |

The original SMCA code submission from the ICCV review period is available at https://github.com/abc403/SMCA-replication.

Release Steps

  1. Single-scale SMCA
  2. Single-scale SMCA with Container-Small
  3. Single-scale SMCA with Container-Medium
  4. New Multi-scale SMCA (Newly added Multi_scale_SMCA.zip, 9th Sep)
  5. SMCA-DETR for Fast Convergence of Panoptic Segmentation

Citation

If you find this repository useful, please consider citing our work:

@article{gao2021fast,
  title={Fast convergence of detr with spatially modulated co-attention},
  author={Gao, Peng and Zheng, Minghang and Wang, Xiaogang and Dai, Jifeng and Li, Hongsheng},
  journal={arXiv preprint arXiv:2101.07448},
  year={2021}
}
@article{gao2021container,
  title={Container: Context Aggregation Network},
  author={Gao, Peng and Lu, Jiasen and Li, Hongsheng and Mottaghi, Roozbeh and Kembhavi, Aniruddha},
  journal={arXiv preprint arXiv:2106.01401},
  year={2021}
}
@article{zheng2020end,
  title={End-to-end object detection with adaptive clustering transformer},
  author={Zheng, Minghang and Gao, Peng and Wang, Xiaogang and Li, Hongsheng and Dong, Hao},
  journal={arXiv preprint arXiv:2011.09315},
  year={2020}
}

Contributors

Peng Gao, Qiu Han, Minghang Zeng

Acknowledgements

This project borrows heavily from DETR and is partially motivated by Sparse R-CNN.

Comments
  • Some question about the SMCA module and the code

    Hi, I'm very interested in your work on the new decoder for DETR. I have some questions about your code:

    1. When I debugged the source code, I couldn't find the FPN encoder components mentioned in the paper, such as intra-scale self-attention and multi-scale self-attention, nor the scale-selection network in the decoder.

    2. Do the 'type1'-'type4' settings in the code only change how the Gaussian-like weight map is generated?

    3. In the decoder layer of the Transformer, the forward pass does:

        out = self.norm4(tgt + query_pos)
        point_sigmoid_offset = self.point2(out)

    Does the point_sigmoid_offset parameter correspond to s_w and s_h in the paper?

        if self.layer_index == 0:
            point_sigmoid_ref_inter = self.point1(out)
            point_sigmoid_ref = point_sigmoid_ref_inter.sigmoid()
            point_sigmoid_ref = (h_w - 0) * point_sigmoid_ref / 32
            point_sigmoid_ref = point_sigmoid_ref.repeat(1, 1, 8)  # [100, bs, 2] -> [100, bs, 16]
        else:
            point_sigmoid_ref = point_ref_previous
        point = point_sigmoid_ref + point_sigmoid_offset

    Does the point_sigmoid_ref parameter correspond to c_w and c_h in the paper? Why do these two terms need to be added? The step

        distance = (point.unsqueeze(1) - grid.unsqueeze(0)).pow(2)

    corresponds to (i - c_w)^2 + (j - c_h)^2 in G(i, j) from the paper.

      if self.dynamic_scale == "type1":
            scale = 1
            distance = distance.sum(-1) * scale 
        elif self.dynamic_scale == "type2": # 对于type2:对out2再做一次线性映射 
            scale = self.point3(out)  # [100, bs, 256] -> [100, bs, 8]
            scale = scale * scale     # 
            scale = scale.reshape(tgt_len, -1).unsqueeze(1)
            distance = distance.sum(-1) * scale
        elif self.dynamic_scale == "type3":
            scale = self.point3(out)
            scale = scale * scale
            scale = scale.reshape(tgt_len, -1, 2).unsqueeze(1)
            distance = (distance * scale).sum(-1)
        elif self.dynamic_scale == "type4":
            scale = self.point3(out)
            scale = scale * scale
            scale = scale.reshape(tgt_len, -1, 3).unsqueeze(1)
            distance = torch.cat([distance, torch.prod(distance, dim=-1, keepdim=True)], dim=-1)
            distance = (distance * scale).sum(-1)
        # generate Gaussian-like weight map
        gaussian = -(distance - 0).abs() / self.smooth
    

    From these operations, it seems that G(i, j) as written in the paper cannot be obtained directly, and I don't understand what these steps mean.

    4. In addition, the paper says that log G_i should be added when generating the co-attention weight map in the decoder's co-attention, but the code does:

        attn_output_weights = attn_output_weights + gaussian[0].permute(2, 0, 1)
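
    For orientation, here is a minimal sketch (my reading of the paper's formula, not the repo's code) of how a Gaussian-like spatial prior can be built from a predicted center and scale and then added in log space, which appears to be the role the gaussian term plays above:

        import torch

        def gaussian_prior(h, w, center, scale, beta=1.0):
            # G(i, j) = exp(-((i - c_w)^2 + (j - c_h)^2) / (beta * scale^2)),
            # peaked at the predicted center and widened by the predicted scale.
            ys = torch.arange(h, dtype=torch.float32).view(-1, 1).expand(h, w)
            xs = torch.arange(w, dtype=torch.float32).view(1, -1).expand(h, w)
            dist = (xs - center[0]).pow(2) + (ys - center[1]).pow(2)
            return torch.exp(-dist / (beta * scale ** 2))

        g = gaussian_prior(32, 32, center=(10.0, 20.0), scale=4.0)  # [32, 32] map peaked at (10, 20)
        log_g = g.clamp(min=1e-8).log()  # log-space term added to the co-attention logits before softmax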

    I sincerely hope you can help me solve these problems. Thanks!

    opened by Huzhen757 33
  • About input format

    At this line of code:

    if isinstance(samples[0], (list, torch.Tensor)):
        samples[0] = nested_tensor_from_tensor_list(samples[0])
    

    Since samples is already a NestedTensor, why is nested_tensor_from_tensor_list still needed here?
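
    For context, my understanding is that nested_tensor_from_tensor_list pads a list of variable-sized image tensors into a single batched tensor plus a padding mask, roughly as in this illustrative sketch (not the repo's implementation):

        import torch

        def pad_to_nested(tensor_list):
            # tensor_list: list of [C, H_i, W_i] images with varying sizes
            max_h = max(t.shape[1] for t in tensor_list)
            max_w = max(t.shape[2] for t in tensor_list)
            b, c = len(tensor_list), tensor_list[0].shape[0]
            batched = torch.zeros(b, c, max_h, max_w, dtype=tensor_list[0].dtype)
            mask = torch.ones(b, max_h, max_w, dtype=torch.bool)  # True marks padded pixels
            for img, pad_img, m in zip(tensor_list, batched, mask):
                pad_img[:, :img.shape[1], :img.shape[2]].copy_(img)
                m[:img.shape[1], :img.shape[2]] = False
            return batched, mask

    So if samples is already a NestedTensor, calling it again would only seem necessary when samples[0] is still a raw list of tensors.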

    Also, have you tested the d2 (detectron2) version of SMCA? It doesn't work for me; the input format is quite different from the original DETR.

    opened by jinfagang 20
  • OOM for ResNet50-DC5 even for 32GB GPUs

    Hi,

    I tried to run ResNet50-DC5 for SMCA-DETR on 8 GPUs with 32 GB of memory each. It shows an OOM error when using ResNet50-DC5 with batch size 16. Why does it use so much more memory than the other models?

    opened by yformer 11
  • IndexError: list index out of range when running inference

    Hello,

    When I run a modified d2go version of the code, it raises an IndexError. Have you seen this error before?

    I wonder if my h_w is not set correctly. I'm confused about samples[0] and samples[1] (the targets): it seems h_w is a concatenation of the original image sizes, so why do we need samples[1] to obtain h_w? I changed the code to the following:

    h_w = torch.stack([torch.tensor([inst.shape[-2] for inst in samples]),
                       torch.tensor([inst.shape[-1] for inst in samples])], dim=-1)

    opened by yformer 10
  • multi_head_attention_forward: gaussian[0].permute(2, 0, 1) RuntimeError: number of dims don't match in permute

    Should h_w have shape [1, bs, 2], for example torch.Size([1, 10, 2])?

    I got this error when running your code.

    opened by jinfagang 8
  • ACT model compute flops

    Dear sir, thanks for your work on the ACT model for transformers. I tried to integrate the ACT module into my task, modifying the network following the pattern of SMCA-DETR/Adaptive_Cluster_Transformer/. My task is object detection, modified from DETR. After the ACT module was added, overall training time decreased by 2 hours, and mAP dropped by only about 0.6, as the paper says. However, when I compute FLOPs, the total FLOPs of the network increased after adding the ACT module. I suspect my FLOPs computation code is wrong; it is modified from the original DETR FLOPs calculation code, and I don't know whether the attention calculation needs any extra handling. I would appreciate it if you could share the FLOPs calculation code for the ACT module or point out what went wrong with my calculation.

    opened by OBVIOUSDAWN 5
  • Seeking help with the visualization images

    Nice work! Could you tell me how to plot the co-attention visualizations shown in Figures 2/3 of your paper? I want to visualize some of my own images with your repo. Thanks very much.

    opened by Tchuanm 4
  • Training COCO metrics look good, but validating the trained model with --eval prints only garbage values

    Thank you for the amazing work. I trained the model for 50 epochs, and during training the periodic evaluation seems reasonable and keeps improving.

    But when I run the trained model on the same validation data with the --eval argument, setting the path to "checkpoint.pth" in the d2/configs/detr_256_6_6_torchvision.yaml config, I get all zeros.

    My ground-truth val.json is in (x, y, w, h) format. After investigating further,

    I found that the DataLoader provides ground-truth boxes in normalized form, but the predicted boxes are unnormalized because of the following lines in the evaluate function of engine.py:

        orig_target_sizes = torch.stack([t["orig_size"] for t in targets], dim=0)
        results = postprocessors['bbox'](outputs, orig_target_sizes)

    Even after normalizing the predicted boxes (or unnormalizing the ground-truth boxes in the DataLoader class), I still get all zeros in the COCO detection metrics.
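
    For reference, the post-processor's conversion is roughly the following (an illustrative sketch, not the repo's exact PostProcess): normalized (cx, cy, w, h) predictions are turned into absolute (x1, y1, x2, y2) boxes using the original image sizes, which is why they end up unnormalized relative to the DataLoader's targets.

        import torch

        def cxcywh_norm_to_xyxy_abs(boxes, img_h, img_w):
            # boxes: [N, 4] normalized (cx, cy, w, h) in [0, 1]
            cx, cy, w, h = boxes.unbind(-1)
            xyxy = torch.stack([cx - 0.5 * w, cy - 0.5 * h,
                                cx + 0.5 * w, cy + 0.5 * h], dim=-1)
            return xyxy * torch.tensor([img_w, img_h, img_w, img_h], dtype=xyxy.dtype)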

    What I don't understand is that the training script calls the same function for periodic evaluation, so why does using the --eval argument on the same data produce a completely different result?

    Am I loading a different model instead of the trained one? I changed the weights field in the d2/configs/detr_256_6_6_torchvision.yaml config to the model's "checkpoint.pth".

    Please let me know if you need any additional information. Thanks.

    opened by purvang3 3
  • Question about the FLOPs in Table 1

    Hi, why are the FLOPs much smaller than DETR's? Since SMCA uses multi-level features, intuitively it should have higher FLOPs.

    Also, what operations make the lower-FLOPs model slower at inference? For example, SMCA-DC5: 153 GFLOPs, 0.100 s vs. DETR: 187 GFLOPs, 0.079 s.

    opened by tangjiuqi097 3
  • 40.38 AP for ResNet50 single level feature

    Hi,

    I trained the model on 8 nodes with 8 GPUs each, but I can only obtain 38.46 AP. Have you trained these models with multiple nodes and a larger batch size, e.g., 8 x 16?

    opened by yformer 2
  • Question about use multi-GPU for training

    Hi, I want to train your DMS_MH_GMCA_resnet50 model myself on 4 GPUs, but it reports a warning (screenshot attached).

    Will this warning affect the training accuracy? Thanks!

    opened by Huzhen757 2
  • Pre-trained Model on VG

    Hi,

    Thanks for your work. I noticed in your repository description that you may have experiments on the VG dataset, and I was wondering if you would have a pre-trained model on VG available to share. Thank you for your attention.

    Kind regards, Romero

    opened by RomeroBarata 0