[CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers


This is the official PyTorch implementation and models for the UP-DETR paper:

  author  = {Zhigang Dai and Bolun Cai and Yugeng Lin and Junying Chen},
  title   = {UP-DETR: Unsupervised Pre-training for Object Detection with Transformers},
  journal = {arXiv preprint arXiv:2011.09094},
  year    = {2020},

In UP-DETR, we introduce a novel pretext task named random query patch detection to pre-train transformers for object detection. UP-DETR inherits from DETR with the same ResNet-50 backbone, the same Transformer encoder and decoder, and the same codebase. With an unsupervised pre-trained CNN, the whole UP-DETR model doesn't require any human annotations. UP-DETR achieves 43.1 AP on COCO with 300 epochs of fine-tuning. The AP of the open-source version is slightly higher than the one reported in the paper.
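
As a rough illustration of the pretext task (not the repository's actual code), the sketch below samples random patches from an image and keeps each patch's box as the target to be predicted. The function name `sample_query_patches`, the patch size range, and the normalized (cx, cy, w, h) box format are assumptions made for this example.

```python
import random


def sample_query_patches(img_w, img_h, num_patches=10,
                         min_frac=0.1, max_frac=0.5, seed=None):
    """Sample random patch boxes and return a list of (crop, target) pairs.

    crop   = (left, top, right, bottom) in pixels, used to cut the query patch
    target = (cx, cy, w, h) normalized to [0, 1], the box the model should predict
    """
    rng = random.Random(seed)
    patches = []
    for _ in range(num_patches):
        # Patch size is a random fraction of the image size (hypothetical range).
        w = max(1, int(rng.uniform(min_frac, max_frac) * img_w))
        h = max(1, int(rng.uniform(min_frac, max_frac) * img_h))
        # Pick a top-left corner so that the patch stays inside the image.
        left = rng.randint(0, img_w - w)
        top = rng.randint(0, img_h - h)
        crop = (left, top, left + w, top + h)
        target = ((left + w / 2) / img_w, (top + h / 2) / img_h,
                  w / img_w, h / img_h)
        patches.append((crop, target))
    return patches


if __name__ == "__main__":
    for crop, target in sample_query_patches(640, 480, num_patches=3, seed=0):
        print(crop, [round(v, 3) for v in target])
```

No labels are involved: the targets come from the crop coordinates themselves, which is why the task needs no human annotation.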


Model Zoo

We provide the pre-trained UP-DETR model and the UP-DETR model fine-tuned on COCO, and plan to include more in the future. The evaluation metric is the same as DETR's.

Here is the UP-DETR model pre-trained on ImageNet without labels. The CNN weights are initialized from SwAV and kept fixed during the transformer pre-training:

name     backbone    epochs  url           size    md5
UP-DETR  R50 (SwAV)  60      model | logs  164 MB  49f01f8b
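
A downloaded checkpoint can be checked against the md5 column with a few lines of Python. This is a minimal sketch that assumes the listed value is the first 8 hex characters of the file's MD5 digest; the helper name `md5_matches` and the checkpoint path in the usage line are hypothetical.

```python
import hashlib


def md5_matches(path, expected_prefix, chunk_size=1 << 20):
    """Return True if the file's MD5 hex digest starts with expected_prefix."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so that large checkpoints do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(expected_prefix.lower())
```

For example, `md5_matches("path/to/checkpoint.pth", "49f01f8b")` should return True for an intact download if the assumption about the prefix holds.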

Comparison with DETR:

name     backbone (pre-train)  epochs  box AP  url           size
DETR     R50 (Supervised)      500     42.0    -             159 MB
DETR     R50 (SwAV)            300     42.1    -             159 MB
UP-DETR  R50 (SwAV)            300     43.1    model | logs  159 MB

COCO val5k evaluation results of UP-DETR can be found in this gist.

Usage - Object Detection

UP-DETR has no extra compiled components, and its package dependencies are the same as DETR's. The dependencies can be installed via conda as follows:

git clone tbd
conda install -c pytorch pytorch torchvision
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'

UP-DETR follows two steps: pre-training and fine-tuning. We present the model pre-trained on ImageNet and then fine-tuned on COCO.

Unsupervised Pre-training

Data Preparation

Download and extract the ILSVRC2012 training set.

We expect the directory structure to be the following:

  n06785654/  # category directory
    n06785654_16140.JPEG # images
  n04584207/  # category directory
    n04584207_14322.JPEG # images

Images can be organized in any order, because our pre-training is unsupervised and does not use the category labels.
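
Because the directory an image sits in carries no meaning here, a loader only needs the flat list of image files. The sketch below shows one way to collect them; the helper name `list_images` and the accepted extensions are assumptions for illustration, not the repository's dataset code.

```python
from pathlib import Path


def list_images(root, extensions=(".jpeg", ".jpg", ".png")):
    """Return all image files under root, sorted, ignoring the directory layout."""
    root = Path(root)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

Calling `list_images("path/to/imagenet")` returns the same set of files however they are grouped into subdirectories.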


To pre-train UP-DETR on a single node with 8 GPUs for 60 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --lr_drop 40 \
    --epochs 60 \
    --pre_norm \
    --num_patches 10 \
    --batch_size 32 \
    --feature_recon \
    --fre_cnn \
    --imagenet_path path/to/imagenet \
    --output_dir path/to/save_model

Since the pre-training images are relatively small, we can use a large batch size.

One epoch takes about 2 hours, so 60 epochs of pre-training take about 5 days with 8 V100 GPUs.
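
The numbers above can be checked with a short calculation. The effective batch size assumes that `--batch_size` is counted per GPU, as in DETR; that is an assumption here, not something stated in this README.

```python
hours_per_epoch = 2      # stated above
epochs = 60              # --epochs 60
total_hours = hours_per_epoch * epochs
total_days = total_hours / 24
print(total_hours, total_days)  # 120 5.0

gpus = 8                 # --nproc_per_node=8
batch_size_per_gpu = 32  # --batch_size 32, assumed to be per GPU
effective_batch_size = gpus * batch_size_per_gpu
print(effective_batch_size)  # 256
```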

In a further ablation experiment, we found that object query shuffle is not helpful, so we removed it from the open-source version.


Fine-tuning

Data Preparation

Download and extract the COCO 2017 train and val images with their annotations.

The directory structure is expected as follows:

  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images


To fine-tune UP-DETR with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env detr_main.py \
    --lr_drop 200 \
    --epochs 300 \
    --lr_backbone 5e-4 \
    --pre_norm \
    --coco_path path/to/coco \
    --pretrain path/to/save_model/checkpoint.pth

The fine-tuning cost is exactly the same as DETR's: about 28 minutes per epoch with 8 V100 GPUs. So, 300 epochs of training take about 6 days (300 × 28 minutes = 140 hours).
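
The `--lr_drop 200` flag in the command above can be read as a step schedule. The sketch below assumes, as in the DETR codebase that UP-DETR inherits from, that each learning rate is multiplied by 0.1 once every `lr_drop` epochs (the default decay factor of PyTorch's `StepLR`); the helper `step_lr` is written for this example and is not part of the repository.

```python
def step_lr(base_lr, epoch, lr_drop, gamma=0.1):
    """Learning rate at a given epoch under a step schedule with period lr_drop."""
    return base_lr * gamma ** (epoch // lr_drop)


if __name__ == "__main__":
    # --lr_backbone 5e-4 with --lr_drop 200 over 300 epochs
    for epoch in (0, 199, 200, 299):
        print(epoch, step_lr(5e-4, epoch, lr_drop=200))
```

Under this assumption the backbone trains at 5e-4 for the first 200 epochs and at roughly 5e-5 for the remaining 100.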

The model can also be extended to panoptic segmentation; see DETR for more details.


We provide a Colab notebook that reproduces the visualization results in the paper:

  • Visualization Notebook: This notebook shows how to perform query patch detection with the pre-trained model (without any fine-tuning on annotations).



UP-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

  • How to support batch learning for one-shot object detection training?


    So in the paper you suggest training UP-DETR for the task of one-shot object detection and provided interesting results on VOC.

    As you don't seem to provide any code in this github related to the one-shot object detection training (please correct me if I'm mistaken), I tried to implement it myself. That being said, I confronted an obstacle when it came to supporting batch learning. This is because, if we have a minibatch of N target images, each of them will have a corresponding query patch, so a total of N query patches in this minibatch. How would you apply GAP and add the features of these N query patches to the object queries in the decoder? It doesn't seem to me that adding the features of the ith query patch to the object queries while forwarding a batch containing the jth target image through the decoder (where the jth target image isn't related to the ith query object) is the correct thing to do.

    So, my question is, were you able to support batch learning for one-shot object detection? If so, how?

    opened by JosephAssaker 8
  • num_classes


    Hi, why did you set the number of categories to 2 in the code? Can I set it to 1 or any other integer in the pre-training stage? Any advice is greatly appreciated.

    if args.dataset_file=="ImageNet": num_classes = 2

    opened by rgbd-zml 6
  • Cannot reproduce the author's results with the pre-trained models


    Hi there,

    I'm currently experimenting with some Few/One/Zero-Shot methods for object detection and classification. For one of the tasks, I have been experimenting with your paper.

    Unfortunately, I haven't been able to reproduce your results with the pre-trained models you have made available. I also noticed that the inference code you made available does not work out of the box. To support my points, here are some details:

    1. At the moment it is not possible to use the latest PyTorch with the latest TorchVision. The latter should be pinned to version 0.9.0.
    2. For the ImageNet pre-trained model
    • In your code samples, you use 6 patches only, but the model has been trained with the default 100 queries and 10 patches. The README file needs adjustments.


    • ImageNet pre-trained model (I duplicated some patches to make sure I had 10, same kittens image used)

    Patches image

    Detections image

    • COCO pre-trained model (custom image used)

    Patches image

    Detections image

    Hardware used

    • MacBook Air M1
    • NVIDIA GeForce RTX2080i

    Yeah, I tried with both CPU and a CUDA compliant device.

    Are you sure you have uploaded the right checkpoint files?

    Thanks in advance and looking to hear from you.

    opened by wilderrodrigues 6
  • Some questions about your code


    Hi, I'm very interested in your work on the new object queries in the Transformer decoder, built from patches cropped from the original images. But when I debug the code, it reports an error like this: 6aad2771224a98ff46a22c2d74df0db

    In the code, I couldn't find where the patches are generated or where the forward pass is called, even though the forward function of UP-DETR needs the patches as input. Besides, I use the COCO2017 train dataset for pre-training, and I find the fine-tuning process is exactly the same as DETR's. So I want to study the pre-training process; in other words, I want to see how UP-DETR works, especially in the decoder part.

    I sincerely hope you can give some solutions, Thanks !

    opened by Huzhen757 5
  • Training does not converge


    (image) I trained for 170 epochs on my own training set (already converted to COCO format, single-class detection), but the loss barely decreases and the AP on the validation set is also 0. Training command: python -m torch.distributed.launch --nproc_per_node=1 --use_env detr_main.py --lr_drop 200 --epochs 300 --lr_backbone 5e-4 --pre_norm --coco_path /home/work/mnt/project/up-detr/data/coco --pretrain /home/work/mnt/project/up-detr/checkpoints/up-detr-pre-training-60ep-imagenet.pth

    opened by secortot 5
  • What file does "files" in def plot_precision_recall(files, naming_scheme='iter') refer to and which file do I need?

    Hello, I see that plot_utils.py can plot PR curves, but it requires a "files" argument. What does "files" refer to? https://github.com/dddzg/up-detr/blob/97fee88358ad2bdfcc6e3d4fa6892b4600fae089/util/plot_utils.py#L83 Is it the ./outputs/log.txt file? Or is it any of the ./outputs/checkpoint.pth files? Or do I need the whole ./outputs folder?

    Thank you.

    opened by zxsitu 4
  • How to get the inference time (speed) ?


    ❓ How to get the inference time or speed using UP-DETR?

    I want to compare several models. How can I export (print) the inference time when testing or evaluating UP-DETR?

    opened by zxsitu 4
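
A generic way to measure per-call inference time is sketched below. It is not a feature of this repository: the helper `average_latency` is hypothetical, and `fn` stands for whatever callable runs the model on one input. On a GPU, the timed callable would also need to wait for pending CUDA work (for example with `torch.cuda.synchronize()`) before the clock is read, otherwise the numbers are too optimistic.

```python
import time


def average_latency(fn, inputs, warmup=2):
    """Return the mean seconds per call of fn over inputs, after warm-up calls."""
    inputs = list(inputs)
    if not inputs:
        raise ValueError("inputs must not be empty")
    for x in inputs[:warmup]:
        fn(x)  # warm-up calls are not timed
    start = time.perf_counter()
    for x in inputs:
        fn(x)
    return (time.perf_counter() - start) / len(inputs)


if __name__ == "__main__":
    # Stand-in workload; replace the lambda with a call into the model.
    seconds = average_latency(lambda n: sum(range(n)), [100_000] * 20)
    print(f"{seconds * 1000:.3f} ms per call")
```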
  • Unexpected keys in dict when running evaluation


    How can I evaluate the provided model? I'm trying to use the DETR's recipe and getting a checkpoint loading error.

    Thank you!

    git clone https://github.com/dddzg/up-detr
    cd up-detr
    GOOGLE_DRIVE_FILE_ID=$(echo $UPDETRCKPTURL | rev | cut -d'/' -f1 | rev)
    CONFIRM=$(wget --quiet --save-cookies googlecookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$GOOGLE_DRIVE_FILE_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
    wget -O $UPDETRCKPT --load-cookies googlecookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$GOOGLE_DRIVE_FILE_ID"
    python main_detr.py --batch_size 2 --no_aux_loss --eval --resume $UPDETRCKPT --coco_path $DATASETROOT
    Not using distributed mode
      sha: 00be9b996f52324335e0cc3fe7a59bfba9f43540, status: clean, branch: master
    Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/specific/netapp5_2/gamir/lab/vadim/foo/../selfsupslots/data/common/coco/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=True, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, pretrain='', remove_difficult=False, resume='up-detr-coco-fine-tuned-300ep.pt', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
    number of params: 41302368
    loading annotations into memory...
    Done (t=32.57s)
    creating index...
    index created!
    loading annotations into memory...
    Done (t=4.40s)
    creating index...
    index created!
    Traceback (most recent call last):
      File "detr_main.py", line 267, in <module>
      File "detr_main.py", line 197, in main
      File ".../vadim/prefix/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for DETR:
            Unexpected key(s) in state_dict: "transformer.encoder.norm.weight", "transformer.encoder.norm.bias".
    opened by vadimkantorov 4
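
When a checkpoint fails to load like this, comparing the checkpoint's keys with the model's keys shows exactly where the two architectures differ. The sketch below works on plain lists of key names; the helper `diff_state_dict_keys` is hypothetical, and with PyTorch the names would come from the `keys()` of the two state dicts. One plausible cause in this log, not confirmed in the thread, is that the Namespace shows `pre_norm=False` while the training commands in this README pass `--pre_norm`.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Return (unexpected, missing) as sorted lists of key names.

    unexpected: keys present in the checkpoint but not in the model
    missing:    keys the model expects but the checkpoint lacks
    """
    checkpoint_keys, model_keys = set(checkpoint_keys), set(model_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    missing = sorted(model_keys - checkpoint_keys)
    return unexpected, missing


if __name__ == "__main__":
    # Toy key lists standing in for real state dicts.
    ckpt = ["backbone.weight", "transformer.encoder.norm.weight"]
    model = ["backbone.weight"]
    print(diff_state_dict_keys(ckpt, model))
```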
  • Question about experiment on one-shot object detection

    Hi~ UP-DETR is an interesting work. I wonder if you experimented with one-shot object detection on the COCO dataset. If you have done experiments on COCO, could you provide the results? Thanks a lot~

    opened by suilin0432 3
  • Random Crop


    Can we randomly crop a patch from another image and paste it onto the training picture, and also use the randomly cropped patch as a pseudo-label, that is, find the cropped block in the original image?

    opened by DoublePan-Oh 2
  • Class Loss


    image I didn't find this parameter in your code. Can you tell me which one?

    As far as I know, the CNN backbone does not participate in training and is only used to extract image features. Can the CNN and the transformer be separated? For example, first use ResNet to extract image features, and then randomly crop patches at the feature level. I mean starting from features. I plan to use this idea for video tasks, but I can't directly manipulate the video itself; I can only start from the video features. I don't know if this is possible.

    opened by DoublePan-Oh 2
  • Unexpected key(s) in state_dict:

    Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

    Excuse me, I am fine-tuning on my own dataset and then running evaluation.

    This is the warning I get after evaluation in PyCharm (Windows 10), pytorch==1.12.1, torchvision==0.13.1, cuda==11.7, 3070ti:

    Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

    I don't know how to solve this problem. I tried the following methods: 1. Pop these weights and biases before fine-tuning, but the evaluation result is 0; yes, all IoU values are 0. 2. Pop these weights and biases after fine-tuning; all IoU values are 0 again.

    Please give me some advice

    opened by Zoeun 0
  • Getting access to the one-shot object detection training code


    Hello there!

    As the code for the one-shot object detection task is not available in this repository, would there be any way to access it? If not would it be possible for you to share with me this code?

    I tried to re-implement the ideas presented in your paper on top of DETR, but was unsuccessful in replicating the results shown in the paper. In fact, I was not able to build a model that "learns", as the loss remains high throughout the training without ever showing a consistent downwards trend.

    What I've done in detail is the following: I took DETR's architecture, added to it the queries as input, passed the queries through the same backbone CNN as the target image, forwarded the resulting embedding to an average pooling layer to reduce the H*W dimensions to 1 (nn.AdaptiveAvgPool2d((1, 1))), forwarded the resulting vector to a projection linear layer (nn.Linear(backbone.num_channels, hidden_dim)) to project the features from an N-dimensional space to an M-dimensional space (where N is the channels dimension of the CNN backbone and M is the dimension within the encoder-decoder transformers), and finally, repeated the resulting vector X times (X being the number of object queries in the architecture) and added that to the object queries vectors (according to our discussion in #24 ).

    My goal was to replicate the results (shown below) of "DETR" (without pretraining) in your paper for one-shot object detection on PASCAL VOC.

    2022-07-18 09_38_03-2011 09094 pdf

    Unfortunately, I was not able to replicate these results, and in fact have not had a converging model that learned the task at all (loss is always high and oscillating). I tried various backbone learning rates, such as 1e-4, 5e-5, 1e-5, and 0, and all gave approximately the same results. Lastly, I also tried adding your proposed feature reconstruction loss to my code (both with backbone lr = 0 and > 0), but that didn't help either.

    Thank you for your time, and I'm looking forward to hearing back from you!

    opened by JosephAssaker 0
  • A blog about UP-DETR


    Hi authors,

    This is not about an issue :smile:

    UP-DETR is great! I just wrote a blog about it, see here: https://medium.com/analytics-vidhya/up-detr-unsupervised-pre-training-for-object-detection-with-transformers-paper-explained-84611e27a144

    You could consider putting its link in your README, so that readers can understand it even more easily :smile:

    Thanks! Best regards

    opened by HaoWei-TomTom 4
MSc student at SCUT
