[CVPR2021 Oral] UP-DETR: Unsupervised Pre-training for Object Detection with Transformers

Overview

This is the official PyTorch implementation and models for the UP-DETR paper:

@article{dai2020up-detr,
  author  = {Zhigang Dai and Bolun Cai and Yugeng Lin and Junying Chen},
  title   = {UP-DETR: Unsupervised Pre-training for Object Detection with Transformers},
  journal = {arXiv preprint arXiv:2011.09094},
  year    = {2020},
}

In UP-DETR, we introduce a novel pretext task named random query patch detection to pre-train transformers for object detection. UP-DETR inherits from DETR, with the same ResNet-50 backbone, the same Transformer encoder and decoder, and the same codebase. Since the CNN is also pre-trained without supervision (SwAV), the whole UP-DETR model requires no human annotations. UP-DETR achieves 43.1 AP on COCO with 300 epochs of fine-tuning. The AP of the open-source version is slightly higher than reported in the paper.
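
To make the pretext task concrete, here is a minimal sketch of random query patch detection: random crops from an unlabeled image serve as the queries, and their crop boxes serve as the localization targets (illustrative only; the function name and shapes are assumptions, not the repo's API).

import torch
import torchvision.transforms.functional as TF

def random_query_patches(image, num_patches=10, patch_size=(64, 64)):
    # Crop random patches from an unlabeled image (CHW tensor); the crop
    # boxes become the "labels" that the detector must re-localize.
    _, height, width = image.shape
    h, w = patch_size
    patches, boxes = [], []
    for _ in range(num_patches):
        top = torch.randint(0, height - h + 1, (1,)).item()
        left = torch.randint(0, width - w + 1, (1,)).item()
        patches.append(TF.crop(image, top, left, h, w))
        # normalized (cx, cy, w, h), the box format DETR's losses expect
        boxes.append(torch.tensor([(left + w / 2) / width,
                                   (top + h / 2) / height,
                                   w / width, h / height]))
    return torch.stack(patches), torch.stack(boxes)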

Model Zoo

We provide pre-trained UP-DETR models and UP-DETR models fine-tuned on COCO, and plan to include more in the future. The evaluation metric is the same as DETR's.

Here is the UP-DETR model pre-trained on ImageNet without labels. The CNN weights are initialized from SwAV and kept fixed during transformer pre-training:

| name | backbone | epochs | url | size | md5 |
| --- | --- | --- | --- | --- | --- |
| UP-DETR | R50 (SwAV) | 60 | model \| logs | 164Mb | 49f01f8b |

Comparison with DETR:

| name | backbone (pre-train) | epochs | box AP | url | size |
| --- | --- | --- | --- | --- | --- |
| DETR | R50 (Supervised) | 500 | 42.0 | - | 159Mb |
| DETR | R50 (SwAV) | 300 | 42.1 | - | 159Mb |
| UP-DETR | R50 (SwAV) | 300 | 43.1 | model \| logs | 159Mb |

COCO val5k evaluation results of UP-DETR can be found in this gist.

Usage - Object Detection

There are no extra compiled components in UP-DETR, and the package dependencies are the same as DETR's. We provide instructions on how to install the dependencies via conda:

git clone https://github.com/dddzg/up-detr
conda install -c pytorch pytorch torchvision
conda install cython scipy
pip install -U 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
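
After installation, an optional one-line sanity check that the dependencies resolve:

python -c "import torch, torchvision; from pycocotools.coco import COCO; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"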

UP-DETR consists of two steps: pre-training and fine-tuning. We present a model pre-trained on ImageNet and then fine-tuned on COCO.

Unsupervised Pre-training

Data Preparation

Download and extract the ILSVRC2012 train dataset.

We expect the directory structure to be the following:

path/to/imagenet/
  n06785654/  # category directory
    n06785654_16140.JPEG # images
  n04584207/  # category directory
    n04584207_14322.JPEG # images

Images can be organized in any directory layout, because our pre-training is unsupervised and does not use the category labels.
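
Because no labels are read, any loader that enumerates the image files is enough; a minimal sketch of such a dataset (hypothetical, not the repo's actual ImageNet dataset class):

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class UnlabeledImages(Dataset):
    # Recursively collects image files under `root`, in any directory layout.
    def __init__(self, root, transform=None):
        self.paths = sorted(Path(root).rglob('*.JPEG'))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        img = Image.open(self.paths[idx]).convert('RGB')
        return self.transform(img) if self.transform else img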

Pre-training

To pre-train UP-DETR on a single node with 8 GPUs for 60 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env main.py \
    --lr_drop 40 \
    --epochs 60 \
    --pre_norm \
    --num_patches 10 \
    --batch_size 32 \
    --feature_recon \
    --fre_cnn \
    --imagenet_path path/to/imagenet \
    --output_dir path/to/save_model

Since the pre-training images are relatively small, we can use a large batch size.

One epoch takes about 2 hours, so 60 epochs of pre-training take about 5 days with 8 V100 GPUs.

In a further ablation experiment, we found that object query shuffle is not helpful, so we removed it in the open-source version.

Fine-tuning

Data Preparation

Download and extract the COCO 2017 train and val datasets.

The directory structure is expected as follows:

path/to/coco/
  annotations/  # annotation json files
  train2017/    # train images
  val2017/      # val images

Fine-tuning

To fine-tune UP-DETR on a single node with 8 GPUs for 300 epochs, run:

python -m torch.distributed.launch --nproc_per_node=8 --use_env detr_main.py \
    --lr_drop 200 \
    --epochs 300 \
    --lr_backbone 5e-4 \
    --pre_norm \
    --coco_path path/to/coco \
    --pretrain path/to/save_model/checkpoint.pth

The per-epoch fine-tuning cost is exactly the same as DETR's: about 28 minutes with 8 V100 GPUs, so 300 epochs of training take about 6 days.
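
Note that the pre-training checkpoint contains pre-training-only heads (e.g. the feature_align and patch2query weights visible in the released checkpoints) that plain DETR does not have. If you load such a checkpoint by hand instead of via --pretrain, here is a hedged sketch of filtering those keys first (the model construction below is a stand-in, not the repo's loading logic):

import torch

# Stand-in for the DETR model built in detr_main.py; DETR's torch.hub entry
# point is used here only to keep the sketch self-contained.
model = torch.hub.load('facebookresearch/detr', 'detr_resnet50', pretrained=False)

checkpoint = torch.load('path/to/save_model/checkpoint.pth', map_location='cpu')
# Drop the pre-training-only heads before loading into plain DETR.
state_dict = {k: v for k, v in checkpoint['model'].items()
              if not k.startswith(('feature_align.', 'patch2query'))}
# strict=False reports remaining mismatches instead of raising, e.g. the
# transformer.encoder.norm keys that only exist when --pre_norm is used.
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print('missing:', missing)
print('unexpected:', unexpected)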

The model can also be extended to panoptic segmentation; check DETR for more details.

Notebook

We provide a Colab notebook to reproduce the visualization results in the paper:

  • Visualization Notebook: This notebook shows how to perform query patch detection with the pre-trained model (without any fine-tuning on annotations).

License

UP-DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Comments
  • How to support batch learning for one-shot object detection training?

So, in the paper you suggest training UP-DETR for the task of one-shot object detection, and you provide interesting results on VOC.

As you don't seem to provide any code in this GitHub repo related to one-shot object detection training (please correct me if I'm mistaken), I tried to implement it myself. That said, I ran into an obstacle when it came to supporting batch learning: if we have a minibatch of N target images, each of them has a corresponding query patch, so there are N query patches in the minibatch. How would you apply GAP and add the features of these N query patches to the object queries in the decoder? It doesn't seem correct to add the features of the i-th query patch to the object queries while forwarding a batch containing the j-th target image through the decoder (where the j-th target image isn't related to the i-th query object).

So, my question is: were you able to support batch learning for one-shot object detection? If so, how?
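
One way to keep the per-image patch-query pairing in a batch is plain broadcasting, so the i-th patch feature is added only to the object queries decoded together with the i-th image. A minimal sketch under assumed shapes (not the official one-shot code):

import torch

batch, num_queries, dim = 4, 100, 256
object_queries = torch.randn(num_queries, dim)  # shared learnable queries
patch_feats = torch.randn(batch, dim)           # one GAP'd query patch per image

# Broadcasting over dim 1 pairs each image with its own patch feature.
queries = object_queries.unsqueeze(0) + patch_feats.unsqueeze(1)
assert queries.shape == (batch, num_queries, dim)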

    opened by JosephAssaker 8
  • num_classes

Hi, why did you set the number of categories to 2 in the code? Can I set it to 1 or any integer in the pre-training stage? Any advice is greatly appreciated.

    if args.dataset_file=="ImageNet": num_classes = 2

    opened by rgbd-zml 6
  • Cannot reproduce the author's results with the pre-trained models

    Hi there,

    I'm currently experimenting with some Few/One/Zero-Shot for object detection and classification. For one of the tasks, your paper has been experimented with.

Unfortunately, I haven't been able to reproduce your results with the pre-trained models you have made available. I also noticed that the inference code you made available does not work out of the box. To support my points, here are some details:

1. At the moment it is not possible to use the latest PyTorch with the latest TorchVision. The latter should be pinned to version 0.9.0.
    2. For the ImageNet pre-trained model
    • In your code samples, you use only 6 patches, but the model has been trained with the default 100 queries and 10 patches. The README file needs adjustments.

    Results:

    • ImageNet pre-trained model (I duplicated some patches to make sure I had 10, same kittens image used): patches and detections screenshots attached.

    • COCO pre-trained model (custom image used): patches and detections screenshots attached.

    Hardware used

    • MacBook Air M1
    • NVIDIA GeForce RTX2080i

    Yeah, I tried with both CPU and a CUDA compliant device.

Are you sure you have uploaded the right checkpoint files?

Thanks in advance; looking forward to hearing from you.

    opened by wilderrodrigues 6
  • Some questions about your code

Hi, I'm very interested in your work on building new object queries for the Transformer decoder from patches cropped from the original images, but when I debug the code, it reports an error (screenshot attached).

In the code, I didn't find anything about the generation of the patches or the call to the forward pass, although the forward function of UP-DETR needs the patch inputs. Besides, I use the COCO 2017 train set as the pre-training dataset, and I find the fine-tuning process is exactly the same as DETR's. So I want to study the pre-training process; in other words, I want to see how UP-DETR works, especially in the decoder part.

I sincerely hope you can suggest some solutions. Thanks!

    opened by Huzhen757 5
  • The trained model does not converge

(image attached) I trained for 170 epochs on my own dataset (already converted to COCO format, single-class detection); the loss basically does not decrease, and the validation-set AP is also 0. Training command: python -m torch.distributed.launch --nproc_per_node=1 --use_env detr_main.py --lr_drop 200 --epochs 300 --lr_backbone 5e-4 --pre_norm --coco_path /home/work/mnt/project/up-detr/data/coco --pretrain /home/work/mnt/project/up-detr/checkpoints/up-detr-pre-training-60ep-imagenet.pth

    opened by secortot 5
  • What file does "files" in def plot_precision_recall(files, naming_scheme='iter') refer to, and which file do I need?

Hello, I see that plot_utils.py can plot PR curves, but it requires a "files" argument. What does this "files" refer to? https://github.com/dddzg/up-detr/blob/97fee88358ad2bdfcc6e3d4fa6892b4600fae089/util/plot_utils.py#L83 Is it the ./outputs/log.txt file? Or is it one of the ./outputs/checkpoint.pth files? Or do I need the whole ./outputs folder?

    Thank you.

    opened by zxsitu 4
  • How to get the inference time (speed) ?

❓ How can I get the inference time or speed of UP-DETR?

I want to make some comparisons between models; how can I export (print) the inference time when testing or evaluating UP-DETR?
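
The repo does not document a built-in speed flag, but a generic PyTorch timing sketch (model and input names are placeholders) is to synchronize CUDA around the forward pass:

import time
import torch

@torch.no_grad()
def average_latency(model, images, warmup=5, iters=20):
    model.eval()
    for _ in range(warmup):        # warm-up excludes one-time setup costs
        model(images)
    if torch.cuda.is_available():
        torch.cuda.synchronize()   # wait for queued GPU kernels to finish
    start = time.time()
    for _ in range(iters):
        model(images)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.time() - start) / iters  # average seconds per forward pass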

    opened by zxsitu 4
  • Unexpected keys in dict when running evaluation

How can I evaluate the provided model? I'm trying to use DETR's recipe and I'm getting a checkpoint loading error.

    Thank you!

    DATASETROOT=$PWD/path/to/coco/
    
    UPDETRCKPTURL='https://drive.google.com/file/d/1_YNtzKKaQbgFfd6m2ZUCO6LWpKqd7o7X'
    UPDETRCKPT='up-detr-coco-fine-tuned-300ep.pt'
    
    git clone https://github.com/dddzg/up-detr
    cd up-detr
    
    GOOGLE_DRIVE_FILE_ID=$(echo $UPDETRCKPTURL | rev | cut -d'/' -f1 | rev)
    CONFIRM=$(wget --quiet --save-cookies googlecookies.txt --keep-session-cookies --no-check-certificate "https://docs.google.com/uc?export=download&id=$GOOGLE_DRIVE_FILE_ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')
    wget -O $UPDETRCKPT --load-cookies googlecookies.txt "https://docs.google.com/uc?export=download&confirm=$CONFIRM&id=$GOOGLE_DRIVE_FILE_ID"
    
    python detr_main.py --batch_size 2 --no_aux_loss --eval --resume $UPDETRCKPT --coco_path $DATASETROOT
    
    Not using distributed mode
    git:
      sha: 00be9b996f52324335e0cc3fe7a59bfba9f43540, status: clean, branch: master
    
    Namespace(aux_loss=False, backbone='resnet50', batch_size=2, bbox_loss_coef=5, clip_max_norm=0.1, coco_panoptic_path=None, coco_path='/specific/netapp5_2/gamir/lab/vadim/foo/../selfsupslots/data/common/coco/', dataset_file='coco', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_url='env://', distributed=False, dropout=0.1, enc_layers=6, eos_coef=0.1, epochs=300, eval=True, frozen_weights=None, giou_loss_coef=2, hidden_dim=256, lr=0.0001, lr_backbone=1e-05, lr_drop=200, mask_loss_coef=1, masks=False, nheads=8, num_queries=100, num_workers=2, output_dir='', position_embedding='sine', pre_norm=False, pretrain='', remove_difficult=False, resume='up-detr-coco-fine-tuned-300ep.pt', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, start_epoch=0, weight_decay=0.0001, world_size=1)
    number of params: 41302368
    loading annotations into memory...
    
    Done (t=32.57s)
    creating index...
    index created!
    loading annotations into memory...
    Done (t=4.40s)
    creating index...
    index created!
    Traceback (most recent call last):
      File "detr_main.py", line 267, in <module>
        main(args)
      File "detr_main.py", line 197, in main
        model_without_ddp.load_state_dict(checkpoint['model'])
      File ".../vadim/prefix/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict
        raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
    RuntimeError: Error(s) in loading state_dict for DETR:
            Unexpected key(s) in state_dict: "transformer.encoder.norm.weight", "transformer.encoder.norm.bias".
    
    opened by vadimkantorov 4
  • Question about experiments on one-shot object detection

Hi~ UP-DETR is an interesting work. I wonder if you experimented with the COCO dataset for one-shot object detection. If you have done experiments on COCO, would you mind providing the results? Thanks a lot~

    opened by suilin0432 3
  • Random Crop

Can we randomly crop a patch from another image and paste it onto the training picture, and also use the randomly cropped patch as a pseudo-label, that is, find the cropped block in the original image?

    opened by DoublePan-Oh 2
  • Class Loss

(screenshot attached) I didn't find this parameter in your code. Can you tell me which one it is?

As I understand it, the CNN backbone does not participate in training, but is only used to extract image features. Can the CNN and transformer be separated? For example, first use a ResNet to extract image features, and then randomly crop patches at the feature level; I mean starting from the features. I plan to use this idea for video tasks, but I can't directly manipulate the video itself; I can only start from the video features. I don't know if this is possible.
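
For what it's worth, cropping at the feature level is commonly done with RoIAlign rather than pixel crops; a small sketch using torchvision.ops.roi_align (shapes and coordinates are made up for illustration):

import torch
from torchvision.ops import roi_align

feats = torch.randn(2, 256, 32, 32)  # (B, C, H', W') backbone feature maps
# Each row is (batch_index, x1, y1, x2, y2) in feature-map coordinates.
rois = torch.tensor([[0., 4., 4., 12., 12.],
                     [1., 8., 2., 20., 14.]])
patch_feats = roi_align(feats, rois, output_size=(7, 7))  # -> (2, 256, 7, 7)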

    opened by DoublePan-Oh 2
  • Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

Excuse me, I fine-tuned on my own dataset and ran evaluation.

This is my warning after evaluation in PyCharm (Win 10), with pytorch==1.12.1, torchvision==0.13.1, cuda==11.7, on a 3070 Ti:

    Unexpected key(s) in state_dict: "feature_align.layers.0.weight", "feature_align.layers.0.bias", "feature_align.layers.1.weight", "feature_align.layers.1.bias", "patch2query.weight", "patch2query.bias".

I don't know how to solve this problem. I tested the following methods: 1. popping these weights and biases before fine-tuning, but the evaluation result is 0 (yes, all IoU values are 0); 2. popping these weights and biases after fine-tuning; all IoU values are also 0.

    Please give me some advice

    opened by Zoeun 0
  • Getting access to the one-shot object detection training code

    Hello there!

As the code for the one-shot object detection task is not available in this repository, would there be any way to access it? If not, would it be possible for you to share this code with me?

I tried to re-implement the ideas presented in your paper on top of DETR, but was unsuccessful in replicating the results shown in the paper. In fact, I was not able to build a model that "learns", as the loss remains high throughout training without ever showing a consistent downward trend.

What I've done in detail is the following: I took DETR's architecture and added the queries as input. I passed the queries through the same backbone CNN as the target image, forwarded the resulting embedding to an average pooling layer to reduce the H*W dimensions to 1 (nn.AdaptiveAvgPool2d((1, 1))), then forwarded the resulting vector to a projection linear layer (nn.Linear(backbone.num_channels, hidden_dim)) to project the features from an N-dimensional space to an M-dimensional space (where N is the channel dimension of the CNN backbone and M is the dimension inside the encoder-decoder transformer). Finally, I repeated the resulting vector X times (X being the number of object queries in the architecture) and added that to the object query vectors (according to our discussion in #24).

    My goal was to replicate the results (shown below) of "DETR" (without pretraining) in your paper for one-shot object detection on PASCAL VOC.

(screenshot of the paper's one-shot detection results table attached)

Unfortunately, I was not able to replicate these results, and in fact never obtained a converging model that learned the task at all (the loss is always high and oscillating). I tried various backbone learning rates, such as 1e-4, 5e-5, 1e-5, and 0, and all resulted in approximately the same outcome. Lastly, I also tried adding your proposed feature reconstruction loss to my code (both with backbone lr = 0 and > 0), but that didn't help either.

    Thank you for your time, and I'm looking forward to hearing back from you!

    opened by JosephAssaker 0
  • A blog about UP-DETR

    Hi authors,

    This is not about an issue :smile:

    UP-DETR is great! I just wrote a blog about it, see here: https://medium.com/analytics-vidhya/up-detr-unsupervised-pre-training-for-object-detection-with-transformers-paper-explained-84611e27a144

You could consider putting a link to it in your README, so that readers can understand the paper even more easily :smile:

    Thanks! Best regards

    opened by HaoWei-TomTom 4