TOOD: Task-aligned One-stage Object Detection (ICCV 2021 Oral)


Paper

Introduction

One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks. In this work, we propose Task-aligned One-stage Object Detection (TOOD), which explicitly aligns the two tasks in a learning-based manner. First, we design a novel Task-aligned Head (T-Head) which offers a better balance between learning task-interactive and task-specific features, as well as greater flexibility to learn the alignment via a task-aligned predictor. Second, we propose Task Alignment Learning (TAL) to explicitly pull closer (or even unify) the optimal anchors for the two tasks during training, via a designed sample assignment scheme and a task-aligned loss. Extensive experiments are conducted on MS-COCO, where TOOD achieves 51.1 AP at single-model single-scale testing. This surpasses recent one-stage detectors such as ATSS (47.7 AP), GFL (48.2 AP), and PAA (49.0 AP) by a large margin, with fewer parameters and FLOPs. Qualitative results also demonstrate the effectiveness of TOOD in better aligning the tasks of object classification and localization.
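
To make TAL concrete, here is a minimal sketch of its anchor alignment metric (an illustration, not the repository's exact assigner code); the default exponents alpha=1 and beta=6 match the train_cfg shown in the config later on this page.

import torch

def task_alignment_metric(cls_scores, ious, alpha=1.0, beta=6.0):
    # t = s^alpha * u^beta, computed per (anchor, gt) pair:
    #   cls_scores: (num_anchors, num_gts) predicted scores for each gt's class
    #   ious:       (num_anchors, num_gts) IoUs between predicted boxes and gts
    # A large t requires an anchor to do well on *both* tasks at once; TAL picks
    # the top-k anchors per gt under this metric as positive samples.
    return cls_scores.pow(alpha) * ious.pow(beta)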

Method overview

[Figure: Parallel head vs. T-Head]

[Figure: TOOD method overview]
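
As a rough illustration of the T-Head idea (a sketch of layer attention only, not the repository's exact TOODHead module): the task-interactive features produced by the N stacked convs are re-weighted per layer, i.e., per group of channels, rather than per individual channel as in ordinary channel attention.

import torch
import torch.nn as nn

class LayerAttentionSketch(nn.Module):
    # Computes one scalar weight per stacked layer (N weights in total) and
    # scales each layer's (B, C, H, W) feature map by its weight.
    def __init__(self, channels, num_layers=6):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels * num_layers, channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, num_layers, 1),
            nn.Sigmoid())

    def forward(self, feats):
        # feats: list of N tensors, each (B, C, H, W), from the stacked convs
        x = torch.cat(feats, dim=1)                      # (B, N*C, H, W)
        w = self.attn(x.mean(dim=(2, 3), keepdim=True))  # (B, N, 1, 1)
        return [f * w[:, i:i + 1] for i, f in enumerate(feats)]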

Prerequisites

  • MMDetection version 2.14.0.

  • Please see get_started.md for installation and the basic usage of MMDetection.
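
A quick way to confirm the environment (a minimal sketch; it assumes MMDetection and its dependencies were installed following get_started.md):

import torch
import mmcv
import mmdet

# This repository targets MMDetection 2.14.0.
print('torch:', torch.__version__)
print('mmcv:', mmcv.__version__)
print('mmdet:', mmdet.__version__)  # expected: 2.14.0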

Train

# Assume that you are under the root directory of this project,
# that you have activated your virtual environment if needed,
# and that the COCO dataset is in 'data/coco/'.

./tools/dist_train.sh configs/tood/tood_r50_fpn_1x_coco.py 4
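
The same run can also be launched from Python; below is a hedged sketch that mirrors what MMDetection 2.x's tools/train.py does for a single GPU (helper signatures may vary slightly across versions):

from mmcv import Config
from mmdet.apis import train_detector
from mmdet.datasets import build_dataset
from mmdet.models import build_detector

cfg = Config.fromfile('configs/tood/tood_r50_fpn_1x_coco.py')
cfg.work_dir = './work_dirs/tood_r50_fpn_1x_coco'  # where logs/checkpoints are written
cfg.gpu_ids = range(1)  # single GPU here; dist_train.sh covers the multi-GPU case
cfg.seed = None

model = build_detector(cfg.model)
model.init_weights()
datasets = [build_dataset(cfg.data.train)]
model.CLASSES = datasets[0].CLASSES  # so saved checkpoints carry class names
train_detector(model, datasets, cfg, distributed=False, validate=True)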

Inference

./tools/dist_test.sh configs/tood/tood_r50_fpn_1x_coco.py work_dirs/tood_r50_fpn_1x_coco/epoch_12.pth 4 --eval bbox
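
For single-image inference from Python, the standard mmdet.apis helpers can be used (a sketch; the image path is just an example):

from mmdet.apis import init_detector, inference_detector

config_file = 'configs/tood/tood_r50_fpn_1x_coco.py'
checkpoint_file = 'work_dirs/tood_r50_fpn_1x_coco/epoch_12.pth'

model = init_detector(config_file, checkpoint_file, device='cuda:0')
result = inference_detector(model, 'demo/demo.jpg')  # any test image
model.show_result('demo/demo.jpg', result, score_thr=0.3, out_file='result.jpg')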

Models

For your convenience, we provide the following trained TOOD models. All models are trained with a mini-batch of 16 images.

Model                          Anchor        MS train  DCN  Lr schd  AP (minival)  AP (test-dev)  Config  Download
TOOD_R_50_FPN_1x               Anchor-free   No        No   1x       42.5          42.7           config  google / baidu
TOOD_R_50_FPN_anchor_based_1x  Anchor-based  No        No   1x       42.4          42.8           config  google / baidu
TOOD_R_101_FPN_2x              Anchor-free   Yes       No   2x       46.2          46.7           config  google / baidu
TOOD_X_101_FPN_2x              Anchor-free   Yes       No   2x       47.6          48.5           config  google / baidu
TOOD_R_101_dcnv2_FPN_2x        Anchor-free   Yes       Yes  2x       49.2          49.6           config  google / baidu
TOOD_X_101_dcnv2_FPN_2x        Anchor-free   Yes       Yes  2x       50.5          51.1           config  google / baidu

[0] All results are obtained with a single model and without any test-time data augmentation such as multi-scale testing or flipping.
[1] dcnv2 denotes deformable convolutional networks v2.
[2] Refer to the config files in configs/tood/ for more details.
[3] Extraction code of baidu netdisk: tood.

Acknowledgement

Thanks to the MMDetection team for the wonderful open-source project!

Citation

If you find TOOD useful in your research, please consider citing:

@inproceedings{feng2021tood,
    title={TOOD: Task-aligned One-stage Object Detection},
    author={Feng, Chengjian and Zhong, Yujie and Gao, Yu and Scott, Matthew R and Huang, Weilin},
    booktitle={ICCV},
    year={2021}
}
Comments
  • Layer Attention instead of Channel Attention?

    Why did you choose Layer Attention instead of normal Channel Attention?
    The task-interactive features are concatenated after N consecutive conv layers, so Channel Attention could further separate individual channels for specific tasks, whereas Layer Attention also separates along the channel dimension but only in groups of 6 (one group per conv layer)?

    opened by iumyx2612 5
  • How about the ATSS assigner as the initial static assignment method?

    In the initial stage of training, the task-alignment metric scores are so small that their values are almost zero, because both the IoUs and the classification scores are low. From my point of view, is ATSS used to select positive samples close to the GT center points in order to accelerate model convergence in early training? (A sketch of this two-stage assignment schedule appears after this list.)

    opened by ttjjmm 2
  • TOOD ONNX file request

    Hello, I was not able to export an ONNX file successfully using mmdet. Is there a TOOD ONNX file available? I would like to visualize the network structure for learning, thank you!

    opened by hjfdsssdg 2
  • I changed the number of ratios and now the model cannot train; what else do I need to modify?

    I changed ratios=[1] to ratios=[2.444, 3.182, 1.574, 1.721, 0.994, 1.163, 0.751, 0.534] and then got an error like this:

    2022-03-29 15:12:23,844 - mmdet - INFO - workflow: [('train', 1)], max: 100 epochs
    2022-03-29 15:12:23,844 - mmdet - INFO - Checkpoints will be saved to E:\Object-Detection\Github\radar-detection\work_dirs\radar_tood by HardDiskBackend.
    D:\App\anaconda\envs\swin-t\lib\site-packages\torch\nn\functional.py:718: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  ..\c10/core/TensorImpl.h:1156.)
      return torch.max_pool2d(input, kernel_size, stride, padding, dilation, ceil_mode)
    Traceback (most recent call last):
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 50, in train
        self.run_iter(data_batch, train_mode=True, **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmcv\runner\epoch_based_runner.py", line 30, in run_iter
        **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmcv\parallel\data_parallel.py", line 75, in train_step
        return self.module.train_step(*inputs[0], **kwargs[0])
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmdet\models\detectors\base.py", line 248, in train_step
        losses = self(**data)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmcv\runner\fp16_utils.py", line 98, in new_func
        return old_func(*args, **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmdet\models\detectors\base.py", line 172, in forward
        return self.forward_train(img, img_metas, **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmdet\models\detectors\single_stage.py", line 84, in forward_train
        gt_labels, gt_bboxes_ignore)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmdet\models\dense_heads\base_dense_head.py", line 330, in forward_train
        outs = self(x)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "D:\App\anaconda\envs\swin-t\lib\site-packages\mmdet\models\dense_heads\tood_head.py", line 263, in forward
        b, h, w, 4).permute(0, 3, 1, 2) / stride[0]
    RuntimeError: shape '[8, 32, 168, 4]' is invalid for input of size 1376256
    

    And this is my config file:

    dataset_type = 'CocoDataset'
    data_root = 'data/coco/'
    img_norm_cfg = dict(
        mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', with_bbox=True),
        dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
        dict(type='RandomFlip', flip_ratio=0.5),
        dict(
            type='Normalize',
            mean=[123.675, 116.28, 103.53],
            std=[58.395, 57.12, 57.375],
            to_rgb=True),
        dict(type='Pad', size_divisor=32),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
    ]
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(
            type='MultiScaleFlipAug',
            img_scale=(1333, 800),
            flip=False,
            transforms=[
                dict(type='Resize', keep_ratio=True),
                dict(type='RandomFlip'),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='ImageToTensor', keys=['img']),
                dict(type='Collect', keys=['img'])
            ])
    ]
    data = dict(
        samples_per_gpu=8,
        workers_per_gpu=1,
        train=dict(
            type='CocoDataset',
            ann_file='E:/Object-Detection/data_radar/devkit/voc07_train.json',
            img_prefix='E:/Object-Detection/data_radar/devkit/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(type='LoadAnnotations', with_bbox=True),
                dict(type='Resize', img_scale=(1333, 800), keep_ratio=True),
                dict(type='RandomFlip', flip_ratio=0.5),
                dict(
                    type='Normalize',
                    mean=[123.675, 116.28, 103.53],
                    std=[58.395, 57.12, 57.375],
                    to_rgb=True),
                dict(type='Pad', size_divisor=32),
                dict(type='DefaultFormatBundle'),
                dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])
            ],
            classes=('loose_l', 'loose_s', 'poor_l', 'porous')),
        val=dict(
            type='CocoDataset',
            ann_file='E:/Object-Detection/data_radar/devkit/voc07_val.json',
            img_prefix='E:/Object-Detection/data_radar/devkit/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(
                    type='MultiScaleFlipAug',
                    img_scale=(1333, 800),
                    flip=False,
                    transforms=[
                        dict(type='Resize', keep_ratio=True),
                        dict(type='RandomFlip'),
                        dict(
                            type='Normalize',
                            mean=[123.675, 116.28, 103.53],
                            std=[58.395, 57.12, 57.375],
                            to_rgb=True),
                        dict(type='Pad', size_divisor=32),
                        dict(type='ImageToTensor', keys=['img']),
                        dict(type='Collect', keys=['img'])
                    ])
            ],
            classes=('loose_l', 'loose_s', 'poor_l', 'porous')),
        test=dict(
            type='CocoDataset',
            ann_file='E:/Object-Detection/data_radar/devkit/voc07_test.json',
            img_prefix='E:/Object-Detection/data_radar/devkit/',
            pipeline=[
                dict(type='LoadImageFromFile'),
                dict(
                    type='MultiScaleFlipAug',
                    img_scale=(1333, 800),
                    flip=False,
                    transforms=[
                        dict(type='Resize', keep_ratio=True),
                        dict(type='RandomFlip'),
                        dict(
                            type='Normalize',
                            mean=[123.675, 116.28, 103.53],
                            std=[58.395, 57.12, 57.375],
                            to_rgb=True),
                        dict(type='Pad', size_divisor=32),
                        dict(type='ImageToTensor', keys=['img']),
                        dict(type='Collect', keys=['img'])
                    ])
            ],
            classes=('loose_l', 'loose_s', 'poor_l', 'porous')))
    evaluation = dict(interval=1, metric='bbox')
    optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
    optimizer_config = dict(grad_clip=None)
    lr_config = dict(
        policy='step',
        warmup='linear',
        warmup_iters=500,
        warmup_ratio=0.001,
        step=[8, 11])
    runner = dict(type='EpochBasedRunner', max_epochs=100)
    checkpoint_config = dict(interval=10)
    log_config = dict(interval=50, hooks=[dict(type='TextLoggerHook')])
    custom_hooks = [dict(type='SetEpochInfoHook')]
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = None
    workflow = [('train', 1)]
    opencv_num_threads = 0
    mp_start_method = 'fork'
    model = dict(
        type='TOOD',
        backbone=dict(
            type='ResNet',
            depth=50,
            num_stages=4,
            out_indices=(0, 1, 2, 3),
            frozen_stages=1,
            norm_cfg=dict(type='BN', requires_grad=True),
            norm_eval=True,
            style='pytorch',
            init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet50')),
        neck=dict(
            type='FPN',
            in_channels=[256, 512, 1024, 2048],
            out_channels=256,
            start_level=1,
            add_extra_convs='on_output',
            num_outs=5),
        bbox_head=dict(
            type='TOODHead',
            num_classes=4,
            in_channels=256,
            stacked_convs=6,
            feat_channels=256,
            anchor_type='anchor_based',
            anchor_generator=dict(
                type='AnchorGenerator',
                ratios=[2.444, 3.182, 1.574, 1.721, 0.994, 1.163, 0.751, 0.534],
                octave_base_scale=1,
                scales_per_octave=1,
                strides=[8, 16, 32, 64, 128]),
            bbox_coder=dict(
                type='DeltaXYWHBBoxCoder',
                target_means=[0.0, 0.0, 0.0, 0.0],
                target_stds=[0.1, 0.1, 0.2, 0.2]),
            initial_loss_cls=dict(
                type='FocalLoss',
                use_sigmoid=True,
                activated=True,
                gamma=2.0,
                alpha=0.25,
                loss_weight=1.0),
            loss_cls=dict(
                type='QualityFocalLoss',
                use_sigmoid=True,
                activated=True,
                beta=2.0,
                loss_weight=1.0),
            loss_bbox=dict(type='GIoULoss', loss_weight=2.0)),
        train_cfg=dict(
            initial_epoch=4,
            initial_assigner=dict(type='ATSSAssigner', topk=9),
            assigner=dict(type='TaskAlignedAssigner', topk=13),
            alpha=1,
            beta=6,
            allowed_border=-1,
            pos_weight=-1,
            debug=False),
        test_cfg=dict(
            nms_pre=1000,
            min_bbox_size=0,
            score_thr=0.05,
            nms=dict(type='nms', iou_threshold=0.6),
            max_per_img=100))
    classes = ('loose_l', 'loose_s', 'poor_l', 'porous')
    work_dir = './work_dirs\radar_tood'
    auto_resume = False
    gpu_ids = [0]
    
    opened by joeyslv 0
  • Plot result

    Hi, thanks for your wonderful work. I have a question: how do you plot the detection results in Figure 1, especially the prediction score map and the localization map?

    opened by qdd1234 9
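
Regarding the ATSS question above: a schematic sketch of the two-stage sample assignment implied by the training config (initial_epoch=4, ATSSAssigner with topk=9 for warm-up, then TaskAlignedAssigner with topk=13). This illustrates the schedule only; it is not the repository's actual assigner code.

def select_assigner(epoch, initial_epoch=4):
    # Early in training, classification scores s and IoUs u are both near zero,
    # so the alignment metric t = s**alpha * u**beta is uninformative; a static,
    # center-based ATSS assignment warms the model up before TAL takes over.
    if epoch < initial_epoch:
        return dict(type='ATSSAssigner', topk=9)       # static warm-up assigner
    return dict(type='TaskAlignedAssigner', topk=13)   # dynamic, t-driven assigner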