DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Overview

Created by Yongming Rao*, Wenliang Zhao*, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, Jiwen Lu.

This repository contains PyTorch implementation for DenseCLIP.

DenseCLIP is a new framework for dense prediction that leverages the pre-trained knowledge in CLIP both implicitly and explicitly. Specifically, we convert the original image-text matching problem in CLIP into a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using contextual information from the image to prompt the language model, we enable the model to better exploit the pre-trained knowledge. Our method is model-agnostic: it can be applied to arbitrary dense prediction systems and to various pre-trained visual backbones, including both CLIP models and ImageNet pre-trained models.
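
To make the pixel-text matching idea concrete, here is a minimal PyTorch sketch of a pixel-text score map computed from dense image features and per-class text embeddings. The shapes and the temperature are illustrative assumptions, not values taken from this repository.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of pixel-text matching (shapes and temperature are assumed).
B, C, H, W, K = 2, 1024, 32, 32, 150      # batch, embed dim, feature map size, #classes
visual_feat = torch.randn(B, C, H, W)     # dense image embeddings from a visual backbone
text_feat = torch.randn(K, C)             # one embedding per class prompt from a text encoder

# Normalize both modalities, then take per-pixel dot products with every class embedding.
visual_feat = F.normalize(visual_feat, dim=1)
text_feat = F.normalize(text_feat, dim=1)
score_map = torch.einsum('bchw,kc->bkhw', visual_feat, text_feat) / 0.07

# The resulting B x K x H x W score map can then guide a dense prediction head.
print(score_map.shape)  # torch.Size([2, 150, 32, 32])
```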

[intro figure]

Our code is based on mmsegmentation, mmdetection, and timm.

[Project Page] [arXiv]

Usage

Requirements

  • torch>=1.8.0
  • torchvision
  • timm
  • mmcv-full==1.3.17
  • mmseg==0.19.0
  • mmdet==2.17.0
  • fvcore

To use our code, please first install mmcv-full and mmseg/mmdet following the official guidelines (mmseg, mmdet) and prepare the datasets accordingly.
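
After installation, a quick version check can save debugging time later. The snippet below simply prints the installed versions; the expected numbers follow the requirements list above and are assumptions about your environment, not output of this repository's code.

```python
# Sanity check: print the installed versions against the requirements above.
import torch, timm, mmcv, mmseg, mmdet

print("torch:", torch.__version__)   # expect >= 1.8.0
print("timm:", timm.__version__)
print("mmcv:", mmcv.__version__)     # expect 1.3.17
print("mmseg:", mmseg.__version__)   # expect 0.19.0
print("mmdet:", mmdet.__version__)   # expect 2.17.0
```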

Pre-trained CLIP Models

Download the pre-trained CLIP models (RN50.pt, RN101.pt, ViT-B-16.pt) and save them to the pretrained folder.
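
If you prefer to fetch the checkpoints programmatically, the official openai/CLIP package can download them for you. This is only a convenience sketch (the package is not listed in the requirements above); it saves the files as RN50.pt, RN101.pt, and ViT-B-16.pt under the chosen download_root.

```python
# Optional: download the CLIP checkpoints with the official openai/CLIP package
# (pip install git+https://github.com/openai/CLIP.git). Files land in ./pretrained.
import clip

for name in ["RN50", "RN101", "ViT-B/16"]:
    clip.load(name, device="cpu", download_root="pretrained")
```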

Segmentation

Model Zoo

We provide DenseCLIP models for the Semantic FPN framework.

| Model | FLOPs (G) | Params (M) | mIoU (SS) | mIoU (MS) | config | url |
| --- | --- | --- | --- | --- | --- | --- |
| RN50-CLIP | 248.8 | 31.0 | 36.9 | 43.5 | config | - |
| RN50-DenseCLIP | 269.2 | 50.3 | 43.5 | 44.7 | config | Tsinghua Cloud |
| RN101-CLIP | 326.6 | 50.0 | 42.7 | 44.3 | config | - |
| RN101-DenseCLIP | 346.3 | 67.8 | 45.1 | 46.5 | config | Tsinghua Cloud |
| ViT-B-CLIP | 1037.4 | 100.8 | 49.4 | 50.3 | config | - |
| ViT-B-DenseCLIP | 1043.1 | 105.3 | 50.6 | 51.3 | config | Tsinghua Cloud |

Training & Evaluation on ADE20K

To train the DenseCLIP model based on CLIP ResNet-50, run:

bash dist_train.sh configs/denseclip_fpn_res50_512x512_80k.py 8

To evaluate the performance with multi-scale testing, run:

bash dist_test.sh configs/denseclip_fpn_res50_512x512_80k.py /path/to/checkpoint 8 --eval mIoU --aug-test

To better measure the complexity of the models, we provide a tool based on fvcore to accurately compute the FLOPs of torch.einsum and other operations:

python get_flops.py /path/to/config --fvcore

You can also remove the --fvcore flag to obtain the FLOPs measured by mmcv for comparison.
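
For reference, the snippet below shows the kind of fvcore-based counting the --fvcore flag relies on. A torchvision ResNet-50 stands in for the actual segmentor (which get_flops.py builds from the mmseg config), so treat it as an illustration rather than the tool itself.

```python
# Illustration of fvcore FLOP counting; a torchvision ResNet-50 is a stand-in model.
import torch
from torchvision.models import resnet50
from fvcore.nn import FlopCountAnalysis, flop_count_table

model = resnet50().eval()
image = torch.randn(1, 3, 512, 512)          # ADE20K crop size used in the configs
flops = FlopCountAnalysis(model, image)      # counts conv, matmul, einsum, ...
print(f"{flops.total() / 1e9:.1f} GFLOPs")
print(flop_count_table(flops, max_depth=1))  # per-module breakdown
```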

Detection

Model Zoo

We provide models for both the RetinaNet and Mask R-CNN frameworks.

RetinaNet

| Model | FLOPs (G) | Params (M) | box AP | config | url |
| --- | --- | --- | --- | --- | --- |
| RN50-CLIP | 265 | 38 | 36.9 | config | - |
| RN50-DenseCLIP | 285 | 60 | 37.8 | config | Tsinghua Cloud |
| RN101-CLIP | 341 | 57 | 40.5 | config | - |
| RN101-DenseCLIP | 360 | 78 | 41.1 | config | Tsinghua Cloud |

Mask R-CNN

| Model | FLOPs (G) | Params (M) | box AP | mask AP | config | url |
| --- | --- | --- | --- | --- | --- | --- |
| RN50-CLIP | 301 | 44 | 39.3 | 36.8 | config | - |
| RN50-DenseCLIP | 327 | 67 | 40.2 | 37.6 | config | Tsinghua Cloud |
| RN101-CLIP | 377 | 63 | 42.2 | 38.9 | config | - |
| RN101-DenseCLIP | 399 | 84 | 42.6 | 39.6 | config | Tsinghua Cloud |

Training & Evaluation on COCO

To train our DenseCLIP-RN50 using the RetinaNet framework, run:

bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8

To evaluate the box AP of RN50-DenseCLIP (RetinaNet), run:

bash dist_test.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox

To evaluate both the box AP and the mask AP of RN50-DenseCLIP (Mask R-CNN), run:

bash dist_test.sh configs/mask_rcnn_denseclip_r50_fpn_1x_coco.py /path/to/checkpoint 8 --eval bbox segm

License

MIT License

Citation

If you find our work useful in your research, please consider citing:

@article{rao2021denseclip,
  title={DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting},
  author={Rao, Yongming and Zhao, Wenliang and Chen, Guangyi and Tang, Yansong and Zhu, Zheng and Huang, Guan and Zhou, Jie and Lu, Jiwen},
  journal={arXiv preprint arXiv:2112.01518},
  year={2021}
}
Comments
  • Cannot reproduce the result of DenseCLIP-R50

    Hi,

    I followed your instructions for training DenseCLIP-R50, with a batch size of 4 x 8 GPUs.

    However, I cannot reproduce the result reported in your paper (43.5 mIoU); I only get 42.8 mIoU.

    Could you provide the training log file, or more details (e.g., the seed) needed to reproduce the paper's results? Thanks!

    opened by Richardych 8
  • Some questions about ViT-B-DenseCLIP

    1. I would like to know the performance of ViT-B-DenseCLIP (vs. RN101-DenseCLIP). Can you share its specific results, and how do I train ViT-B-DenseCLIP on COCO or ADE20K?
    2. Is ViT-B-DenseCLIP based on ViT-B-16.pt rather than ViT-B-32.pt?

    opened by lixiangMindSpore 6
  • Question about the any-backbone experiments on ADE20K segmentation

    Hi @raoyongming, thanks very much for your great work. I have some questions about the any-backbone experiments on ADE20K segmentation in Table 5. For models without CLIP pre-training, e.g., ResNet-18 and Swin Transformer-T/S, the improvement on ADE20K seems less significant than for RN50. Do you compute the visual-text feature interaction directly, or are there other tricks involved? Thanks!

    opened by wanglixilinx 5
  • A stupid question about the auxiliary loss for object detection & instance segmentation

    The paper says, "we do not have ground truth segmentation label." I can understand that there is no segmentation mask for detection, but why is there no segmentation mask for the instance segmentation task?

    opened by waxnkw 4
  • Questions about the configuration details of RN50-CLIP

    I cannot reach the mIoU of RN50-CLIP reported in the paper, even though I used the configuration mentioned in the README. Could you please tell me what batch size and how many GPUs were used? More implementation details would be very helpful. I tried a batch size of 16 but only got 38.85 mIoU. My configuration is below, and the log file is attached.

    ```python
    norm_cfg = dict(type='SyncBN', requires_grad=True)
    model = dict(
        type='EncoderDecoder',
        pretrained='pretrained/RN50.pt',
        backbone=dict(
            type='CLIPResNet', depth=50, num_stages=4, out_indices=(0, 1, 2, 3),
            dilations=(1, 1, 1, 1), strides=(1, 2, 2, 2),
            norm_cfg=dict(type='SyncBN', requires_grad=True), norm_eval=False,
            style='pytorch', contract_dilation=True, layers=[3, 4, 6, 3]),
        neck=dict(
            type='FPN', in_channels=[256, 512, 1024, 2048], out_channels=256,
            num_outs=4),
        decode_head=dict(
            type='FPNHead', in_channels=[256, 256, 256, 256], in_index=[0, 1, 2, 3],
            feature_strides=[4, 8, 16, 32], channels=256, dropout_ratio=0.1,
            num_classes=150, norm_cfg=dict(type='SyncBN', requires_grad=True),
            align_corners=False,
            loss_decode=dict(
                type='CrossEntropyLoss', use_sigmoid=False, loss_weight=1.0)),
        train_cfg=dict(),
        test_cfg=dict(mode='slide', crop_size=(512, 512), stride=(341, 341)))
    dataset_type = 'ADE20KDataset'
    data_root = 'data/ade/ADEChallengeData2016'
    IMG_MEAN = [122.7709383, 116.7460125, 104.09373615000001]
    IMG_VAR = [68.5005327, 66.6321579, 70.32316304999999]
    img_norm_cfg = dict(
        mean=[122.7709383, 116.7460125, 104.09373615000001],
        std=[68.5005327, 66.6321579, 70.32316304999999], to_rgb=True)
    crop_size = (512, 512)
    train_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='LoadAnnotations', reduce_zero_label=True),
        dict(type='Resize', img_scale=(2048, 512), ratio_range=(0.5, 2.0)),
        dict(type='RandomCrop', crop_size=(512, 512), cat_max_ratio=0.75),
        dict(type='RandomFlip', prob=0.5),
        dict(type='PhotoMetricDistortion'),
        dict(type='Normalize',
             mean=[122.7709383, 116.7460125, 104.09373615000001],
             std=[68.5005327, 66.6321579, 70.32316304999999], to_rgb=True),
        dict(type='Pad', size=(512, 512), pad_val=0, seg_pad_val=255),
        dict(type='DefaultFormatBundle'),
        dict(type='Collect', keys=['img', 'gt_semantic_seg'])
    ]
    test_pipeline = [
        dict(type='LoadImageFromFile'),
        dict(type='MultiScaleFlipAug',
             img_scale=(2048, 512),
             flip=False,
             transforms=[
                 dict(type='Resize', keep_ratio=True),
                 dict(type='RandomFlip'),
                 dict(type='Normalize',
                      mean=[122.7709383, 116.7460125, 104.09373615000001],
                      std=[68.5005327, 66.6321579, 70.32316304999999],
                      to_rgb=True),
                 dict(type='ImageToTensor', keys=['img']),
                 dict(type='Collect', keys=['img'])
             ])
    ]
    data = dict(
        samples_per_gpu=4,
        workers_per_gpu=4,
        # The dumped config expands train_pipeline/test_pipeline inline here;
        # they are identical to the definitions above.
        train=dict(
            type='ADE20KDataset',
            data_root='data/ade/ADEChallengeData2016',
            img_dir='images/training',
            ann_dir='annotations/training',
            pipeline=train_pipeline),
        val=dict(
            type='ADE20KDataset',
            data_root='data/ade/ADEChallengeData2016',
            img_dir='images/validation',
            ann_dir='annotations/validation',
            pipeline=test_pipeline),
        test=dict(
            type='ADE20KDataset',
            data_root='data/ade/ADEChallengeData2016',
            img_dir='images/validation',
            ann_dir='annotations/validation',
            pipeline=test_pipeline))
    log_config = dict(
        interval=50, hooks=[dict(type='TextLoggerHook', by_epoch=False)])
    dist_params = dict(backend='nccl')
    log_level = 'INFO'
    load_from = None
    resume_from = None
    workflow = [('train', 1)]
    cudnn_benchmark = True
    find_unused_parameters = True
    optimizer = dict(
        type='AdamW', lr=0.0001, weight_decay=0.0001,
        paramwise_cfg=dict(
            custom_keys=dict(
                backbone=dict(lr_mult=0.1), norm=dict(decay_mult=0.0))))
    optimizer_config = dict()
    lr_config = dict(
        policy='poly', power=0.9, min_lr=1e-06, by_epoch=False, warmup='linear',
        warmup_iters=1500, warmup_ratio=1e-06)
    runner = dict(type='IterBasedRunner', max_iters=80000)
    checkpoint_config = dict(by_epoch=False, interval=8000)
    evaluation = dict(interval=8000, metric='mIoU')
    work_dir = './work_dirs/fpn_clipres50_test4k'
    gpu_ids = range(0, 1)
    ```

    20220320_015954.log

    opened by JasonLin1998 4
  • Single GPU error

    Hi, I've modified the settings for single-GPU (vs. multi-GPU) training:

    norm_cfg = dict(type='BN', requires_grad=True)

    But I still get such errors. Does that mean I need to modify the training part of the mmseg source?

    opened by Virgilzz 3
  • Question about the inference setting

    Hi, thanks for sharing your work!

    Does DenseCLIP use the pre-trained CLIP encoder at inference time?

    I think the pre-trained CLIP encoder is needed to compute the pixel-text score maps at inference, so the model should require it. I am wondering whether the CLIP encoder is used during inference or not.

    Thanks.

    opened by dneirfi 3
  • Multi-GPU error

    Hello, I want to know whether the code can be trained with multiple GPUs. The given command uses multiple GPUs, e.g., "bash dist_train.sh configs/retinanet_denseclip_r50_fpn_1x_coco.py 8", but when I run it, it fails with the following errors:

    [] [] are misaligned params in CLIPResNet
    [] [] are misaligned params in CLIPResNet
    [] [] are misaligned params in text encoder
    [] [] are misaligned params in text encoder

    I also found the related comment in the code.

    opened by eternaldolphin 3
  • Query on the inference setting

    Hi,

    Thanks for making the code public!

    I have a general query about the inference setting chosen for this paper: why does it not target the zero-shot setting and instead focus on the fully supervised setting? Is there a particular reason? Since the power of CLIP lies in zero-shot task transfer, I was wondering why no such experiments were done and why the problem is posed as a multi-modal, fully supervised dense prediction task.

    Thanks in advance.

    opened by sauradip 3
  • Question about the implementation of CLIPResNet

    Hi, the modified ResNet in CLIP uses attention pooling, and the comments in the CLIPResNet class note this as well. However, I didn't see any related operations in CLIPResNet. I know there is another CLIPResNetWithAttention class, but according to the configs, I think it is used for DenseCLIP, not for the CLIP baseline?

    opened by littlepenguin89106 2
  • [critical bug] The text encoder is also updated.

    I found that the text encoder is also updated. The positional embedding of the provided "denseclip_fpn_res50.pth" is

    tensor([[-0.0013, 0.0003, 0.0007, ..., -0.0027, -0.0091, -0.0024],
            [-0.0039, -0.0008, -0.0016, ..., -0.0006, -0.0049, -0.0044],
            [-0.0044, 0.0011, -0.0007, ..., -0.0026, -0.0094, -0.0008],
            ...,
            [-0.0002, -0.0002, -0.0012, ..., 0.0007, 0.0013, -0.0002],
            [-0.0016, -0.0015, -0.0001, ..., -0.0010, -0.0025, -0.0004],
            [-0.0030, -0.0013, -0.0004, ..., -0.0028, -0.0052, -0.0016]])

    while the first 13 positional embeddings of the pre-trained RN50 model are

    tensor([[-0.0012, 0.0003, 0.0008, ..., -0.0027, -0.0090, -0.0024],
            [-0.0040, -0.0008, -0.0015, ..., -0.0006, -0.0049, -0.0045],
            [-0.0044, 0.0011, -0.0006, ..., -0.0025, -0.0093, -0.0007],
            ...,
            [-0.0002, -0.0002, -0.0011, ..., 0.0006, 0.0011, -0.0003],
            [-0.0018, -0.0016, -0.0002, ..., -0.0009, -0.0025, -0.0004],
            [-0.0031, -0.0014, -0.0006, ..., -0.0026, -0.0053, -0.0015]], device='cuda:0', grad_fn=)

    which is slightly different.

    I guess the reason is that "lr_mult" does not guarantee a zero learning rate: the learning rate of the text encoder may become larger than 0 due to the internal behavior of the LR scheduler. I think this is quite a critical bug, since it may affect the result of the ablation study (Table 2 in the paper).

    Also, I have one more question: why do you set lr_mult to 0 for 'norm'? As far as I know, the mmcv library tries to set the learning rate to 0 for every module whose name includes the key "norm". If that is right, every normalization layer in the transformer layers (especially in the context decoder) will have a learning rate of 0.

    opened by SeongwoongCho 2
  • Are the default training iterations enough to reach the paper's performance?

    I tried to train a DenseCLIP model based on CLIP ResNet-50 using the default configuration, where the number of iterations is 80,000. After training, the model reaches 39.46 mIoU on the test set, which is lower than the 43.5 reported in the paper. The attached images show the testing result and the training history.

    opened by williamlus 1
  • Open-set inference without training?

    Thanks for the great work. It seems that the released model is trained on the ADE20K dataset. If we want to test with other text descriptions (classes), do we have to re-train the model? Is there any way to do open-set inference without training? For example, can I directly use the pre-trained CLIP model to compute the pixel-wise dot products? Do you have code support for this kind of use?

    opened by Colin97 6