Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".


Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

  • Detects any class given class names (using CLIP).

  • We train the detector on the ImageNet-21K dataset with 21K classes.

  • Cross-dataset generalization to OpenImages and Objects365 without finetuning.

  • State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.

  • Works for DETR-style detectors.

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo: Hugging Face Spaces

Run our demo using Colab (no GPU needed): Open In Colab

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If set up correctly, the output should look like:

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

  • Open-vocabulary LVIS

    |                | mask mAP | mask mAP_novel |
    |----------------|----------|----------------|
    | Box-Supervised | 30.2     | 16.4           |
    | Detic          | 32.4     | 24.9           |

  • Standard LVIS

    | Detector       | Backbone                | mask mAP | mask mAP_rare |
    |----------------|-------------------------|----------|---------------|
    | Box-Supervised | CenterNet2-ResNet50     | 31.5     | 25.6          |
    | Detic          | CenterNet2-ResNet50     | 33.2     | 29.7          |
    | Box-Supervised | CenterNet2-SwinB        | 40.7     | 35.9          |
    | Detic          | CenterNet2-SwinB        | 41.7     | 41.7          |

    | Detector       | Backbone                | box mAP | box mAP_rare |
    |----------------|-------------------------|---------|--------------|
    | Box-Supervised | DeformableDETR-ResNet50 | 31.7    | 21.4         |
    | Detic          | DeformableDETR-ResNet50 | 32.5    | 26.2         |

  • Cross-dataset generalization

    | Detector       | Backbone | Objects365 box mAP | OpenImages box mAP50 |
    |----------------|----------|--------------------|----------------------|
    | Box-Supervised | SwinB    | 19.1               | 46.2                 |
    | Detic          | SwinB    | 21.4               | 55.2                 |

License

The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE). If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}
Comments
  • To confirm some technical questions

    Hi, thanks for sharing this nice work! I would like to confirm the following questions:

    • Is the image encoder in CLIP not used here, and replaced instead by the zero-shot classifier?
    • The text embeddings are precomputed by the CLIP text encoder and saved as metadata/lvis_v1_clip_a+cname.npy for the LVIS val set, so the text encoder can be regarded as fixed during training?
    • In the zero-shot classifier's forward pass, what do x and classifier represent?
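
    As context for the second point, here is a minimal sketch of how fixed text embeddings for a list of class names could be precomputed with the openai/CLIP package; the prompt template, normalization, and output path are assumptions, and Detic's own dump script may differ:

    # Sketch: precompute CLIP text embeddings for class names and save them
    # as a .npy file that can later be loaded as fixed classifier weights.
    import clip  # pip install git+https://github.com/openai/CLIP.git
    import numpy as np
    import torch

    class_names = ["person", "bicycle", "coffee"]  # hypothetical vocabulary

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    prompts = [f"a {c}" for c in class_names]  # assumed prompt template
    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize(prompts).to(device))
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    np.save("metadata/custom_clip_embeddings.npy", text_features.cpu().numpy())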
    opened by Kyfafyd 10
  • Data Preparation for COCO Issue

    Hello, I am trying to prepare the data for zero-shot COCO.

    When I run get_cc_tags.py, the code breaks because 'categories' is not in cc_data, here: https://github.com/facebookresearch/Detic/blob/main/tools/get_cc_tags.py#L135

    If I try to use --cat_path, the code again breaks because 'categories' is not in cc_data, here: https://github.com/facebookresearch/Detic/blob/main/tools/get_cc_tags.py#L128

    Could you help me out with this?

    Thanks so much!

    opened by greeneggsandyaml 10
  • Problem in running Lazy config for Detic_ViLD training

    Hi @xingyizhou

    Thank you for sharing the great work.

    I am trying to reproduce the Detic results on LVIS. First, the Box-Supervised ViLD baseline model has been trained successfully.

    Now, using its weights, I am trying to run the Detic configuration with the Detic_ViLD_200e.py config file, but after running lazy_train_net.py it prints the initial logs and gets stuck as soon as iteration 0 begins. I am attaching screenshots below; after the last warning, training does not start, while the utilization of all GPUs instantly reaches 100% and stays there. Even after quitting the Python run, the memory usage remains stuck at 100%, and I have to manually kill the Python process to reset the GPU utilization.

    Can you please help me run the Detic training for LVIS? I am using the same environment and workstation on which the Box-Supervised ViLD baseline was trained.

    Waiting for your kind response. Thank you.

    opened by muzairkhattak 6
  • How can I make inference faster at the expense of something (accuracy, number of classes, etc.)?

    I am currently trying to use this software in a real-time domain like robotics, where faster inference is often preferred over higher accuracy. https://github.com/HiroIshida/detic_ros

    So it would be great if someone could share tips or config settings to make inference faster at the cost of something (accuracy, number of classes, etc.).

    A simple attempt to that end might be changing the input image size. However, the inference time for varying input sizes is as follows, which suggests the input image size is not the primary factor: 300x300 → 0.19 s, 100x100 → 0.17 s, 50x50 → 0.165 s (using a GeForce GTX 1080 Ti and the same model as the demo described in the README).
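
    A minimal sketch of the kind of detectron2 test-time config knobs that are often tried for a speed/accuracy trade-off follows; whether each one helps for this particular model is an assumption, and cfg is assumed to be built as in demo.py:

    # Sketch: common detectron2 config knobs for trading accuracy for speed.
    cfg.defrost()
    cfg.INPUT.MIN_SIZE_TEST = 480                 # smaller test-time resolution
    cfg.INPUT.MAX_SIZE_TEST = 800
    cfg.TEST.DETECTIONS_PER_IMAGE = 50            # keep fewer detections per image
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # discard low-confidence boxes earlier
    cfg.freeze()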

    Finally, thank you very much for releasing such great software.

    opened by HiroIshida 6
  • TorchScript export error

    Hi,

    Thank you for the great work. I am exporting a TorchScript model, but during inference I run into a shape error for the model corresponding to the config "Detic_LI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml".

    Example export script:

    import json

    import cv2
    import torch
    from detectron2.export import TracingAdapter
    from detectron2.modeling import GeneralizedRCNN

    # Prepare inputs. `model` and `aug` are assumed to come from the usual
    # detectron2 predictor setup (the Detic model built from the config and a
    # ResizeShortestEdge transform, respectively).
    im = cv2.imread('input.jpg')
    image = aug.get_transform(im).apply_image(im)
    image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
    inputs = [{"image": image}]

    # Export to TorchScript
    if isinstance(model, GeneralizedRCNN):
        def inference(model, inputs):
            # use do_postprocess=False so it returns ROI mask
            inst = model.inference(inputs, do_postprocess=False)[0]
            return [{"instances": inst}]
    else:
        inference = None

    traceable_model = TracingAdapter(model, inputs, inference)
    traceable_model.eval()
    # Option 1: pass the image array as a tuple
    ts_model = torch.jit.trace(traceable_model, (image,))
    # Option 2:
    # ts_model = torch.jit.trace(traceable_model, image, strict=False)
    d = {"shape": image.shape}
    extra_files = {'exported_config.txt': json.dumps(d)}

    with open("models/detic_model_2.ts", "wb") as f:
        torch.jit.save(ts_model, f, _extra_files=extra_files)
    
    opened by anshudaur 5
  • How are the bbox labels predicted during testing?

    I am curious about how the bbox labels are predicted, since the data are split into base and novel classes. Taking LVIS as an example, when calculating AP_novel, are the labels obtained by taking the maximum over the 1203 scores of all categories, or only the maximum over the 337 novel categories?
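
    To make the two alternatives in the question concrete, here is a tiny NumPy sketch; the class counts match LVIS, but the novel-category indices are placeholders, not the actual LVIS ids:

    import numpy as np

    num_classes, num_novel = 1203, 337
    novel_ids = np.arange(num_classes - num_novel, num_classes)  # placeholder ids

    scores = np.random.rand(5, num_classes)  # per-box classification scores

    # Alternative 1: label = argmax over all 1203 categories.
    labels_all = scores.argmax(axis=1)

    # Alternative 2: label = argmax restricted to the 337 novel categories.
    labels_novel = novel_ids[scores[:, novel_ids].argmax(axis=1)]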

    opened by wusize 5
  • Fix links

    Hi. Fixed links:

    • to 21K model and LVIS API licence in README.md
    • to UniDet repository in datasets/README.md
    • to Detic_DeformDETR_R50_2x config in docs/MODEL_ZOO.md
    cla signed 
    opened by amrzv 4
  • Can I extract just the prediction labels and the confidence level?

    Hi, thanks for making Detic available; there's some fantastic functionality in here.

    My issue is quite mundane: How can I get the predictor function to output the prediction classes as text strings with an associated confidence level? I'd like to iterate across a large number of images and record the results in a pandas dataframe or similar. Having them overlaid on an image is less useful for this.

    Thanks!
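
    A minimal sketch of one way to do this with a detectron2-style predictor follows; cfg and the class-name metadata are assumed to be set up as in demo.py, and the image list and metadata key are placeholders:

    import cv2
    import pandas as pd
    from detectron2.data import MetadataCatalog
    from detectron2.engine import DefaultPredictor

    predictor = DefaultPredictor(cfg)  # cfg built as in demo.py (assumption)
    class_names = MetadataCatalog.get(cfg.DATASETS.TEST[0]).thing_classes

    rows = []
    for path in ["image1.jpg", "image2.jpg"]:  # placeholder image list
        instances = predictor(cv2.imread(path))["instances"].to("cpu")
        for label, score in zip(instances.pred_classes.tolist(), instances.scores.tolist()):
            rows.append({"image": path, "class": class_names[label], "score": score})

    df = pd.DataFrame(rows)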

    opened by texturejc 4
  • Questions about CLIP embeddings

    Hi, I have some questions about using CLIP embeddings as classification weights.

    1. I think there is no code other than the model-loading code that loads CLIP embeddings as classification weights. Aren't the provided weights trained? (Are the CLIP embeddings only loaded?)

    2. I think it may also work if the FC layer of a classification network is used as the classification weights of the detector. Why did you choose to use CLIP embeddings instead of a pre-trained classification network?

    Thank you.

    opened by dk-hong 3
  • How can I get 1280-dimensional box_features?

    Hello.

    I was able to get box_features from a previous question.

    The dimension of these box_features was 1024, as specified in the config file.

    I'm trying to use these box_features in another pre-trained model, but the dimensions don't match, because the other pre-trained model expects 1280-dimensional features.

    I was wondering whether I need to re-train the model from scratch to get 1280-dimensional box_features, or whether there is another way to obtain them.

    Or, would it be a good approach to extend the dimension from 1024 to 1280 with an autoencoder or a PCA-style projection?
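
    As a rough illustration of that last idea, a learned linear adapter is one simple option; this is only a sketch of the general technique, not something provided by Detic, and the layer would still need to be trained for the downstream model:

    import torch
    import torch.nn as nn

    # Hypothetical adapter mapping 1024-d box features to a 1280-d space.
    adapter = nn.Linear(1024, 1280)

    box_features = torch.randn(100, 1024)  # placeholder for extracted features
    projected = adapter(box_features)      # shape: (100, 1280)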

    Thank you for your time to read this question!

    opened by yoojin9649 3
  • Screenshot source and capability to run on CPU only

    I have added a few features to the demo application. One is a screenshot video source, so that the screen itself can be used as video input. I also added an additional argument for specifying CPU-only execution, as I do not have a GPU on my Mac. Is this of interest? Should I prepare a PR with this?

    Screenshots like this can be run in the demo application.
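
    For reference, a minimal sketch of how CPU-only inference is typically forced in detectron2-based code; whether further Detic-specific changes are needed is an assumption:

    cfg.defrost()              # cfg built as in demo.py (assumption)
    cfg.MODEL.DEVICE = "cpu"   # run the model on CPU instead of CUDA
    cfg.freeze()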

    opened by joakimeriksson 3
  • DLWL reproduction

    Hi, in your repo there seem to be implementations for several loss variants, but I cannot find the implementation of DLWL, which uses an ADMM algorithm for label assignment. I would be grateful if you could provide it, since the authors of DLWL have not released their code. Thanks!

    opened by jihwanp 0
  • Unable to convert OpenImages annotations to COCO format

    Hello Xingyi Zhou, Thank you for your brilliant work.

    As stated in datasets/README.md, I followed the instructions from UniDet:

    We followed the instructions in UniDet to convert the metadata for OpenImages.

    But there are some problems.

    1. Running python tools/convert_datasets/convert_oid.py -p datasets/oid/ --subsets train fails with NameError: name 'image_label_sourcefile' is not defined
    2. Running python tools/convert_datasets/convert_oid.py -p datasets/oid/ --subsets val --expand_label fails with FileNotFoundError: [Errno 2] No such file or directory: 'datasets/oid/annotations/challenge-2019-validation-detection-human-imagelabels_expanded.csv'
    3. The URL for the pre-processed annotation files does not contain any annotation files, not even the old URL from the commit history.

    Our pre-processed annotation files can be directly downloaded here.

    Can anyone help?

    Similar issue: https://github.com/xingyizhou/UniDet/issues/2

    opened by wikiwen 0
  • What is the best confidence threshold for inference?

    Hello @facebookresearch, thank you for your fantastic work. I am a student, and I wonder what the best confidence threshold is for the inference phase (for both the COCO and ImageNet datasets).

    Thank you for your help; this value will benefit further research.

    Best, Tin

    opened by ngthanhtin 0
  • Upgrade to Cog version 0.1

    The new version of Cog improves the Python API, along with several other changes. In particular, pydantic is now used for the Predictor, and the previous version will be deprecated.

    This PR upgrades the Replicate demo and API to Cog version >= 0.1. I have already pushed this to Replicate, so you don't need to do anything for the demo to keep working :) https://replicate.com/facebookresearch/detic

    cla signed 
    opened by chenxwh 0
  • Possible data leak in lvis_v1_train_cat_info.json

    Hi, Xingyi!

    I loaded the file "lvis_v1_train_cat_info.json", and it seems to contain image_count for rare classes. This may lead to a data leak in the open-vocabulary setting when using the federated loss.

    opened by wusize 2