Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".


Detecting Twenty-thousand Classes using Image-level Supervision

Detic: A Detector with image classes that can use image-level labels to easily train detectors.

Detecting Twenty-thousand Classes using Image-level Supervision,
Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra,
arXiv technical report (arXiv 2201.02605)

Features

  • Detects any class given class names (using CLIP).

  • We train the detector on the ImageNet-21K dataset with 21K classes.

  • Cross-dataset generalization to OpenImages and Objects365 without finetuning.

  • State-of-the-art results on Open-vocabulary LVIS and Open-vocabulary COCO.

  • Works for DETR-style detectors.

Installation

See installation instructions.

Demo

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the web demo: Hugging Face Spaces

Run our demo using Colab (no GPU needed): Open In Colab

We use the default detectron2 demo interface. For example, to run our 21K model on a messy desk image (image credit David Fouhey) with the lvis vocabulary, run

mkdir models
wget https://dl.fbaipublicfiles.com/detic/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth -O models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth
wget https://web.eecs.umich.edu/~fouhey/fun/desk/desk.jpg
python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out.jpg --vocabulary lvis --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

If set up correctly, the output should look like:

The same model can run with other vocabularies (COCO, OpenImages, or Objects365), or a custom vocabulary. For example:

python demo.py --config-file configs/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml --input desk.jpg --output out2.jpg --vocabulary custom --custom_vocabulary headphone,webcam,paper,coffe --confidence-threshold 0.3 --opts MODEL.WEIGHTS models/Detic_LCOCOI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.pth

The output should look like:

Note that headphone, paper and coffe (typo intended) are not LVIS classes. Despite the misspelled class name, our detector can produce a reasonable detection for coffe.

Benchmark evaluation and training

Please first prepare datasets, then check our MODEL ZOO to reproduce results in our paper. We highlight key results below:

  • Open-vocabulary LVIS

    |                | mask mAP | mask mAP_novel |
    |----------------|----------|----------------|
    | Box-Supervised | 30.2     | 16.4           |
    | Detic          | 32.4     | 24.9           |

  • Standard LVIS

    | Detector       | Backbone                | mask mAP | mask mAP_rare |
    |----------------|-------------------------|----------|---------------|
    | Box-Supervised | CenterNet2-ResNet50     | 31.5     | 25.6          |
    | Detic          | CenterNet2-ResNet50     | 33.2     | 29.7          |
    | Box-Supervised | CenterNet2-SwinB        | 40.7     | 35.9          |
    | Detic          | CenterNet2-SwinB        | 41.7     | 41.7          |

    | Detector       | Backbone                | box mAP | box mAP_rare |
    |----------------|-------------------------|---------|--------------|
    | Box-Supervised | DeformableDETR-ResNet50 | 31.7    | 21.4         |
    | Detic          | DeformableDETR-ResNet50 | 32.5    | 26.2         |

  • Cross-dataset generalization

    | Detector       | Backbone | Objects365 box mAP | OpenImages box mAP50 |
    |----------------|----------|--------------------|----------------------|
    | Box-Supervised | SwinB    | 19.1               | 46.2                 |
    | Detic          | SwinB    | 21.4               | 55.2                 |

License

The majority of Detic is licensed under the Apache 2.0 license; however, portions of the project are available under separate license terms: SWIN-Transformer, CLIP, and TensorFlow Object Detection API are licensed under the MIT license; UniDet is licensed under the Apache 2.0 license; and the LVIS API is licensed under a custom license (https://github.com/lvis-dataset/lvis-api/blob/master/LICENSE). If you later add other third-party code, please keep this license info updated, and please let us know if that component is licensed under something other than CC-BY-NC, MIT, or CC0.

Ethical Considerations

Detic's wide range of detection capabilities may introduce similar challenges to many other visual recognition and open-set recognition methods. As the user can define arbitrary detection classes, class design and semantics may impact the model output.

Citation

If you find this project useful for your research, please use the following BibTeX entry.

@inproceedings{zhou2021detecting,
  title={Detecting Twenty-thousand Classes using Image-level Supervision},
  author={Zhou, Xingyi and Girdhar, Rohit and Joulin, Armand and Kr{\"a}henb{\"u}hl, Philipp and Misra, Ishan},
  booktitle={arXiv preprint arXiv:2201.02605},
  year={2021}
}
Comments
  • To confirm some technical questions

    Hi, thanks for sharing this nice work! I would like to confirm the following questions:

    • Is the image encoder in CLIP not used here, and replaced instead by the zero-shot classifier?
    • The text embeddings are precomputed by the CLIP text encoder and saved as metadata/lvis_v1_clip_a+cname.npy for the LVIS val set, so the text encoder can be regarded as fixed during training?
    • In the zero-shot classifier's forward pass, what do x and classifier represent?
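
    As context for the second point, here is a minimal sketch of how fixed text embeddings for a list of class names could be precomputed with the openai/CLIP package; the prompt template, normalization, and output path are assumptions, and Detic's own dump script may differ:

    # Sketch: precompute CLIP text embeddings for class names and save them
    # as a .npy file that can later be loaded as fixed classifier weights.
    import clip  # pip install git+https://github.com/openai/CLIP.git
    import numpy as np
    import torch

    class_names = ["person", "bicycle", "coffee"]  # hypothetical vocabulary

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)

    prompts = [f"a {c}" for c in class_names]  # assumed prompt template
    with torch.no_grad():
        text_features = model.encode_text(clip.tokenize(prompts).to(device))
        text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    np.save("metadata/custom_clip_embeddings.npy", text_features.cpu().numpy())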
    opened by Kyfafyd 10
  • Data Preparation for COCO Issue

    Hello, I am trying to prepare the data for zero-shot COCO.

    When I run get_cc_tags.py, the code breaks because 'categories' is not in cc_data, here: https://github.com/facebookresearch/Detic/blob/main/tools/get_cc_tags.py#L135

    If I try to use --cat_path, the code again breaks because 'categories' is not in cc_data, here: https://github.com/facebookresearch/Detic/blob/main/tools/get_cc_tags.py#L128

    Could you help me out with this?

    Thanks so much!

    opened by greeneggsandyaml 10
  • Problem in running Lazy config for Detic_ViLD training

    Hi @xingyizhou

    Thank you for sharing the great work.

    I am trying to reproduce the Detic results on LVIS. First, the Box-Supervised ViLD baseline model has been trained successfully.

    Now, using its weights, I am trying to run the Detic configuration with the Detic_ViLD_200e.py config file, but after running lazy_train_net.py it prints the initial logs and gets stuck as soon as iteration 0 begins. I am attaching screenshots below; after the last warning, training does not start, while the utilization of all GPUs instantly reaches 100% and stays there. Even after quitting the Python run, the memory usage remains stuck at 100%, and I have to manually kill the Python process to reset the GPU utilization.

    Can you please help me run the Detic training for LVIS? I am using the same environment and workstation on which the Box-Supervised ViLD baseline was trained.

    Waiting for your kind response. Thank you.

    opened by muzairkhattak 6
  • How can I make inference faster at the expense of something (accuracy, number of classes, etc.)?

    I am currently trying to use this software in a real-time domain like robotics, where faster inference is often preferred over higher accuracy. https://github.com/HiroIshida/detic_ros

    So it would be great if someone could share tips or config settings to make inference faster at the cost of something (accuracy, number of classes, etc.).

    A simple attempt to that end might be changing the input image size. However, the inference time for varying input sizes is as follows, which suggests the input image size is not the primary factor: 300x300 → 0.19 s, 100x100 → 0.17 s, 50x50 → 0.165 s (using a GeForce GTX 1080 Ti and the same model as the demo described in the README).
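
    A minimal sketch of the kind of detectron2 test-time config knobs that are often tried for a speed/accuracy trade-off follows; whether each one helps for this particular model is an assumption, and cfg is assumed to be built as in demo.py:

    # Sketch: common detectron2 config knobs for trading accuracy for speed.
    cfg.defrost()
    cfg.INPUT.MIN_SIZE_TEST = 480                 # smaller test-time resolution
    cfg.INPUT.MAX_SIZE_TEST = 800
    cfg.TEST.DETECTIONS_PER_IMAGE = 50            # keep fewer detections per image
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5   # discard low-confidence boxes earlier
    cfg.freeze()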

    Finally, thank you very much for releasing such great software.

    opened by HiroIshida 6
  • TorchScript export error

    Hi,

    Thank you for the great work. I am exporting a TorchScript model, but during inference I run into a shape error for the model corresponding to the config "Detic_LI21k_CLIP_SwinB_896b32_4x_ft4x_max-size.yaml".

    Example export script:

    import json

    import cv2
    import torch
    from detectron2.export import TracingAdapter
    from detectron2.modeling import GeneralizedRCNN

    # Prepare inputs. `model` and `aug` are assumed to come from the usual
    # detectron2 predictor setup (the Detic model built from the config and a
    # ResizeShortestEdge transform, respectively).
    im = cv2.imread('input.jpg')
    image = aug.get_transform(im).apply_image(im)
    image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
    inputs = [{"image": image}]

    # Export to TorchScript
    if isinstance(model, GeneralizedRCNN):
        def inference(model, inputs):
            # use do_postprocess=False so it returns ROI mask
            inst = model.inference(inputs, do_postprocess=False)[0]
            return [{"instances": inst}]
    else:
        inference = None

    traceable_model = TracingAdapter(model, inputs, inference)
    traceable_model.eval()
    # Option 1: pass the image array as a tuple
    ts_model = torch.jit.trace(traceable_model, (image,))
    # Option 2:
    # ts_model = torch.jit.trace(traceable_model, image, strict=False)
    d = {"shape": image.shape}
    extra_files = {'exported_config.txt': json.dumps(d)}

    with open("models/detic_model_2.ts", "wb") as f:
        torch.jit.save(ts_model, f, _extra_files=extra_files)
    
    opened by anshudaur 5
  • How are the bbox labels predicted during testing?

    I am curious about how the bbox labels are predicted, since the data are split into base and novel classes. Taking LVIS as an example, when calculating AP_novel, are the labels obtained by taking the maximum over the 1203 scores of all categories, or only the maximum over the 337 novel categories?
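
    To make the two alternatives in the question concrete, here is a tiny NumPy sketch; the class counts match LVIS, but the novel-category indices are placeholders, not the actual LVIS ids:

    import numpy as np

    num_classes, num_novel = 1203, 337
    novel_ids = np.arange(num_classes - num_novel, num_classes)  # placeholder ids

    scores = np.random.rand(5, num_classes)  # per-box classification scores

    # Alternative 1: label = argmax over all 1203 categories.
    labels_all = scores.argmax(axis=1)

    # Alternative 2: label = argmax restricted to the 337 novel categories.
    labels_novel = novel_ids[scores[:, novel_ids].argmax(axis=1)]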

    opened by wusize 5
  • Fix links

    Hi. Fixed links:

    • to 21K model and LVIS API licence in README.md
    • to UniDet repository in datasets/README.md
    • to Detic_DeformDETR_R50_2x config in docs/MODEL_ZOO.md
    cla signed 
    opened by amrzv 4
  • Can I extract just the prediction labels and the confidence level?

    Hi, thanks for making Detic available; there's some fantastic functionality in here.

    My issue is quite mundane: How can I get the predictor function to output the prediction classes as text strings with an associated confidence level? I'd like to iterate across a large number of images and record the results in a pandas dataframe or similar. Having them overlaid on an image is less useful for this.

    Thanks!
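
    A minimal sketch of one way to do this with a detectron2-style predictor follows; cfg and the class-name metadata are assumed to be set up as in demo.py, and the image list and metadata key are placeholders:

    import cv2
    import pandas as pd
    from detectron2.data import MetadataCatalog
    from detectron2.engine import DefaultPredictor

    predictor = DefaultPredictor(cfg)  # cfg built as in demo.py (assumption)
    class_names = MetadataCatalog.get(cfg.DATASETS.TEST[0]).thing_classes

    rows = []
    for path in ["image1.jpg", "image2.jpg"]:  # placeholder image list
        instances = predictor(cv2.imread(path))["instances"].to("cpu")
        for label, score in zip(instances.pred_classes.tolist(), instances.scores.tolist()):
            rows.append({"image": path, "class": class_names[label], "score": score})

    df = pd.DataFrame(rows)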

    opened by texturejc 4
  • Questions about CLIP embeddings

    Hi, I have some questions about using CLIP embeddings as classification weights.

    1. I think there is no code other than the model-loading code that loads CLIP embeddings as classification weights. Aren't the provided weights trained? (Are the CLIP embeddings only loaded?)

    2. I think it may also work if the FC layer of a classification network is used as the classification weights of the detector. Why did you choose to use CLIP embeddings instead of a pre-trained classification network?

    Thank you.

    opened by dk-hong 3
  • How can I get 1280-dimensional box_features?

    Hello.

    I was able to get box_features from a previous question.

    The dimension of these box_features was 1024, as specified in the config file.

    I'm trying to use these box_features in another pre-trained model, but the dimensions don't match, because the other pre-trained model expects 1280-dimensional features.

    I was wondering whether I need to re-train the model from scratch to get 1280-dimensional box_features, or whether there is another way to obtain them.

    Or, would it be a good approach to extend the dimension from 1024 to 1280 with an autoencoder or a PCA-style projection?
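
    As a rough illustration of that last idea, a learned linear adapter is one simple option; this is only a sketch of the general technique, not something provided by Detic, and the layer would still need to be trained for the downstream model:

    import torch
    import torch.nn as nn

    # Hypothetical adapter mapping 1024-d box features to a 1280-d space.
    adapter = nn.Linear(1024, 1280)

    box_features = torch.randn(100, 1024)  # placeholder for extracted features
    projected = adapter(box_features)      # shape: (100, 1280)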

    Thank you for your time to read this question!

    opened by yoojin9649 3
  • Screenshot source and capability to run on CPU only

    I have added a few features to the demo application. One is a screenshot video source, so that the screen itself can be used as video input. I also added an additional argument for specifying CPU-only execution, as I do not have a GPU on my Mac. Is this of interest? Should I prepare a PR with this?

    Screenshots like this can be run in the demo application.
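
    For reference, a minimal sketch of how CPU-only inference is typically forced in detectron2-based code; whether further Detic-specific changes are needed is an assumption:

    cfg.defrost()              # cfg built as in demo.py (assumption)
    cfg.MODEL.DEVICE = "cpu"   # run the model on CPU instead of CUDA
    cfg.freeze()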

    opened by joakimeriksson 3
  • DLWL reproduction

    Hi, in your repo there seem to be implementations for several loss variants, but I cannot find the implementation of DLWL, which uses an ADMM algorithm for label assignment. I would be grateful if you could provide it, since the authors of DLWL have not released their code. Thanks!

    opened by jihwanp 0
  • Unable to convert OpenImages annotations to COCO format

    Hello Xingyi Zhou, Thank you for your brilliant work.

    As stated in datasets/README.md, I followed the instructions from UniDet:

    We followed the instructions in UniDet to convert the metadata for OpenImages.

    But there are some problems.

    1. Running python tools/convert_datasets/convert_oid.py -p datasets/oid/ --subsets train fails with NameError: name 'image_label_sourcefile' is not defined
    2. Running python tools/convert_datasets/convert_oid.py -p datasets/oid/ --subsets val --expand_label fails with FileNotFoundError: [Errno 2] No such file or directory: 'datasets/oid/annotations/challenge-2019-validation-detection-human-imagelabels_expanded.csv'
    3. The URL for the pre-processed annotation files does not contain any annotation files, not even the old URL from the commit history.

    Our pre-processed annotation files can be directly downloaded here.

    Can anyone help?

    Similar issue: https://github.com/xingyizhou/UniDet/issues/2

    opened by wikiwen 0
  • What is the best confidence threshold for inference?

    Hello @facebookresearch, thank you for your fantastic work. I am a student, and I wonder what the best confidence threshold is for the inference phase (for both the COCO and ImageNet datasets).

    Thank you for your help; this value will benefit further research.

    Best, Tin

    opened by ngthanhtin 0
  • Upgrade to Cog version 0.1

    The new version of Cog improves the Python API, along with several other changes. In particular, pydantic is now used for the Predictor, and the previous version will be deprecated.

    This PR upgrades the Replicate demo and API to Cog version >= 0.1. I have already pushed this to Replicate, so you don't need to do anything for the demo to keep working :) https://replicate.com/facebookresearch/detic

    cla signed 
    opened by chenxwh 0
  • Possible data leak in lvis_v1_train_cat_info.json

    Hi, Xingyi!

    I loaded the file "lvis_v1_train_cat_info.json", and it seems to contain image_count for rare classes. This may lead to a data leak in the open-vocabulary setting when using the federated loss.

    opened by wusize 2