MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Overview


Website | Colab | Paper

This repository contains code and links to pre-trained models for MDETR (Modulated DETR), covering pre-training on image-text pairs with aligned box annotations, as well as fine-tuning on tasks that require fine-grained understanding of image and text.

We show large gains on the phrase grounding task (Flickr30k), on Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg), and on Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

MDETR

TL;DR. We depart from the fixed, frozen object detector used by several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are simply the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.
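The Colab linked above walks through this end to end: it loads a pre-trained model via torch.hub and runs detection conditioned on a free-form caption. The snippet below is a minimal sketch of that flow; the hub entry-point name, the two-pass forward call and the output keys follow the Colab notebook and may differ across versions, so treat it as illustrative rather than canonical.

import torch
import torchvision.transforms as T
from PIL import Image

# Load a pre-trained model through torch.hub (entry-point names follow the Colab,
# e.g. "mdetr_resnet101" or "mdetr_efficientnetB5").
model = torch.hub.load("ashkamath/mdetr:main", "mdetr_resnet101", pretrained=True)
model.eval()

# DETR-style preprocessing: resize, tensor conversion, ImageNet normalization.
transform = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

image = Image.open("example.jpg").convert("RGB")
caption = "a cat lying on a couch"
img = transform(image).unsqueeze(0)

# Two-pass forward as in the Colab: encode the image and text, then decode the queries.
with torch.no_grad():
    memory_cache = model(img, [caption], encode_and_save=True)
    outputs = model(img, [caption], encode_and_save=False, memory_cache=memory_cache)

# Each query is scored by how strongly it aligns to some token of the caption;
# the last logit is the "no object" bucket, so 1 minus its probability acts as a confidence.
scores = 1 - outputs["pred_logits"].softmax(-1)[0, :, -1]
keep = scores > 0.7
boxes = outputs["pred_boxes"][0, keep]  # normalized (cx, cy, w, h), one row per kept query
print(boxes, scores[keep])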

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Usage

The requirements file lists all the dependencies needed by MDETR.

We provide instructions on how to install them via conda. First, clone the repository locally:

git clone https://github.com/ashkamath/mdetr.git

Make a new conda env and activate it:

conda create -n mdetr_env python=3.8
conda activate mdetr_env

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit
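For example, a fine-tuning run can be spread across nodes with run_with_submitit.py. The flag values below are illustrative and mirror commands quoted in the comments section further down; the per-task instructions give the recommended settings:

python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 \
    --load https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth \
    --ngpus 1 --nodes 2 --ema --text_encoder_lr 1e-5 --lr 5e-5

Note that run_with_submitit.py looks for a checkpoint folder that is shared across nodes; if none is found it raises the "RuntimeError: No shared folder available" error discussed in the comments below.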

Pre-training

The links to the data, the steps for data preparation, and the training script can be found in the Pretraining Instructions. We also provide pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.

The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.

Backbone | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 | Url | Size
1 R101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 | model | 3GB
2 ENB3 | 59.5 | 76.6 | 82.9 | 57.6 | 70.2 | 56.7 | 53.8 | model | 2.4GB
3 ENB5 | 59.9 | 76.4 | 83.7 | 61.8 | 73.4 | 58.8 | 57.1 | model | 2.7GB

Downstream tasks

Phrase grounding on Flickr30k

Instructions for data preparation and the script to run evaluation can be found at Flickr30k Instructions.

AnyBox protocol

Backbone Pre-training Image Data Val R@1 Val R@5 Val R@10 Test R@1 Test R@5 Test R@10 url size
Resnet-101 COCO+VG+Flickr 82.5 92.9 94.9 83.4 93.5 95.3 model 3GB
EfficientNet-B3 COCO+VG+Flickr 82.9 93.2 95.2 84.0 93.8 95.6 model 2.4GB
EfficientNet-B5 COCO+VG+Flickr 83.6 93.4 95.1 84.3 93.9 95.8 model 2.7GB

MergedBox protocol

Backbone Pre-training Image Data Val R@1 Val R@5 Val R@10 Test R@1 Test R@5 Test R@10 url size
Resnet-101 COCO+VG+Flickr 82.3 91.8 93.7 83.8 92.7 94.4 model 3GB

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg

Instructions for data preparation and the scripts to run fine-tuning and evaluation can be found at Referring Expression Instructions.
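As a concrete reference, evaluation of a fine-tuned checkpoint goes through the same launcher. The command below mirrors one quoted in the comments section; the --eval flag (present in main.py's argument list) is assumed here to switch to evaluation-only mode, so defer to the Referring Expression Instructions for the exact invocation:

python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 \
    --resume https://zenodo.org/record/4721981/files/refcoco_resnet101_checkpoint.pth \
    --ngpus 1 --nodes 1 --ema --eval --test --test_type testA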

RefCOCO

Backbone Pre-training Image Data Val TestA TestB url size
Resnet-101 COCO+VG+Flickr 86.75 89.58 81.41 model 3GB
EfficientNet-B3 COCO+VG+Flickr 87.51 90.40 82.67 model 2.4GB

RefCOCO+

Backbone Pre-training Image Data Val TestA TestB url size
Resnet-101 COCO+VG+Flickr 79.52 84.09 70.62 model 3GB
EfficientNet-B3 COCO+VG+Flickr 81.13 85.52 72.96 model 2.4GB

RefCOCOg

Backbone Pre-training Image Data Val Test url size
Resnet-101 COCO+VG+Flickr 81.64 80.89 model 3GB
EfficientNet-B3 COCO+VG+Flickr 83.35 83.31 model 2.4GB

Referring expression segmentation on PhraseCut

Instructions for data preparation and the scripts to run fine-tuning and evaluation can be found at PhraseCut Instructions.

Backbone M-IoU Precision @0.5 Precision @0.7 Precision @0.9 url size
Resnet-101 53.1 56.1 38.9 11.9 model 1.5GB
EfficientNet-B3 53.7 57.5 39.9 11.9 model 1.2GB

Visual question answering on GQA

Instructions for data preparation and the scripts to run fine-tuning and evaluation can be found at GQA Instructions.

Backbone Test-dev Test-std url size
Resnet-101 62.48 61.99 model 3GB
EfficientNet-B5 62.95 62.45 model 2.7GB

Long-tailed few-shot object detection

Instructions for data preparation and the scripts to run fine-tuning and evaluation can be found at LVIS Instructions.

Data AP AP50 APr APc APf url size
1% 16.7 25.8 11.2 14.6 19.5 model 3GB
10% 24.2 38.0 20.9 24.9 24.3 model 3GB
100% 22.5 35.2 7.4 22.7 25.0 model 3GB

Synthetic datasets

Instructions to reproduce our results on the CLEVR-based datasets are available at CLEVR instructions.

Overall Accuracy | Count | Exist | Compare Number | Query Attribute | Compare Attribute | Url | Size
99.7 | 99.3 | 99.9 | 99.4 | 99.9 | 99.9 | model | 446MB

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful, please give it a star and cite as follows:

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }
Comments
  • cannot do distributed training


    Hello,

    Thanks for open sourcing!

    I am trying to run distributed training for pretraining. Without distributed training, it works fine.

    With distributed training I get the error below with PyTorch versions 1.7.0, 1.7.1 and 1.8.0. Version 1.9 instead gives ImportError: cannot import name '_new_empty_tensor' from 'torchvision.ops' (/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/ops/__init__.py).

    I tried changing this line to losses.backward(retain_graph=True), but it did not fix the issue. Let me know if you have any suggestions on how to address it.

    Traceback (most recent call last):
      File "main.py", line 643, in <module>
        main(args)
      File "main.py", line 546, in main
        train_stats = train_one_epoch(
      File "/work/vcirik/mdetr/engine.py", line 100, in train_one_epoch
        losses.backward()
      File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
        torch.autograd.backward(self, gradient, retain_graph, create_graph)
      File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
        Variable._execution_engine.run_backward(
    RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 10]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
    
    opened by volkancirik 22
  • Pre-trained model gives nan prediction in Colab


    Thank you for making this great work public! I'm trying to run the Colab notebook but get no predictions on the test image. I printed out the model's predictions and found they are all nan. Can you please check this?


    opened by Steve-Tod 6
  • pretrain performance


    Hi,

    Thanks for your great work.

    When I tried to reproduce the pretraining performance, I found that the results did not match the paper, especially for RefCOCO.

    Any help would be much appreciated.

    |  | GQA AP | Flickr AP | Flickr R@1 | Refcoco AP | Refcoco R@1 | Refcoco+ R@1 | Refcocog R@1 |
    | -- | -- | -- | -- | -- | -- | -- | -- |
    | Res101 | 58.9 | 75.6 | 82.5 | 60.3 | 72.1 | 58.0 | 55.7 |
    | reproduced | 58.6 | 75.7 | 82.9 | 56.5 | 70.2 | 55.3 | 54.2 |

    opened by ShoufaChen 6
  • OOM error when evaluate detection on lvis minival


    Hello, I am facing an out of memory problem when testing with the eval_lvis.py file.
    My hardware setup is 8 x 1080 Ti GPUs with PyTorch 1.5. I have managed to run the training code with batch size 1, but when I try to evaluate detection performance on LVIS, I get an out-of-memory error at line 76 of util/dist.py.
    Can you help me with this problem? Thank you in advance.

    opened by Flaick 6
  • How to run fine-tuning on VQA2 dataset?


    Experiments with the VQA v2 dataset are described in Appendix E of the paper, but it's not clear from main.py and run_with_submitit.py how to run the fine-tuning (I tried the same command that is used for fine-tuning on CLEVR). I also found vqa_coco_format.py, but it seems to be data preparation rather than the fine-tuning itself. Also, the build_dataset function in main.py does not seem to handle VQA v2 :( Could you please explain how to do this?

    UPD 1 (09.27.21): I've downloaded COCO and VQA v2 datasets and ran

    python scripts/fine-tuning/vqa_coco_format.py --data_path VQA_v2_dataset/ --img_path COCO_dataset/images/ --coco_path COCO_dataset/
    

    The processing finished correctly. Now I'm working out how to write the VQA v2 dataset script...

    UPD 2 (10.03.21) It seems I managed to implement all the necessary classes and fix the code. I'm currently doing an experiment eval -> train on vqa2 -> eval. As soon as it successfully finishes I'll push the code into my fork of the repo.

    UPD 3 (10.03.21) Yeah, it works! Here is the link: https://github.com/TopCoder2K/mdetr. I haven't written any documentation because I'm not sure that fine-tuning on VQA is useful to anybody.)) If you have any question, please ask here :)

    opened by TopCoder2K 5
  • The evaluation of referring expression


    Hi,

    I followed the evaluation instructions for referring expressions on the COCO train2014 dataset. But when I passed the test args "!python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 --resume https://zenodo.org/record/4721981/files/refcoco_resnet101_checkpoint.pth --ngpus 1 --nodes 1 --ema --test --test_type testA", I didn't get any precision or recall results, only: "Start training Training time 0:00:00 submitit INFO (2021-07-17 11:30:17,492) - Job completed successfully"

    I also downloaded the COCO val2014 and test2014 datasets, but I am not sure whether I need them, because I get an error when I pass them.

    Thanks a lot in advance!

    Best,

    opened by Jiang15 5
  • RuntimeError: No shared folder available


    When I run python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 --load https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth?download=1 --ngpus 1 --nodes 2 --ema --text_encoder_lr 1e-5 --lr 5e-5, the following error occurred:

    Traceback (most recent call last):
      File "run_with_submitit.py", line 171, in <module>
        main()
      File "run_with_submitit.py", line 130, in main
        args.job_dir = get_shared_folder(args) / "%j"
      File "run_with_submitit.py", line 41, in get_shared_folder
        raise RuntimeError("No shared folder available")
    RuntimeError: No shared folder available

    How should I deal with this?

    opened by qumengxue 5
  • Loss increases during pretraining


    Hi @alcinos, @ashkamath, @nguyeho7,

    I hope you are doing good.

    I was trying to pretrain MDETR using the provided instructions. I noticed that the loss started increasing during the 20th epoch: it kept decreasing to around 39 until the 19th epoch and then jumped to around 77 after the 20th epoch. What could be the reason for this? Note that I am using the EfficientNetB5 backbone. The log.txt is attached.

    Thanks

    log.txt

    opened by mmaaz60 4
  • Missing finetune_phrasecut_test.json


    When testing on the PhraseCut dataset, I hit the following error:

    FileNotFoundError: [Errno 2] No such file or directory: '/data16t/data/referring-segmentation/Pre-processed-annotations/finetune_phrasecut_test.json'
    

    There is no such file in the provided mdetr_annotations.tar.gz. I really hope you can release this file. Thank you very much.

    opened by colorblank 4
  • Missing pretrained checkpoint for ResNet101


    Hi, in mdetr/models/backbone.py, the GroupNormBackbone class loads a pretrained checkpoint:

    name_map = {
                "resnet50-gn": ("resnet50", "/checkpoint/szagoruyko/imagenet/22014122/checkpoint.pth"),
                "resnet101-gn": ("resnet101", "/checkpoint/szagoruyko/imagenet/22080524/checkpoint.pth"),
    }
    

    It seems these paths point to your own disk; where can I download these .pth files?

    opened by sean-zhuh 3
  • Doubts regarding pretraining


    As I understand from the paper, you pretrained the whole MDETR model on image-text pairs. Did you try pretraining only the encoder, as in the VisualBERT model?

    opened by IISCAditayTripathi 3
  • How to generate "tokens_negative" and "tokens_positive" when we convert our own dataset into mdetr annotations?

    Hi, thanks for the open-source code and annotations.

    I am confused about how to generate "tokens_negative" and "tokens_positive" in the annotation. For example,

    in 'images': {'file_name': 'COCO_train2014_000000580957.jpg', 'height': 428, 'width': 640, 'id': 120624, 'original_id': 580957, 'caption': 'bowl behind the others can only see part', 'dataset_name': 'refcoco', 'tokens_negative': [[0, 4], [5, 11], [23, 26], [27, 31], [32, 35], [36, 40]]}

    I couldn't understand the meaning of "[[0, 4], [5, 11], [23, 26], [27, 31], [32, 35], [36, 40]]".

    in 'annotations': {'area': 17770.195949999998, 'iscrowd': 0, 'image_id': 120624, 'category_id': 51, 'id': 120624, 'bbox': [468.3, 0.91, 171.7, 116.12], 'original_id': 1537681, 'tokens_positive': [[36, 40]]}

    I couldn't understand the meaning of "[[36, 40]]".

    I would be very grateful if you could help me understand!

    opened by QiuHeqian 0
  • ValueError: char_to_token() is not available when using Python based tokenizers


    error log:


    Namespace(aux_loss=True, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='/data_SSD1/lhxiao/transvg/ln_data/other/images/mscoco/images/', combine_datasets=['refexp'], combine_datasets_val=['refexp'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/refcoco.json', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, do_qa=False, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=-1, epochs=5, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, load='/data_SSD1/lhxiao/mdetr/checkpoint/pretrain/pretrained_resnet101_checkpoint.pth', lr=5e-05, lr_backbone=1e-05, lr_drop=3, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=2, optimizer='adam', output_dir='/data_SSD1/lhxiao/mdetr/output/v01', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=1, rank=0, refexp_ann_path='/data_SSD1/lhxiao/mdetr/mdetr_annotations/OpenSource/', refexp_dataset_name='refcoco', remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=False, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=1e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='', weight_decay=0.0001, world_size=2)

    Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight']

    • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
    • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

    number of params: 185160324
    loading annotations into memory... Done (t=1.61s) creating index... index created!
    loading annotations into memory... Done (t=0.09s) creating index... index created!
    loading from /data_SSD1/lhxiao/mdetr/checkpoint/pretrain/pretrained_resnet101_checkpoint.pth
    Start training
    Starting epoch 0
    /home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/position_encoding.py:41: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
      dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)

    Traceback (most recent call last):
      File "main.py", line 591, in <module>
        main(args)
      File "main.py", line 494, in main
        train_stats = train_one_epoch(
      File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/engine.py", line 73, in train_one_epoch
        loss_dict.update(criterion(outputs, targets, positive_map))
      File "/home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 679, in forward
        losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
      File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 655, in get_loss
        return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
      File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 518, in loss_contrastive_align
        beg_pos = tokenized.char_to_token(i, beg)
      File "/home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 547, in char_to_token
        raise ValueError("char_to_token() is not available when using Python based tokenizers")
    ValueError: char_to_token() is not available when using Python based tokenizers

    (The second worker process fails with the same ValueError.)

    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3789675 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3789672) of binary: /home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/bin/python

    env :

    Name Version Build Channel _libgcc_mutex 0.1 conda_forge https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge _openmp_mutex 4.5 2_gnu https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge bzip2 1.0.8 h7f98852_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge ca-certificates 2022.12.7 ha878542_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge certifi 2022.12.7 pypi_0 pypi charset-normalizer 2.1.1 pypi_0 pypi click 8.1.3 pypi_0 pypi cloudpickle 2.2.0 pypi_0 pypi coloredlogs 15.0.1 pypi_0 pypi contourpy 1.0.6 pypi_0 pypi cycler 0.11.0 pypi_0 pypi cython 0.29.32 pypi_0 pypi filelock 3.8.2 pypi_0 pypi flatbuffers 22.12.6 pypi_0 pypi fonttools 4.38.0 pypi_0 pypi huggingface-hub 0.0.8 pypi_0 pypi humanfriendly 10.0 pypi_0 pypi idna 3.4 pypi_0 pypi joblib 1.2.0 pypi_0 pypi kiwisolver 1.4.4 pypi_0 pypi ld_impl_linux-64 2.39 hcc3a1bd_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libffi 3.4.2 h7f98852_5 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libgcc-ng 12.2.0 h65d4601_19 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libgomp 12.2.0 h65d4601_19 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libnsl 2.0.0 h7f98852_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libsqlite 3.40.0 h753d276_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libuuid 2.32.1 h7f98852_1000 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge libzlib 1.2.13 h166bdaf_4 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge matplotlib 3.6.2 pypi_0 pypi mpmath 1.2.1 pypi_0 pypi ncurses 6.3 h27087fc_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge numpy 1.23.5 pypi_0 pypi onnx 1.13.0 pypi_0 pypi onnxruntime 1.13.1 pypi_0 pypi openssl 3.0.7 h0b41bf4_1 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge packaging 22.0 pypi_0 pypi panopticapi 0.1 pypi_0 pypi pillow 9.3.0 pypi_0 pypi pip 22.3.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge prettytable 3.5.0 pypi_0 pypi protobuf 3.20.3 pypi_0 pypi pycocotools 2.0 pypi_0 pypi pyparsing 3.0.9 pypi_0 pypi python 3.8.15 h4a9ceb5_0_cpython https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge python-dateutil 2.8.2 pypi_0 pypi pyyaml 6.0 pypi_0 pypi readline 8.1.2 h0f457ee_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge regex 2022.10.31 pypi_0 pypi requests 2.28.1 pypi_0 pypi sacremoses 0.0.53 pypi_0 pypi scipy 1.9.3 pypi_0 pypi setuptools 65.5.1 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge six 1.16.0 pypi_0 pypi submitit 1.4.5 pypi_0 pypi sympy 1.11.1 pypi_0 pypi timm 0.6.12 pypi_0 pypi tk 8.6.12 h27826a3_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge tokenizers 0.10.3 pypi_0 pypi torch 1.11.0+cu113 pypi_0 pypi torchaudio 0.11.0+cu113 pypi_0 pypi torchvision 0.12.0+cu113 pypi_0 pypi tqdm 4.64.1 pypi_0 pypi transformers 4.6.0 pypi_0 pypi typing-extensions 4.4.0 pypi_0 pypi urllib3 1.26.13 pypi_0 pypi wcwidth 0.2.5 pypi_0 pypi wheel 0.38.4 pyhd8ed1ab_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge xmltodict 0.13.0 pypi_0 pypi xz 5.2.6 h166bdaf_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge

    opened by linhuixiao 1
  • Bbox assertion error when using ENB models (eval & pretrain as well): assert (boxes1[:, 2:] >= boxes1[:, :2]).all()


    Hi, I am using your requirements file with the same libraries, but I am getting this bbox assertion error only when using the ENB models (ENB3 & ENB5); everything is fine with the ResNet backbone.

    It seems that the bbox predictions are all NaN. I've seen this error reported for DETR, but with no clear solution (I tried different learning rates and batch sizes).

    In eval it appears right away, but in pretraining it occurs randomly, at different iterations.

    Epoch: [0] [ 940/78534] eta: 8:58:29 lr: 0.000100 lr_backbone: 0.000010 lr_text_encoder: 0.000001 loss: 84.7997 (98.7713) loss_bbox: 1.9616 (3.2480) loss_bbox_0: 2.0860 (3.2589) loss_bbox_1: 2.0562 (3.2474) loss_bbox_2: 1.9565 (3.2655) loss_bbox_3: 2.0650 (3.2771) loss_bbox_4: 1.9537 (3.2593) loss_ce: 10.5923 (11.1107) loss_ce_0: 10.5793 (11.1826) loss_ce_1: 10.5315 (11.0932) loss_ce_2: 10.4743 (11.1262) loss_ce_3: 10.5987 (11.1445) loss_ce_4: 10.5877 (11.0967) loss_giou: 1.7100 (2.0738) loss_giou_0: 1.8241 (2.0809) loss_giou_1: 1.7486 (2.0892) loss_giou_2: 1.7185 (2.0759) loss_giou_3: 1.8257 (2.0803) loss_giou_4: 1.6419 (2.0610) cardinality_error_unscaled: 4.8750 (5.8658) cardinality_error_0_unscaled: 4.8750 (7.5942) cardinality_error_1_unscaled: 4.8750 (6.0007) cardinality_error_2_unscaled: 4.8750 (6.0588) cardinality_error_3_unscaled: 4.8750 (5.8966) cardinality_error_4_unscaled: 4.8750 (5.8688) loss_bbox_unscaled: 0.3923 (0.6496) loss_bbox_0_unscaled: 0.4172 (0.6518) loss_bbox_1_unscaled: 0.4112 (0.6495) loss_bbox_2_unscaled: 0.3913 (0.6531) loss_bbox_3_unscaled: 0.4130 (0.6554) loss_bbox_4_unscaled: 0.3907 (0.6519) loss_ce_unscaled: 10.5923 (11.1107) loss_ce_0_unscaled: 10.5793 (11.1826) loss_ce_1_unscaled: 10.5315 (11.0932) loss_ce_2_unscaled: 10.4743 (11.1262) loss_ce_3_unscaled: 10.5987 (11.1445) loss_ce_4_unscaled: 10.5877 (11.0967) loss_giou_unscaled: 0.8550 (1.0369) loss_giou_0_unscaled: 0.9121 (1.0405) loss_giou_1_unscaled: 0.8743 (1.0446) loss_giou_2_unscaled: 0.8593 (1.0379) loss_giou_3_unscaled: 0.9129 (1.0401) loss_giou_4_unscaled: 0.8209 (1.0305) time: 0.4399 data: 0.0077 max mem: 10505

    Traceback (most recent call last):
      File "main.py", line 646, in <module>
        main(args)
      File "main.py", line 549, in main
        train_stats = train_one_epoch(
      File "/home/ubuntu/efs/users/oignat/internship/mdetr/engine.py", line 72, in train_one_epoch
        loss_dict.update(criterion(outputs, targets, positive_map))
      File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/mdetr.py", line 666, in forward
        indices = self.matcher(outputs_without_aux, targets, positive_map)
      File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
        return func(*args, **kwargs)
      File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/matcher.py", line 75, in forward
        cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
      File "/home/ubuntu/efs/users/oignat/internship/mdetr/util/box_ops.py", line 51, in generalized_box_iou
        assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

    opened by OanaIgnat 1
  • How to define "Negative Tokens"?

    Hello,

    I am wondering how to define "negative tokens" in the dataset_dict['images'][0]['tokens_negative'].

    Is there an algorithm or rule for constructing the "negative tokens"?

    I couldn't figure out the pattern behind them!

    Thanks,

    Best Regards,

    Eric.

    opened by jeantirole 0
  • Pre-training question


    Hi, I am very interested in your work. I want to change the CNN backbone of the model and train the whole model from scratch. Could you please point me to where the required changes should be made, and let me know what else to keep in mind?

    opened by hrituraj007 0
Owner
Aishwarya Kamath
Find me @ ashkamath.github.io