MDETR: Modulated Detection for End-to-End Multi-Modal Understanding

Aishwarya Kamath

Last update: Dec 28, 2022

Overview

This repository contains code and links to pre-trained models for MDETR (Modulated DETR), covering pre-training on data with aligned text and images with box annotations, as well as fine-tuning on tasks requiring fine-grained understanding of image and text.

We show big gains on the phrase grounding task (Flickr30k), Referring Expression Comprehension (RefCOCO, RefCOCO+ and RefCOCOg) as well as Referring Expression Segmentation (PhraseCut, CLEVR Ref+). We also achieve competitive performance on visual question answering (GQA, CLEVR).

TL;DR. We depart from the fixed frozen object detector approach of several popular vision + language pre-trained models and achieve true end-to-end multi-modal understanding by training our detector in the loop. In addition, we only detect objects that are relevant to the given text query, where the class labels for the objects are just the relevant words in the text query. This allows us to expand our vocabulary to anything found in free-form text, making it possible to detect and reason over novel combinations of object classes and attributes.

For details, please see the paper: MDETR - Modulated Detection for End-to-End Multi-Modal Understanding by Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve and Nicolas Carion.

Aishwarya Kamath and Nicolas Carion made equal contributions to this codebase.

Usage

The requirements file has all the dependencies that are needed by MDETR.

We provide instructions for installing the dependencies via conda. First, clone the repository locally:

git clone https://github.com/ashkamath/mdetr.git

Make a new conda env and activate it:

conda create -n mdetr_env python=3.8
conda activate mdetr_env

Install the packages listed in requirements.txt:

pip install -r requirements.txt

Multinode training

Distributed training is available via Slurm and submitit:

pip install submitit

Pre-training

The links to data, steps for data preparation and the script for running finetuning can be found in Pretraining Instructions. We also provide the pre-trained model weights for MDETR trained on our combined aligned dataset of 1.3 million images paired with text.

The models are summarized in the following table. Note that the performance reported is "raw", without any fine-tuning. For each dataset, we report the class-agnostic box AP@50, which measures how well the model finds the boxes mentioned in the text. All performances are reported on the respective validation sets of each dataset.

	Backbone	GQA AP	Flickr AP	Flickr R@1	Refcoco AP	Refcoco R@1	Refcoco+ R@1	Refcocog R@1	Url	Size
1	R101	58.9	75.6	82.5	60.3	72.1	58.0	55.7	model	3GB
2	ENB3	59.5	76.6	82.9	57.6	70.2	56.7	53.8	model	2.4GB
3	ENB5	59.9	76.4	83.7	61.8	73.4	58.8	57.1	model	2.7GB
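
As a rough illustration of what the "@50" in AP@50 and the recall figures (R@1) rest on: a predicted box is commonly counted as correct when its intersection-over-union (IoU) with the ground-truth box is at least 0.5. The sketch below is not taken from the MDETR evaluation code. The function names `box_iou` and `recall_at_1` are hypothetical, and boxes are assumed to be in `[x0, y0, x1, y1]` format.

```python
from typing import List, Sequence

Box = Sequence[float]  # assumed format: [x0, y0, x1, y1]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(top_preds: List[Box], targets: List[Box], thresh: float = 0.5) -> float:
    """Fraction of targets whose top-ranked prediction reaches the IoU threshold."""
    hits = sum(box_iou(p, t) >= thresh for p, t in zip(top_preds, targets))
    return hits / len(targets)


if __name__ == "__main__":
    preds = [[0, 0, 2, 2], [0, 0, 1, 1]]
    gts = [[0, 0, 2, 2], [5, 5, 6, 6]]
    print(recall_at_1(preds, gts))  # one of the two top predictions matches
```

AP additionally ranks all predictions by confidence and integrates precision over recall; the IoU test above is only the per-box matching step.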

Downstream tasks

Phrase grounding on Flickr30k

Instructions for data preparation and script to run evaluation can be found at Flickr30k Instructions

AnyBox protocol

Backbone	Pre-training Image Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10	url	size
Resnet-101	COCO+VG+Flickr	82.5	92.9	94.9	83.4	93.5	95.3	model	3GB
EfficientNet-B3	COCO+VG+Flickr	82.9	93.2	95.2	84.0	93.8	95.6	model	2.4GB
EfficientNet-B5	COCO+VG+Flickr	83.6	93.4	95.1	84.3	93.9	95.8	model	2.7GB

MergedBox protocol

Backbone	Pre-training Image Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10	url	size
Resnet-101	COCO+VG+Flickr	82.3	91.8	93.7	83.8	92.7	94.4	model	3GB

Referring expression comprehension on RefCOCO, RefCOCO+, RefCOCOg

Instructions for data preparation and script to run finetuning and evaluation can be found at Referring Expression Instructions

RefCOCO

Backbone	Pre-training Image Data	Val	TestA	TestB	url	size
Resnet-101	COCO+VG+Flickr	86.75	89.58	81.41	model	3GB
EfficientNet-B3	COCO+VG+Flickr	87.51	90.40	82.67	model	2.4GB

RefCOCO+

Backbone	Pre-training Image Data	Val	TestA	TestB	url	size
Resnet-101	COCO+VG+Flickr	79.52	84.09	70.62	model	3GB
EfficientNet-B3	COCO+VG+Flickr	81.13	85.52	72.96	model	2.4GB

RefCOCOg

Backbone	Pre-training Image Data	Val	Test	url	size
Resnet-101	COCO+VG+Flickr	81.64	80.89	model	3GB
EfficientNet-B3	COCO+VG+Flickr	83.35	83.31	model	2.4GB

Referring expression segmentation on PhraseCut

Instructions for data preparation and script to run finetuning and evaluation can be found at PhraseCut Instructions

Backbone	M-IoU	Precision @0.5	Precision @0.7	Precision @0.9	url	size
Resnet-101	53.1	56.1	38.9	11.9	model	1.5GB
EfficientNet-B3	53.7	57.5	39.9	11.9	model	1.2GB

Visual question answering on GQA

Instructions for data preparation and scripts to run finetuning and evaluation can be found at GQA Instructions

Backbone	Test-dev	Test-std	url	size
Resnet-101	62.48	61.99	model	3GB
EfficientNet-B5	62.95	62.45	model	2.7GB

Long-tailed few-shot object detection

Instructions for data preparation and scripts to run finetuning and evaluation can be found at LVIS Instructions

Data	AP	AP50	APr	APc	APf	url	size
1%	16.7	25.8	11.2	14.6	19.5	model	3GB
10%	24.2	38.0	20.9	24.9	24.3	model	3GB
100%	22.5	35.2	7.4	22.7	25.0	model	3GB

Synthetic datasets

Instructions to reproduce our results on CLEVR-based datasets are available at CLEVR instructions

Overall Accuracy	Count	Exist	Compare Number	Query Attribute	Compare Attribute	Url	Size
99.7	99.3	99.9	99.4	99.9	99.9	model	446MB

License

MDETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

If you find this repository useful, please give it a star and cite it as follows! :)

    @article{kamath2021mdetr,
      title={MDETR--Modulated Detection for End-to-End Multi-Modal Understanding},
      author={Kamath, Aishwarya and Singh, Mannat and LeCun, Yann and Misra, Ishan and Synnaeve, Gabriel and Carion, Nicolas},
      journal={arXiv preprint arXiv:2104.12763},
      year={2021}
    }

Comments

cannot do distributed training

Hello,

Thanks for open sourcing!

I am trying to run distributed training for pre-training. Without distributed training, it works fine.

With PyTorch versions 1.7.0, 1.7.1 and 1.8.0 I get the error below. With version 1.9 I instead get ImportError: cannot import name '_new_empty_tensor' from 'torchvision.ops' (/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torchvision/ops/__init__.py).

I tried changing this line to losses.backward(retain_graph=True), but that did not fix it. Let me know if you have any suggestions on how to address this issue.

Traceback (most recent call last):
  File "main.py", line 643, in <module>
    main(args)
  File "main.py", line 546, in main
    train_stats = train_one_epoch(
  File "/work/vcirik/mdetr/engine.py", line 100, in train_one_epoch
    losses.backward()
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/tensor.py", line 221, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/work/vcirik/anaconda3/envs/mdetr/lib/python3.8/site-packages/torch/autograd/__init__.py", line 130, in backward
    Variable._execution_engine.run_backward(
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.LongTensor [1, 10]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

opened by volkancirik 22

Pre-trained model gives nan prediction in Colab

Thank you for making this great work public! I'm trying to run the colab notebook but found no predictions on the test image. Then I print out the predictions of the model and found all are nan. Can you please check this?

opened by Steve-Tod 6
pretrain performance

Hi,

Thanks for your great work.

When I tried to reproduce the pre-trained performance, I found that my results did not match those in the paper, especially for Refcoco.

Any help would be much appreciated.

	GQA AP	Flickr AP	Flickr R@1	Refcoco AP	Refcoco R@1	Refcoco+ R@1	Refcocog R@1
Res101	58.9	75.6	82.5	60.3	72.1	58.0	55.7
reproduced	58.6	75.7	82.9	56.5	70.2	55.3	54.2

opened by ShoufaChen 6
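
To make the size of the gap in the table above explicit, the per-metric differences can be computed directly from the two rows. This is plain arithmetic on the numbers quoted in the issue; the variable and function names are only illustrative.

```python
metrics = ["GQA AP", "Flickr AP", "Flickr R@1", "Refcoco AP",
           "Refcoco R@1", "Refcoco+ R@1", "Refcocog R@1"]
reported = [58.9, 75.6, 82.5, 60.3, 72.1, 58.0, 55.7]    # "Res101" row
reproduced = [58.6, 75.7, 82.9, 56.5, 70.2, 55.3, 54.2]  # "reproduced" row


def metric_deltas(names, ref, new):
    """Return {metric: new - ref}, rounded to one decimal place."""
    return {n: round(b - a, 1) for n, a, b in zip(names, ref, new)}


if __name__ == "__main__":
    for name, delta in metric_deltas(metrics, reported, reproduced).items():
        print(f"{name:13s} {delta:+.1f}")
```

The GQA and Flickr numbers are within half a point of the reported ones, while the three Refcoco-family metrics and Refcoco AP are lower by 1.5 to 3.8 points.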
OOM error when evaluating detection on LVIS minival

Hello, I am facing an out-of-memory problem when testing with the eval_lvis.py file.
My hardware setup is eight 1080 Ti GPUs with PyTorch 1.5. I have managed to run the training code successfully with a batch size of 1, but when I try to evaluate detection performance on LVIS, I get an out-of-memory error at line 76 of util/dist.py.
Can you help me with this problem? Thank you in advance.

opened by Flaick 6
How to run fine-tuning on VQA2 dataset?
Experiments with the VQA v2 dataset are described in Appendix E of the paper. But it's not clear from the main.py and run_with_submitit.py files how to run the fine-tuning (I've tried writing the same command that is used for fine-tuning on CLEVR). I've also found vqa_coco_format.py, but it seems to handle data preparation, not fine-tuning itself. I also noticed that main.py uses the build_dataset function, and I don't see VQA v2 in that function :( Could you please explain how to do this?

UPD 1 (09.27.21): I've downloaded COCO and VQA v2 datasets and ran

python scripts/fine-tuning/vqa_coco_format.py --data_path VQA_v2_dataset/ --img_path COCO_dataset/images/ --coco_path COCO_dataset/

And the processing has finished correctly. Now I'm thinking how to write VQA v2 dataset script...

UPD 2 (10.03.21) It seems I managed to implement all the necessary classes and fix the code. I'm currently doing an experiment eval -> train on vqa2 -> eval. As soon as it successfully finishes I'll push the code into my fork of the repo.

UPD 3 (10.03.21) Yeah, it works! Here is the link: https://github.com/TopCoder2K/mdetr. I haven't written any documentation because I'm not sure that fine-tuning on VQA is useful to anybody.)) If you have any question, please ask here :)
opened by TopCoder2K 5
The evaluation of referring expression

Hi,

I followed the instruction of evaluation for referring expression on COCO dataset train2014. But when I passed the args for test "!python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 --resume https://zenodo.org/record/4721981/files/refcoco_resnet101_checkpoint.pth --ngpus 1 --nodes 1 --ema --test --test_type testA", I didn't get any result of precision or recall, only: "Start training Training time 0:00:00 submitit INFO (2021-07-17 11:30:17,492) - Job completed successfully"

I also downloaded the val2014 and test2014 splits of the COCO dataset, but I am not sure whether I need them, because I got an error when I passed these datasets.

Thanks a lot in advance!

Best,

opened by Jiang15 5
RuntimeError: No shared folder available

When I run

python run_with_submitit.py --dataset_config configs/refcoco.json --batch_size 4 --load https://zenodo.org/record/4721981/files/pretrained_resnet101_checkpoint.pth?download=1 --ngpus 1 --nodes 2 --ema --text_encoder_lr 1e-5 --lr 5e-5

the following error occurred:

Traceback (most recent call last):
  File "run_with_submitit.py", line 171, in <module>
    main()
  File "run_with_submitit.py", line 130, in main
    args.job_dir = get_shared_folder(args) / "%j"
  File "run_with_submitit.py", line 41, in get_shared_folder
    raise RuntimeError("No shared folder available")
RuntimeError: No shared folder available

How can I deal with it?

opened by qumengxue 5
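
The traceback above shows the error being raised inside `get_shared_folder` in run_with_submitit.py. The body of that function is not shown here, so the following is only a sketch under the assumption that it looks for a cluster-specific shared directory and raises when none is found. The function name `get_shared_folder_fallback` and the environment variable `MDETR_SHARED_DIR` are hypothetical and not part of the repository.

```python
import os
from pathlib import Path
from typing import Optional


def get_shared_folder_fallback(override: Optional[str] = None) -> Path:
    """Return a job directory that every node can see.

    Hypothetical helper: the directory is taken from an explicit argument or
    from the (made-up) MDETR_SHARED_DIR environment variable.
    """
    candidate = override or os.environ.get("MDETR_SHARED_DIR")
    if not candidate:
        # Same message as in the traceback, raised when nothing is configured.
        raise RuntimeError("No shared folder available")
    path = Path(candidate).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    print(get_shared_folder_fallback("~/mdetr_experiments"))
```

For multi-node runs the chosen directory has to live on a filesystem mounted on all nodes, since Slurm jobs write their logs and checkpoints there.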
Loss increases during pretraining

Hi @alcinos, @ashkamath, @nguyeho7,

I hope you are doing good.

I was trying to pretrain MDETR using the provided instructions. What I noticed is that loss started increasing during the 20th epoch. It kept decreasing to around 39 till the 19th epoch and jumped to around 77 after the 20th epoch. What could be the reason for this? Note that I am using the EfficientNetB5 backbone. The log.txt is attached.

Thanks

log.txt

opened by mmaaz60 4
Missing finetune_phrasecut_test.json
When testing on the PhraseCut dataset, I ran into an error:

FileNotFoundError: [Errno 2] No such file or directory: '/data16t/data/referring-segmentation/Pre-processed-annotations/finetune_phrasecut_test.json'

There is no such file in the provided mdetr_annotations.tar.gz. I really hope you can release this file. Thank you very much.
opened by colorblank 4
Missing pretrained checkpoint for ResNet101
Hi, in mdetr/models/backbone.py, the class GroupNormBackbone loads a pretrained checkpoint:

name_map = {
    "resnet50-gn": ("resnet50", "/checkpoint/szagoruyko/imagenet/22014122/checkpoint.pth"),
    "resnet101-gn": ("resnet101", "/checkpoint/szagoruyko/imagenet/22080524/checkpoint.pth"),
}

It seems these paths point to your own disk. Where can I download these .pth files?
opened by sean-zhuh 3
Doubts regarding pretraining

As I understand from the paper, you pre-trained the whole MDETR model on image-text pairs. Did you try pre-training only the encoder, as in the Visual-BERT model?

opened by IISCAditayTripathi 3
How to generate "tokens_negative" and "tokens_positive" when we convert our own dataset into mdetr annotations?

Hi, thanks for the open-source code and annotations.

I am confused about how to generate "tokens_negative" and "tokens_positive" in the annotation. For example,

in 'images': {'file_name': 'COCO_train2014_000000580957.jpg', 'height': 428, 'width': 640, 'id': 120624, 'original_id': 580957, 'caption': 'bowl behind the others can only see part', 'dataset_name': 'refcoco', 'tokens_negative': [[0, 4], [5, 11], [23, 26], [27, 31], [32, 35], [36, 40]]}

I couldn't understand the meaning of "[[0, 4], [5, 11], [23, 26], [27, 31], [32, 35], [36, 40]]".

in 'annotations': {'area': 17770.195949999998, 'iscrowd': 0, 'image_id': 120624, 'category_id': 51, 'id': 120624, 'bbox': [468.3, 0.91, 171.7, 116.12], 'original_id': 1537681, 'tokens_positive': [[36, 40]]}

I couldn't understand the meaning of "[[36, 40]]".

I will be very grateful if you could help me to understand!

opened by QiuHeqian 0
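
The numbers quoted in this question are consistent with reading each pair as a `[begin, end)` character span into the `caption` string. That reading is inferred from the example itself (slicing the caption with the pairs yields whole words), not from MDETR documentation, and it does not explain how the spans should be chosen for a new dataset. The helper name `spans_to_text` is hypothetical.

```python
from typing import List

caption = "bowl behind the others can only see part"
tokens_negative = [[0, 4], [5, 11], [23, 26], [27, 31], [32, 35], [36, 40]]
tokens_positive = [[36, 40]]


def spans_to_text(text: str, spans: List[List[int]]) -> List[str]:
    """Slice `text` with each [begin, end) character span."""
    return [text[begin:end] for begin, end in spans]


if __name__ == "__main__":
    print(spans_to_text(caption, tokens_negative))
    # ['bowl', 'behind', 'can', 'only', 'see', 'part']
    print(spans_to_text(caption, tokens_positive))
    # ['part']
```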
ValueError: char_to_token() is not available when using Python based tokenizers
Error log:

Namespace(aux_loss=True, backbone='resnet101', batch_size=4, bbox_loss_coef=5, ce_loss_coef=1, clevr_ann_path='', clevr_img_path='', clip_max_norm=0.1, coco_path='/data_SSD1/lhxiao/transvg/ln_data/other/images/mscoco/images/', combine_datasets=['refexp'], combine_datasets_val=['refexp'], contrastive_align_loss=True, contrastive_align_loss_coef=1, contrastive_loss=False, contrastive_loss_coef=0.1, contrastive_loss_hdim=64, dataset_config='configs/refcoco.json', dec_layers=6, device='cuda', dice_loss_coef=1, dilation=False, dim_feedforward=2048, dist_backend='nccl', dist_url='env://', distributed=True, do_qa=False, dropout=0.1, ema=True, ema_decay=0.9998, enc_layers=6, eos_coef=0.1, epoch_chunks=-1, epochs=5, eval=False, eval_skip=1, fraction_warmup_steps=0.01, freeze_text_encoder=False, frozen_weights=None, giou_loss_coef=2, gpu=0, hidden_dim=256, load='/data_SSD1/lhxiao/mdetr/checkpoint/pretrain/pretrained_resnet101_checkpoint.pth', lr=5e-05, lr_backbone=1e-05, lr_drop=3, mask_loss_coef=1, mask_model='none', masks=False, modulated_lvis_ann_path='', nheads=8, no_detection=False, num_queries=100, num_workers=2, optimizer='adam', output_dir='/data_SSD1/lhxiao/mdetr/output/v01', pass_pos_and_query=True, phrasecut_ann_path='', phrasecut_orig_ann_path='', position_embedding='sine', pre_norm=False, predict_final=False, qa_loss_coef=1, rank=0, refexp_ann_path='/data_SSD1/lhxiao/mdetr/mdetr_annotations/OpenSource/', refexp_dataset_name='refcoco', remove_difficult=False, resume='', run_name='', schedule='linear_with_warmup', seed=42, set_cost_bbox=5, set_cost_class=1, set_cost_giou=2, set_loss='hungarian', split_qa_heads=False, start_epoch=0, temperature_NCE=0.07, test=False, test_type='test', text_encoder_lr=1e-05, text_encoder_type='roberta-base', vg_ann_path='', vg_img_path='', weight_decay=0.0001, world_size=2)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight']
This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
number of params: 185160324
loading annotations into memory...
Done (t=1.61s)
creating index...
index created!
loading annotations into memory...
Done (t=0.09s)
creating index...
index created!
loading from /data_SSD1/lhxiao/mdetr/checkpoint/pretrain/pretrained_resnet101_checkpoint.pth
Start training
Starting epoch 0
/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/position_encoding.py:41: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  dim_t = self.temperature ** (2 * (dim_t // 2) / self.num_pos_feats)
Traceback (most recent call last):
  File "main.py", line 591, in <module>
    main(args)
  File "main.py", line 494, in main
    train_stats = train_one_epoch(
  File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/engine.py", line 73, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 679, in forward
    losses.update(self.get_loss(loss, outputs, targets, positive_map, indices, num_boxes))
  File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 655, in get_loss
    return loss_map[loss](outputs, targets, positive_map, indices, num_boxes, **kwargs)
  File "/home/mmc_xiaolinhui/mmc_226_exp_202206/mdetr/models/mdetr.py", line 518, in loss_contrastive_align
    beg_pos = tokenized.char_to_token(i, beg)
  File "/home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 547, in char_to_token
    raise ValueError("char_to_token() is not available when using Python based tokenizers")
ValueError: char_to_token() is not available when using Python based tokenizers
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3789675 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 3789672) of binary: /home/mmc_xiaolinhui/anaconda3/envs/mdetr_env/bin/python

Environment:

Name	Version	Build	Channel
_libgcc_mutex	0.1	conda_forge	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
_openmp_mutex	4.5	2_gnu	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
bzip2	1.0.8	h7f98852_4	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
ca-certificates	2022.12.7	ha878542_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
certifi	2022.12.7	pypi_0	pypi
charset-normalizer	2.1.1	pypi_0	pypi
click	8.1.3	pypi_0	pypi
cloudpickle	2.2.0	pypi_0	pypi
coloredlogs	15.0.1	pypi_0	pypi
contourpy	1.0.6	pypi_0	pypi
cycler	0.11.0	pypi_0	pypi
cython	0.29.32	pypi_0	pypi
filelock	3.8.2	pypi_0	pypi
flatbuffers	22.12.6	pypi_0	pypi
fonttools	4.38.0	pypi_0	pypi
huggingface-hub	0.0.8	pypi_0	pypi
humanfriendly	10.0	pypi_0	pypi
idna	3.4	pypi_0	pypi
joblib	1.2.0	pypi_0	pypi
kiwisolver	1.4.4	pypi_0	pypi
ld_impl_linux-64	2.39	hcc3a1bd_1	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libffi	3.4.2	h7f98852_5	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgcc-ng	12.2.0	h65d4601_19	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libgomp	12.2.0	h65d4601_19	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libnsl	2.0.0	h7f98852_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libsqlite	3.40.0	h753d276_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libuuid	2.32.1	h7f98852_1000	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
libzlib	1.2.13	h166bdaf_4	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
matplotlib	3.6.2	pypi_0	pypi
mpmath	1.2.1	pypi_0	pypi
ncurses	6.3	h27087fc_1	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
numpy	1.23.5	pypi_0	pypi
onnx	1.13.0	pypi_0	pypi
onnxruntime	1.13.1	pypi_0	pypi
openssl	3.0.7	h0b41bf4_1	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
packaging	22.0	pypi_0	pypi
panopticapi	0.1	pypi_0	pypi
pillow	9.3.0	pypi_0	pypi
pip	22.3.1	pyhd8ed1ab_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
prettytable	3.5.0	pypi_0	pypi
protobuf	3.20.3	pypi_0	pypi
pycocotools	2.0	pypi_0	pypi
pyparsing	3.0.9	pypi_0	pypi
python	3.8.15	h4a9ceb5_0_cpython	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
python-dateutil	2.8.2	pypi_0	pypi
pyyaml	6.0	pypi_0	pypi
readline	8.1.2	h0f457ee_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
regex	2022.10.31	pypi_0	pypi
requests	2.28.1	pypi_0	pypi
sacremoses	0.0.53	pypi_0	pypi
scipy	1.9.3	pypi_0	pypi
setuptools	65.5.1	pyhd8ed1ab_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
six	1.16.0	pypi_0	pypi
submitit	1.4.5	pypi_0	pypi
sympy	1.11.1	pypi_0	pypi
timm	0.6.12	pypi_0	pypi
tk	8.6.12	h27826a3_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
tokenizers	0.10.3	pypi_0	pypi
torch	1.11.0+cu113	pypi_0	pypi
torchaudio	0.11.0+cu113	pypi_0	pypi
torchvision	0.12.0+cu113	pypi_0	pypi
tqdm	4.64.1	pypi_0	pypi
transformers	4.6.0	pypi_0	pypi
typing-extensions	4.4.0	pypi_0	pypi
urllib3	1.26.13	pypi_0	pypi
wcwidth	0.2.5	pypi_0	pypi
wheel	0.38.4	pyhd8ed1ab_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
xmltodict	0.13.0	pypi_0	pypi
xz	5.2.6	h166bdaf_0	https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge
opened by linhuixiao 1
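
The failing call in the traceback above, `tokenized.char_to_token(i, beg)`, maps a character offset in the caption to the index of the token that covers it. In Hugging Face transformers this lookup is only offered by the "fast" (Rust-backed) tokenizers, which is what the error message states; why a Python-based tokenizer ended up being used in this environment is not something the log confirms. The pure-Python sketch below only illustrates the mapping, with whitespace splitting standing in for the real tokenizer. The names `whitespace_offsets` and `char_to_token` are hypothetical helpers, not library functions.

```python
import re
from typing import List, Optional, Tuple


def whitespace_offsets(text: str) -> List[Tuple[int, int]]:
    """(begin, end) character offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def char_to_token(offsets: List[Tuple[int, int]], char_index: int) -> Optional[int]:
    """Index of the token covering `char_index`, or None (e.g. for a space)."""
    for token_index, (begin, end) in enumerate(offsets):
        if begin <= char_index < end:
            return token_index
    return None


if __name__ == "__main__":
    caption = "bowl behind the others can only see part"
    offsets = whitespace_offsets(caption)
    print(char_to_token(offsets, 36))  # token index of the word starting at character 36
    print(char_to_token(offsets, 4))   # character 4 is a space, so no token covers it
```

A slow tokenizer does not keep these per-token offsets, which is why the lookup cannot be served there.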
Bbox assertion error when using ENB models (eval & pretrain as well): assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

Hi, I am using your requirements file, the same libraries, but I am receiving this bbox assertion error, only when using ENB models (ENB3 & 5), and everything is fine when using ResNet backbone.

It seems that the bbox predictions are all NaN. I've found the same error reported for DETR, but with no clear solution (I tried different learning rates and different batch sizes).

In eval mode it appears right away, but in pre-training mode it occurs randomly, at different iterations.

Epoch: [0] [ 940/78534] eta: 8:58:29 lr: 0.000100 lr_backbone: 0.000010 lr_text_encoder: 0.000001 loss: 84.7997 (98.7713) loss_bbox: 1.9616 (3.2480) loss_bbox_0: 2.0860 (3.2589) loss_bbox_1: 2.0562 (3.2474) loss_bbox_2: 1.9565 (3.2655) loss_bbox_3: 2.0650 (3.2771) loss_bbox_4: 1.9537 (3.2593) loss_ce: 10.5923 (11.1107) loss_ce_0: 10.5793 (11.1826) loss_ce_1: 10.5315 (11.0932) loss_ce_2: 10.4743 (11.1262) loss_ce_3: 10.5987 (11.1445) loss_ce_4: 10.5877 (11.0967) loss_giou: 1.7100 (2.0738) loss_giou_0: 1.8241 (2.0809) loss_giou_1: 1.7486 (2.0892) loss_giou_2: 1.7185 (2.0759) loss_giou_3: 1.8257 (2.0803) loss_giou_4: 1.6419 (2.0610) cardinality_error_unscaled: 4.8750 (5.8658) cardinality_error_0_unscaled: 4.8750 (7.5942) cardinality_error_1_unscaled: 4.8750 (6.0007) cardinality_error_2_unscaled: 4.8750 (6.0588) cardinality_error_3_unscaled: 4.8750 (5.8966) cardinality_error_4_unscaled: 4.8750 (5.8688) loss_bbox_unscaled: 0.3923 (0.6496) loss_bbox_0_unscaled: 0.4172 (0.6518) loss_bbox_1_unscaled: 0.4112 (0.6495) loss_bbox_2_unscaled: 0.3913 (0.6531) loss_bbox_3_unscaled: 0.4130 (0.6554) loss_bbox_4_unscaled: 0.3907 (0.6519) loss_ce_unscaled: 10.5923 (11.1107) loss_ce_0_unscaled: 10.5793 (11.1826) loss_ce_1_unscaled: 10.5315 (11.0932) loss_ce_2_unscaled: 10.4743 (11.1262) loss_ce_3_unscaled: 10.5987 (11.1445) loss_ce_4_unscaled: 10.5877 (11.0967) loss_giou_unscaled: 0.8550 (1.0369) loss_giou_0_unscaled: 0.9121 (1.0405) loss_giou_1_unscaled: 0.8743 (1.0446) loss_giou_2_unscaled: 0.8593 (1.0379) loss_giou_3_unscaled: 0.9129 (1.0401) loss_giou_4_unscaled: 0.8209 (1.0305) time: 0.4399 data: 0.0077 max mem: 10505
Traceback (most recent call last):
  File "main.py", line 646, in <module>
    main(args)
  File "main.py", line 549, in main
    train_stats = train_one_epoch(
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/engine.py", line 72, in train_one_epoch
    loss_dict.update(criterion(outputs, targets, positive_map))
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/mdetr.py", line 666, in forward
    indices = self.matcher(outputs_without_aux, targets, positive_map)
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/better_glip/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/models/matcher.py", line 75, in forward
    cost_giou = -generalized_box_iou(box_cxcywh_to_xyxy(out_bbox), box_cxcywh_to_xyxy(tgt_bbox))
  File "/home/ubuntu/efs/users/oignat/internship/mdetr/util/box_ops.py", line 51, in generalized_box_iou
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()

opened by OanaIgnat 1
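
The assertion in the traceback above checks that each box's bottom-right corner is not before its top-left corner. Any comparison involving NaN evaluates to False, so NaN predictions trip this check even though the underlying problem lies upstream in the model output. The sketch below reproduces that behaviour with plain Python floats; it is a simplified scalar stand-in for the tensor code named in the traceback, and `is_valid_xyxy` is a hypothetical helper.

```python
import math
from typing import Tuple

Box4 = Tuple[float, float, float, float]


def box_cxcywh_to_xyxy(box: Box4) -> Box4:
    """Convert (center_x, center_y, width, height) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return (cx - 0.5 * w, cy - 0.5 * h, cx + 0.5 * w, cy + 0.5 * h)


def is_valid_xyxy(box: Box4) -> bool:
    """Scalar analogue of `(boxes[:, 2:] >= boxes[:, :2]).all()`."""
    x0, y0, x1, y1 = box
    return x1 >= x0 and y1 >= y0


if __name__ == "__main__":
    good = box_cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4))
    bad = box_cxcywh_to_xyxy((math.nan, 0.5, 0.2, 0.4))
    print(is_valid_xyxy(good), is_valid_xyxy(bad))
    # A finiteness check separates "NaN upstream" from "genuinely inverted box".
    print(all(math.isfinite(v) for v in bad))
```

Checking the raw predictions for non-finite values before the matcher runs can help locate the iteration at which the NaNs first appear.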
How to define "Negative Tokens" ?

Hello,

I am wondering how to define "negative tokens" in the dataset_dict['images'][0]['tokens_negative'].

Is there any algorithm or rule to make "negative tokens" ?

I couldn't work out the pattern behind them.

Thanks,

Best Regards,

Eric.

opened by jeantirole 0
Pre-training question

Hi, I am very interested in your work. I want to change the CNN backbone of the model and train the whole model from scratch. Could you please point me to where the required changes should be made, and tell me what else I should keep in mind?

opened by hrituraj007 0

Owner

Aishwarya Kamath

Find me @ ashkamath.github.io
