GLIP: Grounded Language-Image Pre-training

Updates

12/06/2021: GLIP paper is on arXiv: https://arxiv.org/abs/2112.03857. Code and models are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP, containing necessary instructions to reproduce the results presented in the paper. This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.

  1. When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
  2. After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
  3. When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection: Faster-RCNN w/ ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead w/ Swin-Tiny (49.7).
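
For orientation, the snippet below sketches how the zero-shot demo is typically driven, stitched together from the Colab snippets and file paths quoted in the issues further down this page. The `GLIPDemo` constructor arguments, the `MODEL.WEIGHT` config key, and the image-loading code are assumptions, not a verified API.

    # Hypothetical zero-shot inference sketch assembled from the demo snippets quoted in
    # the issues below; config/weight paths come from the LVIS command there, while the
    # GLIPDemo constructor arguments and the image loading are assumptions.
    import numpy as np
    import requests
    from io import BytesIO
    from PIL import Image

    from maskrcnn_benchmark.config import cfg
    from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

    config_file = "configs/pretrain/glip_Swin_T_O365_GoldG.yaml"
    weight_file = "PRETRAINED/glip_tiny_model_o365_goldg.pth"
    cfg.merge_from_file(config_file)
    cfg.merge_from_list(["MODEL.WEIGHT", weight_file])  # key assumed from maskrcnn_benchmark conventions

    glip_demo = GLIPDemo(cfg, min_image_size=800)  # constructor arguments are an assumption

    # Fetch an image and run grounded detection with a free-form text prompt.
    url = "http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg"
    pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
    image = np.array(pil_image)[:, :, ::-1]  # assuming the demo expects BGR, as with OpenCV

    caption = "bobble heads on top of the shelf"
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)  # 0.5 = score threshold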

Citations

Please consider citing this paper if you use the code:

@article{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      journal={arXiv preprint arXiv:2112.03857},
      year={2021},
}
Comments
  • Mask prediction

    Thank you for your great work! I tried the demo and it works insanely well!

    I'm wondering whether your model contains a mask prediction head, because the GLIPDemo has a show_mask_heatmaps parameter. When I set it to true, the prediction has no mask field and therefore fails.

    Do you have a pretrained model with a mask prediction head?

    opened by Steve-Tod 13
  • RuntimeError: Not compiled with GPU support

    Hi guys,

    Thanks for the amazing work! I am trying to run the model, but I get the following error:

    Traceback (most recent call last):
      File "tools/test_grounding_net.py", line 222, in <module>
        main()
      File "tools/test_grounding_net.py", line 205, in main
        inference(
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/engine/inference.py", line 495, in inference
        output = model(images, captions=captions, positive_map=positive_map_label_to_token)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py", line 284, in forward
        proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 920, in forward
        proj_tokens, contrastive_logits, dot_product_logits, mlm_logits, shallow_img_emb_feats, fused_visual_features = self.head(features,
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 739, in forward
        dyhead_tower = self.dyhead_tower(feat_inputs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
        input = module(input)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 205, in forward
        temp_fea = [self.DyConv[1](feature, **conv_args)]
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 135, in forward
        x = self.conv(input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 235, in decorate_fwd
        return fwd(*args, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 380, in forward
        return modulated_deform_conv(
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 184, in forward
        _C.modulated_deform_conv_forward(
    RuntimeError: Not compiled with GPU support
    

    Here is my nvidia-smi output; I had already run `python setup.py build develop --user` beforehand.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
    | 30%   49C    P8    26W / 350W |    129MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
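
    One quick sanity check (not from the repo) is to confirm that the PyTorch environment used when running `python setup.py build develop` actually sees CUDA; the assumption here is that the maskrcnn_benchmark extensions are compiled CPU-only whenever CUDA is not visible at build time.

    # Environment check before rebuilding the CUDA extensions (assumption: the extensions
    # are built without GPU support whenever PyTorch cannot see CUDA at build time).
    import torch

    print(torch.__version__)          # PyTorch version
    print(torch.version.cuda)         # CUDA version PyTorch was built against (None on CPU-only builds)
    print(torch.cuda.is_available())  # should be True in the environment used for setup.py build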
    

    Thanks :)

    Cheers,

    Francesco

    opened by FrancescoSaverioZuppichini 7
  • The behavior of `ModulatedDeformConv` when the mask shape is different from the input shape.

    Thank you for your great work.

    When I changed maskrcnn_benchmark.layers.ModulatedDeformConv to torchvision.ops.DeformConv2d to avoid CUDA compilation, I noticed some strange behavior.

    In maskrcnn_benchmark/modeling/rpn/vldyhead.py at line 210, the input shape differs from the shapes of offset and mask: the input has the spatial size of level + 1, while offset and mask have the spatial size of level. Yet this does not cause an error.

    I guess this is because ModulatedDeformConv does not check the shapes of the mask and offset. I have not read its CUDA code, but I observed the following strange behavior:

    import torch
    from maskrcnn_benchmark.layers import ModulatedDeformConv

    # 3x3 modulated deformable conv: expects 18 offset channels (2*3*3) and
    # 9 mask channels (3*3), at the conv output resolution
    moddc = ModulatedDeformConv(8, 16, 3).cuda()
    x = torch.randn(1, 8, 10, 10).cuda()
    # offset and mask are deliberately created at 20x20, i.e. a "wrong" spatial size
    offset = torch.randn(1, 18, 20, 20).cuda()
    mask = torch.randn(1, 9, 20, 20).cuda()

    res1 = moddc(x, offset, mask)
    # reshaping and taking [0] passes only the first 18*10*10 (resp. 9*10*10)
    # contiguous values of the oversized offset/mask tensors
    res2 = moddc(x, offset.reshape(4, 18, 10, 10)[0], mask.reshape(4, 9, 10, 10)[0])
    result = torch.abs(res1 - res2).max()
    print(result)

    The result is always 0. vldyhead.py relies on the same behavior, which may not be desirable.

    I am looking forward to your reply. Sincerely,

    opened by KeitaOtani 6
  • No box detected using the code in the Colab demo

    Thanks for the great work! I'm trying the code provided in your Colab demo, but no bounding boxes are detected (no errors appear during compilation or execution). I've also re-installed packages to make sure the package versions on my server match those in your code. Are there any possible reasons for this result?

    opened by abril4416 6
  • RuntimeError: Not compiled with GPU support on Colab

    Thank you for sharing this interesting work. When I ran the Colab example, the 6th cell of the notebook produced the following error.

    [[[0, 12]], [[16, 19]], [[23, 32]]]
    
    /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
      "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
    
    ---------------------------------------------------------------------------
    
    RuntimeError                              Traceback (most recent call last)
    
    [<ipython-input-6-d454bb231030>](https://localhost:8080/#) in <module>()
          1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
          2 caption = 'bobble heads on top of the shelf'
    ----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
          4 imshow(result, caption)
    
    17 frames
    
    [/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in run_on_web_image(self, original_image, original_caption, thresh, custom_entity, alpha)
        138             custom_entity = None,
        139             alpha = 0.0):
    --> 140         predictions = self.compute_prediction(original_image, original_caption, custom_entity)
        141         top_predictions = self._post_process(predictions, thresh)
        142 
    
    [/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in compute_prediction(self, original_image, original_caption, custom_entity)
        217         # compute predictions
        218         with torch.no_grad():
    --> 219             predictions = self.model(image_list, captions=[original_caption], positive_map=positive_map_label_to_token)
        220             predictions = [o.to(self.cpu_device) for o in predictions]
        221         print("inference time per image: {}".format(timeit.time.perf_counter() - tic))
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py](https://localhost:8080/#) in forward(self, images, targets, captions, positive_map, greenlight_map)
        283         else:
        284             proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
    --> 285                                               captions, swint_feature_c4)
        286         if self.roi_heads:
        287             if self.cfg.MODEL.ROI_MASK_HEAD.PREDICTOR.startswith("VL"):
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, images, features, targets, language_dict_features, positive_map, captions, swint_feature_c4)
        921                                                                         language_dict_features,
        922                                                                         embedding,
    --> 923                                                                         swint_feature_c4
        924                                                                         )
        925         anchors = self.anchor_generator(images, features)
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, x, language_dict_features, embedding, swint_feature_c4)
        737                        "lang": language_dict_features}
        738 
    --> 739         dyhead_tower = self.dyhead_tower(feat_inputs)
        740 
        741         # soft token
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py](https://localhost:8080/#) in forward(self, input)
        137     def forward(self, input):
        138         for module in self:
    --> 139             input = module(input)
        140         return input
        141 
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, inputs)
        203                 conv_args = dict(offset=offset, mask=mask)
        204 
    --> 205             temp_fea = [self.DyConv[1](feature, **conv_args)]
        206 
        207             if level > 0:
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, input, **kwargs)
        133 
        134     def forward(self, input, **kwargs):
    --> 135         x = self.conv(input, **kwargs)
        136         if self.bn:
        137             x = self.bn(x)
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/autocast_mode.py](https://localhost:8080/#) in decorate_fwd(*args, **kwargs)
        217                     return fwd(*_cast(args, cast_inputs), **_cast(kwargs, cast_inputs))
        218             else:
    --> 219                 return fwd(*args, **kwargs)
        220     return decorate_fwd
        221 
    
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(self, input, offset, mask)
        380         return modulated_deform_conv(
        381             input, offset, mask, self.weight, self.bias, self.stride,
    --> 382             self.padding, self.dilation, self.groups, self.deformable_groups)
        383 
        384     def __repr__(self):
    
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
        201             ctx.groups,
        202             ctx.deformable_groups,
    --> 203             ctx.with_bias
        204         )
        205         return output
    
    RuntimeError: Not compiled with GPU support
    

    Here is the GPU information that I used:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
    | N/A   51C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    Any advice is appreciated.

    Sincerely,

    opened by Kashu7100 6
  • Custom Dataset - Some guidance

    Hi there, I'm confused by the terms tokens_positive / tokens_negative and by how the image caption itself should be written.

    What should the image caption be if I have multiple objects with different attributes in the same image? For instance: a pink elephant, a blue elephant, and a normal elephant in the same image. Should the caption in the annotation file be "blue elephant,normal elephant,pink elephant"? And for the boxes, should each elephant's tokens_positive point to the span of the corresponding elephant in the caption? For example:

    for blue elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [0,13] }
    for normal elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [14,29] }
    for pink elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [30,44] }
    "categories": [{ "supercategory": "animal", "id": 1, "name": "elephant" }]

    Do you know of any guide for creating the dataset? Thanks!
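
    To illustrate the character-span idea in the question above, here is a hedged sketch (not from the repo) that builds the caption and the tokens_positive spans programmatically; the exact annotation schema, including whether tokens_positive is a flat pair or a list of pairs, is an assumption based on the fields quoted above.

    # Hypothetical helper that builds a caption plus character-offset spans for each phrase.
    # The annotation schema (field names, span convention) is an assumption, not the repo's
    # documented format.
    def build_caption_and_spans(phrases, separator=","):
        caption = separator.join(phrases)
        spans, start = [], 0
        for phrase in phrases:
            spans.append([start, start + len(phrase)])  # [char_start, char_end) into caption
            start += len(phrase) + len(separator)
        return caption, spans

    caption, spans = build_caption_and_spans(["blue elephant", "normal elephant", "pink elephant"])
    # caption == "blue elephant,normal elephant,pink elephant"
    # spans   == [[0, 13], [14, 29], [30, 43]]
    annotations = [
        {"category_id": 1, "bbox": [0, 0, 10, 10], "tokens_positive": [span]}  # bbox values are placeholders
        for span in spans
    ]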

    opened by fernandorovai 5
  • Object365 in tsv format

    Dear authors,

    Thanks for presenting such great work.

    I'm interested in the pretraining part but quite confused by the data format for Object365. From #10, I know an old version of Object365 (v1) was used, but Object365 has since been updated to v2, and it seems some v1 images were deleted from v2. The provided train.label.tsv in this repo contains 608606 images, but I can only find 519789 of the 608606 in Object365 v2. Would you mind sharing the script used to generate the tsv format so that we can build the required data for Object365 v2?

    During tsv data generation, it also seems we need to load all the images into memory, which takes too much memory. Is there a way to do pretraining without the tsv format?
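
    As an illustration of how such a tsv might be written without holding every image in memory, here is a hedged sketch that streams one image per line; the column layout (image key, base64-encoded image bytes, JSON labels) is an assumption and may not match the repo's actual loader.

    # Hypothetical streaming tsv writer: one image per line, nothing held in memory.
    # The column layout (key \t base64 image \t JSON boxes) is an assumption, not the
    # repo's documented format.
    import base64
    import json
    import os

    def write_image_tsv(image_dir, annotations, out_path):
        # annotations: dict mapping filename -> list of {"bbox": [...], "category_id": ...}
        with open(out_path, "w") as out:
            for filename, boxes in annotations.items():
                with open(os.path.join(image_dir, filename), "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode("utf-8")
                out.write("\t".join([filename, img_b64, json.dumps(boxes)]) + "\n")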

    Best

    opened by xiaofeng94 5
  • When the prompt exceeds the length

    Hello.

    Q1. The total number of classes in my custom dataset is about 300, so it throws an error. I think this error occurs because the logit dimension is [:, :, 256] while some indices in my positive map are greater than 256. Am I right?

    Q2. In #37 it was said, "If the prompt exceeds the length, you can take a look at the inference codes about how we deal with the LVIS dataset (~1200 classes)." Can converting COCO annotations to LVIS annotations solve the error in Q1? If so, is there an API that converts between the annotation formats?
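
    For context, the LVIS evaluation command quoted further down this page passes TEST.CHUNKED_EVALUATION 40, which suggests splitting a large label set into chunks and running one prompt per chunk. A minimal sketch of that idea is below; the prompt format, the run_on_web_image call pattern, and the deferred merging of per-chunk predictions are assumptions, not the repo's actual inference code.

    # Hypothetical chunked prompting for label sets that exceed the 256 limit described
    # in Q1. Merging the per-chunk predictions (e.g. cross-chunk NMS) is left out.
    def detect_with_chunked_classes(glip_demo, image, class_names, chunk_size=40, thresh=0.5):
        chunk_outputs = []
        for start in range(0, len(class_names), chunk_size):
            chunk = class_names[start:start + chunk_size]
            caption = " . ".join(chunk)  # prompt style shown elsewhere on this page
            chunk_outputs.append(glip_demo.run_on_web_image(image, caption, thresh))
        return chunk_outputs  # one output per chunk; merge afterwards as needed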

    opened by jsk1107 4
  • Colab example Cuda 10.2 install error

    Hi

    Firstly, thank you for sharing this project.

    When running the demo on Colab, I get this error:

    https://colab.research.google.com/drive/12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing#scrollTo=BtMdw_J6PprI

    cuda-repo-ubuntu180 100%[===================>]   1.77G   233MB/s    in 7.7s    
    
    2022-06-16 18:28:57 (235 MB/s) - ‘cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb’ saved [1896270068/1896270068]
    
    Selecting previously unselected package cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01.
    (Reading database ... 114088 files and directories currently installed.)
    Preparing to unpack cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb ...
    Unpacking cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
    Setting up cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
    OK
    Reading package lists... Done
    Building dependency tree       
    Reading state information... Done
    E: Unable to locate package cuda
    

    The rest of the notebook runs until this cell

    image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
    caption = 'bobble heads on top of the shelf'
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
    imshow(result, caption)
    
    /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
      "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    [<ipython-input-6-d454bb231030>](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in <module>()
          1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
          2 caption = 'bobble heads on top of the shelf'
    ----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
          4 imshow(result, caption)
    
    17 frames
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
        201             ctx.groups,
        202             ctx.deformable_groups,
    --> 203             ctx.with_bias
        204         )
        205         return output
    
    RuntimeError: Not compiled with GPU support
    

    Thank you very much!

    opened by vade 4
  • Runtime Error in Colab demo

    Hi, many thanks for releasing such amazing work, but I encountered something unexpected in the Colab demo.

    image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
    caption = 'bobble heads on top of the shelf'
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
    imshow(result, caption)

    When I run the code above, "RuntimeError: Not compiled with GPU support" is reported. Can you provide some advice?

    opened by NickChang97 4
  • Results on COCO 2017 dataset

    Hi,

    Could you please clarify what your actual results on the COCO 2017 dataset are?

    I've just found the submission from the GLIP/Microsoft Research team that took 2nd place on COCO test-dev2019 with AP=0.62 and AP50=0.80.

    Did you use the GLIP-L model or another one that is not public so far?

    https://competitions.codalab.org/competitions/20794#results

    opened by aiotko 4
  • Hopefully the authors will make the GLIP V2 environment a little easier to configure, thank you!

        @dongzhiwu Thank you for your interest in our GLIPv2 work. Currently, the v2 paper is still under review, but we will be sure to open source our codes and models as soon as the process finishes. Again, thank you for the patience and we really appreciate it!
    

    Originally posted by @Haotian-Zhang in https://github.com/microsoft/GLIP/issues/17#issuecomment-1159916498

    opened by linhuixiao 0
  • the role of contrastive words in prompt

    For the demo in Colab, if we simplify the prompt to "person . sofa . remote", it fails to detect "remote".

    But if we add a very different word, e.g. "sky", it works.

    Any idea why we need contrastive words in the prompt? Should we always include contrastive words to improve performance?

    opened by TracyYXChen 0
  • Does not build with current PyTorch

    Hi, thanks for the amazing work! I was interested in experimenting with it when I realized there is an error when building with the last few versions of PyTorch. As this issue pointed out, the THC/THC.h include results in an error because it was deprecated and is no longer part of PyTorch. This prevents the method from being integrated into any environment that uses these more recent versions of PyTorch, and if other code depends on the newer versions, downgrading is not a viable solution. Is there any way to get around this?

    opened by sachit-menon 0
  • Get much lower AP on LVIS after directly following the instructions

    Dear authors,

    By following the commands in https://github.com/microsoft/GLIP#lvis-evaluation with the provided config file and pretrained weights, the evaluation results I get on LVIS minival are much lower than the ones reported in the README.

    The command I used:

    CUDA_VISIBLE_DEVICES=3,4,5,6 python -m torch.distributed.launch --nproc_per_node=4 \
        tools/test_grounding_net.py \
        --config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml \
        --task_config configs/lvis/minival.yaml \
        --weight PRETRAINED/glip_tiny_model_o365_goldg.pth \
        TEST.EVAL_TASK detection OUTPUT_DIR evals/lvis \
        TEST.CHUNKED_EVALUATION 40 TEST.IMS_PER_BATCH 16 SOLVER.IMS_PER_BATCH 16 \
        TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 \
        MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 \
        MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300

    The corresponding evaluation used the config and weights of GLIP-T(C), whose APr should be either ~14.3 or ~17.7, but the numbers I obtained are much lower.

    I have also tried GLIP-T(A), but the results are also much lower. Do you have any suggestions about what I might not have done correctly?

    opened by backseason 0