GLIP: Grounded Language-Image Pre-training

Updates

12/06/2021: GLIP paper is on arXiv: https://arxiv.org/abs/2112.03857. Code and models are under internal review and will be released soon. Stay tuned!

11/23/2021: Project page built.

Introduction

This repository is the project page for GLIP, containing necessary instructions to reproduce the results presented in the paper. This paper presents a grounded language-image pre-training (GLIP) model for learning object-level, language-aware, and semantic-rich visual representations. GLIP unifies object detection and phrase grounding for pre-training. The unification brings two benefits: 1) it allows GLIP to learn from both detection and grounding data to improve both tasks and bootstrap a good grounding model; 2) GLIP can leverage massive image-text pairs by generating grounding boxes in a self-training fashion, making the learned representation semantic-rich. In our experiments, we pre-train GLIP on 27M grounding data, including 3M human-annotated and 24M web-crawled image-text pairs. The learned representations demonstrate strong zero-shot and few-shot transferability to various object-level recognition tasks.

  1. When directly evaluated on COCO and LVIS (without seeing any images in COCO during pre-training), GLIP achieves 49.8 AP and 26.9 AP, respectively, surpassing many supervised baselines.
  2. After fine-tuning on COCO, GLIP achieves 60.8 AP on val and 61.5 AP on test-dev, surpassing the prior SoTA.
  3. When transferred to 13 downstream object detection tasks, a few-shot GLIP rivals a fully supervised Dynamic Head.

Supervised baselines on COCO object detection: Faster-RCNN w/ ResNet50 (40.2) or ResNet101 (42.0) from Detectron2, and DyHead w/ Swin-Tiny (49.7).
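
For orientation, the snippet below sketches how the zero-shot demo is typically driven, stitched together from the Colab snippets and file paths quoted in the issues further down this page. The `GLIPDemo` constructor arguments, the `MODEL.WEIGHT` config key, and the image-loading code are assumptions, not a verified API.

    # Hypothetical zero-shot inference sketch assembled from the demo snippets quoted in
    # the issues below; config/weight paths come from the LVIS command there, while the
    # GLIPDemo constructor arguments and the image loading are assumptions.
    import numpy as np
    import requests
    from io import BytesIO
    from PIL import Image

    from maskrcnn_benchmark.config import cfg
    from maskrcnn_benchmark.engine.predictor_glip import GLIPDemo

    config_file = "configs/pretrain/glip_Swin_T_O365_GoldG.yaml"
    weight_file = "PRETRAINED/glip_tiny_model_o365_goldg.pth"
    cfg.merge_from_file(config_file)
    cfg.merge_from_list(["MODEL.WEIGHT", weight_file])  # key assumed from maskrcnn_benchmark conventions

    glip_demo = GLIPDemo(cfg, min_image_size=800)  # constructor arguments are an assumption

    # Fetch an image and run grounded detection with a free-form text prompt.
    url = "http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg"
    pil_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB")
    image = np.array(pil_image)[:, :, ::-1]  # assuming the demo expects BGR, as with OpenCV

    caption = "bobble heads on top of the shelf"
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)  # 0.5 = score threshold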

Citations

Please consider citing this paper if you use the code:

@article{harold_GLIP2021,
      title={Grounded Language-Image Pre-training},
      author={Liunian Harold Li* and Pengchuan Zhang* and Haotian Zhang* and Jianwei Yang and Chunyuan Li and Yiwu Zhong and Lijuan Wang and Lu Yuan and Lei Zhang and Jenq-Neng Hwang and Kai-Wei Chang and Jianfeng Gao},
      journal={arXiv preprint arXiv:2112.03857},
      year={2021},
}
Comments
  • Mask prediction

    Thank you for your great work! I tried the demo and it works insanely well!

    I'm wondering whether your model contains a mask prediction head, because the GLIPDemo has a show_mask_heatmaps parameter. When I set it to true, the prediction has no mask field and therefore fails.

    Do you have a pretrained model with a mask prediction head?

    opened by Steve-Tod 13
  • RuntimeError: Not compiled with GPU support

    Hi guys,

    Thanks for the amazing work! I am trying to run the model, but I get the following error:

    Traceback (most recent call last):
      File "tools/test_grounding_net.py", line 222, in <module>
        main()
      File "tools/test_grounding_net.py", line 205, in main
        inference(
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/engine/inference.py", line 495, in inference
        output = model(images, captions=captions, positive_map=positive_map_label_to_token)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py", line 284, in forward
        proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 920, in forward
        proj_tokens, contrastive_logits, dot_product_logits, mlm_logits, shallow_img_emb_feats, fused_visual_features = self.head(features,
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 739, in forward
        dyhead_tower = self.dyhead_tower(feat_inputs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py", line 139, in forward
        input = module(input)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 205, in forward
        temp_fea = [self.DyConv[1](feature, **conv_args)]
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py", line 135, in forward
        x = self.conv(input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1056, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/autocast_mode.py", line 235, in decorate_fwd
        return fwd(*args, **kwargs)
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 380, in forward
        return modulated_deform_conv(
      File "/workspace/GLIP-benchmark/GLIP/maskrcnn_benchmark/layers/deform_conv.py", line 184, in forward
        _C.modulated_deform_conv_forward(
    RuntimeError: Not compiled with GPU support
    

    Here is my nvidia-smi output; I had already run `python setup.py build develop --user` beforehand.

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 515.57       Driver Version: 515.57       CUDA Version: 11.7     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA GeForce ...  Off  | 00000000:1D:00.0 Off |                  N/A |
    | 30%   49C    P8    26W / 350W |    129MiB / 24576MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    +-----------------------------------------------------------------------------+
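
    One quick sanity check (not from the repo) is to confirm that the PyTorch environment used when running `python setup.py build develop` actually sees CUDA; the assumption here is that the maskrcnn_benchmark extensions are compiled CPU-only whenever CUDA is not visible at build time.

    # Environment check before rebuilding the CUDA extensions (assumption: the extensions
    # are built without GPU support whenever PyTorch cannot see CUDA at build time).
    import torch

    print(torch.__version__)          # PyTorch version
    print(torch.version.cuda)         # CUDA version PyTorch was built against (None on CPU-only builds)
    print(torch.cuda.is_available())  # should be True in the environment used for setup.py build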
    

    Thanks :)

    Cheers,

    Francesco

    opened by FrancescoSaverioZuppichini 7
  • The behavior of `ModulatedDeformConv` when the mask shape is different from the input shape.

    Thank you for your great work.

    When I changed maskrcnn_benchmark.layers.ModulatedDeformConv to torchvision.ops.DeformConv2d to avoid CUDA compilation, I noticed some strange behavior.

    In maskrcnn_benchmark/modeling/rpn/vldyhead.py at line 210, the input shape differs from the shapes of offset and mask: the input has the spatial size of level + 1, while offset and mask have the spatial size of level. Yet this does not cause an error.

    I guess this is because ModulatedDeformConv does not check the shapes of the mask and offset. I have not read its CUDA code, but I observed the following strange behavior:

    import torch
    from maskrcnn_benchmark.layers import ModulatedDeformConv

    # 3x3 modulated deformable conv: expects 18 offset channels (2*3*3) and
    # 9 mask channels (3*3), at the conv output resolution
    moddc = ModulatedDeformConv(8, 16, 3).cuda()
    x = torch.randn(1, 8, 10, 10).cuda()
    # offset and mask are deliberately created at 20x20, i.e. a "wrong" spatial size
    offset = torch.randn(1, 18, 20, 20).cuda()
    mask = torch.randn(1, 9, 20, 20).cuda()

    res1 = moddc(x, offset, mask)
    # reshaping and taking [0] passes only the first 18*10*10 (resp. 9*10*10)
    # contiguous values of the oversized offset/mask tensors
    res2 = moddc(x, offset.reshape(4, 18, 10, 10)[0], mask.reshape(4, 9, 10, 10)[0])
    result = torch.abs(res1 - res2).max()
    print(result)

    The result is always 0. vldyhead.py relies on the same behavior, which may not be desirable.

    I am looking forward to your reply. Sincerely,

    opened by KeitaOtani 6
  • No box detected using the code in the Colab demo

    Thanks for the great work! I'm trying the code provided in your Colab demo, but no bounding boxes are detected (no errors appear during compilation or execution). I've also re-installed packages to make sure the package versions on my server match those in your code. Are there any possible reasons for this result?

    opened by abril4416 6
  • RuntimeError: Not compiled with GPU support on Colab

    Thank you for sharing this interesting work. When I ran the Colab example, the 6th cell of the notebook produced the following error.

    [[[0, 12]], [[16, 19]], [[23, 32]]]
    
    /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
      "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
    
    ---------------------------------------------------------------------------
    
    RuntimeError                              Traceback (most recent call last)
    
    [<ipython-input-6-d454bb231030>](https://localhost:8080/#) in <module>()
          1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
          2 caption = 'bobble heads on top of the shelf'
    ----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
          4 imshow(result, caption)
    
    17 frames
    
    [/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in run_on_web_image(self, original_image, original_caption, thresh, custom_entity, alpha)
        138             custom_entity = None,
        139             alpha = 0.0):
    --> 140         predictions = self.compute_prediction(original_image, original_caption, custom_entity)
        141         top_predictions = self._post_process(predictions, thresh)
        142 
    
    [/content/GLIP/maskrcnn_benchmark/engine/predictor_glip.py](https://localhost:8080/#) in compute_prediction(self, original_image, original_caption, custom_entity)
        217         # compute predictions
        218         with torch.no_grad():
    --> 219             predictions = self.model(image_list, captions=[original_caption], positive_map=positive_map_label_to_token)
        220             predictions = [o.to(self.cpu_device) for o in predictions]
        221         print("inference time per image: {}".format(timeit.time.perf_counter() - tic))
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/detector/generalized_vl_rcnn.py](https://localhost:8080/#) in forward(self, images, targets, captions, positive_map, greenlight_map)
        283         else:
        284             proposals, proposal_losses, fused_visual_features = self.rpn(images, visual_features, targets, language_dict_features, positive_map,
    --> 285                                               captions, swint_feature_c4)
        286         if self.roi_heads:
        287             if self.cfg.MODEL.ROI_MASK_HEAD.PREDICTOR.startswith("VL"):
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, images, features, targets, language_dict_features, positive_map, captions, swint_feature_c4)
        921                                                                         language_dict_features,
        922                                                                         embedding,
    --> 923                                                                         swint_feature_c4
        924                                                                         )
        925         anchors = self.anchor_generator(images, features)
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, x, language_dict_features, embedding, swint_feature_c4)
        737                        "lang": language_dict_features}
        738 
    --> 739         dyhead_tower = self.dyhead_tower(feat_inputs)
        740 
        741         # soft token
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/container.py](https://localhost:8080/#) in forward(self, input)
        137     def forward(self, input):
        138         for module in self:
    --> 139             input = module(input)
        140         return input
        141 
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, inputs)
        203                 conv_args = dict(offset=offset, mask=mask)
        204 
    --> 205             temp_fea = [self.DyConv[1](feature, **conv_args)]
        206 
        207             if level > 0:
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/content/GLIP/maskrcnn_benchmark/modeling/rpn/vldyhead.py](https://localhost:8080/#) in forward(self, input, **kwargs)
        133 
        134     def forward(self, input, **kwargs):
    --> 135         x = self.conv(input, **kwargs)
        136         if self.bn:
        137             x = self.bn(x)
    
    [/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py](https://localhost:8080/#) in _call_impl(self, *input, **kwargs)
       1049         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1050                 or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1051             return forward_call(*input, **kwargs)
       1052         # Do not call functions when jit is used
       1053         full_backward_hooks, non_full_backward_hooks = [], []
    
    [/usr/local/lib/python3.7/dist-packages/torch/cuda/amp/autocast_mode.py](https://localhost:8080/#) in decorate_fwd(*args, **kwargs)
        217                     return fwd(*_cast(args, cast_inputs), **_cast(kwargs, cast_inputs))
        218             else:
    --> 219                 return fwd(*args, **kwargs)
        220     return decorate_fwd
        221 
    
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(self, input, offset, mask)
        380         return modulated_deform_conv(
        381             input, offset, mask, self.weight, self.bias, self.stride,
    --> 382             self.padding, self.dilation, self.groups, self.deformable_groups)
        383 
        384     def __repr__(self):
    
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://localhost:8080/#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
        201             ctx.groups,
        202             ctx.deformable_groups,
    --> 203             ctx.with_bias
        204         )
        205         return output
    
    RuntimeError: Not compiled with GPU support
    

    Here is the GPU information that I used:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
    | N/A   51C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
                                                                                   
    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+
    

    Any advice is appreciated.

    Sincerely,

    opened by Kashu7100 6
  • Custom Dataset - Some guidance

    Hi there, I'm confused by the terms tokens_positive / tokens_negative and by how the image caption itself should be written.

    What should the image caption be if I have multiple objects with different attributes in the same image? For instance: a pink elephant, a blue elephant, and a normal elephant in the same image. Should the caption in the annotation file be "blue elephant,normal elephant,pink elephant"? And for the boxes, should each elephant's tokens_positive point to the span of the corresponding elephant in the caption? For example:

    for blue elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [0,13] }
    for normal elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [14,29] }
    for pink elephant => { "category_id": 1, "bbox": [xMin,yMin,width,height], "tokens_positive": [30,44] }
    "categories": [{ "supercategory": "animal", "id": 1, "name": "elephant" }]

    Do you know of any guide for creating the dataset? Thanks!
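
    To illustrate the character-span idea in the question above, here is a hedged sketch (not from the repo) that builds the caption and the tokens_positive spans programmatically; the exact annotation schema, including whether tokens_positive is a flat pair or a list of pairs, is an assumption based on the fields quoted above.

    # Hypothetical helper that builds a caption plus character-offset spans for each phrase.
    # The annotation schema (field names, span convention) is an assumption, not the repo's
    # documented format.
    def build_caption_and_spans(phrases, separator=","):
        caption = separator.join(phrases)
        spans, start = [], 0
        for phrase in phrases:
            spans.append([start, start + len(phrase)])  # [char_start, char_end) into caption
            start += len(phrase) + len(separator)
        return caption, spans

    caption, spans = build_caption_and_spans(["blue elephant", "normal elephant", "pink elephant"])
    # caption == "blue elephant,normal elephant,pink elephant"
    # spans   == [[0, 13], [14, 29], [30, 43]]
    annotations = [
        {"category_id": 1, "bbox": [0, 0, 10, 10], "tokens_positive": [span]}  # bbox values are placeholders
        for span in spans
    ]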

    opened by fernandorovai 5
  • Object365 in tsv format

    Dear authors,

    Thanks for presenting such great work.

    I'm interested in the pretraining part but quite confused by the data format for Object365. From #10, I know an old version of Object365 (v1) was used, but Object365 has since been updated to v2, and it seems some v1 images were deleted from v2. The provided train.label.tsv in this repo contains 608606 images, but I can only find 519789 of the 608606 in Object365 v2. Would you mind sharing the script used to generate the tsv format so that we can build the required data for Object365 v2?

    During tsv data generation, it also seems we need to load all the images into memory, which takes too much memory. Is there a way to do pretraining without the tsv format?
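
    As an illustration of how such a tsv might be written without holding every image in memory, here is a hedged sketch that streams one image per line; the column layout (image key, base64-encoded image bytes, JSON labels) is an assumption and may not match the repo's actual loader.

    # Hypothetical streaming tsv writer: one image per line, nothing held in memory.
    # The column layout (key \t base64 image \t JSON boxes) is an assumption, not the
    # repo's documented format.
    import base64
    import json
    import os

    def write_image_tsv(image_dir, annotations, out_path):
        # annotations: dict mapping filename -> list of {"bbox": [...], "category_id": ...}
        with open(out_path, "w") as out:
            for filename, boxes in annotations.items():
                with open(os.path.join(image_dir, filename), "rb") as f:
                    img_b64 = base64.b64encode(f.read()).decode("utf-8")
                out.write("\t".join([filename, img_b64, json.dumps(boxes)]) + "\n")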

    Best

    opened by xiaofeng94 5
  • When the prompt exceeds the length

    Hello.

    Q1. The total number of classes in my custom dataset is about 300, so it throws an error. I think this error occurs because the logit dimension is [:, :, 256] while some indices in my positive map are greater than 256. Am I right?

    Q2. In #37 it was said, "If the prompt exceeds the length, you can take a look at the inference codes about how we deal with the LVIS dataset (~1200 classes)." Can converting COCO annotations to LVIS annotations solve the error in Q1? If so, is there an API that converts between the annotation formats?
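
    For context, the LVIS evaluation command quoted further down this page passes TEST.CHUNKED_EVALUATION 40, which suggests splitting a large label set into chunks and running one prompt per chunk. A minimal sketch of that idea is below; the prompt format, the run_on_web_image call pattern, and the deferred merging of per-chunk predictions are assumptions, not the repo's actual inference code.

    # Hypothetical chunked prompting for label sets that exceed the 256 limit described
    # in Q1. Merging the per-chunk predictions (e.g. cross-chunk NMS) is left out.
    def detect_with_chunked_classes(glip_demo, image, class_names, chunk_size=40, thresh=0.5):
        chunk_outputs = []
        for start in range(0, len(class_names), chunk_size):
            chunk = class_names[start:start + chunk_size]
            caption = " . ".join(chunk)  # prompt style shown elsewhere on this page
            chunk_outputs.append(glip_demo.run_on_web_image(image, caption, thresh))
        return chunk_outputs  # one output per chunk; merge afterwards as needed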

    opened by jsk1107 4
  • Colab example Cuda 10.2 install error

    Hi

    Firstly, thank you for sharing this project.

    When running the demo on Colab, I get this error:

    https://colab.research.google.com/drive/12x7v-_miN7-SRiziK3Cx4ffJzstBJNqb?usp=sharing#scrollTo=BtMdw_J6PprI

    cuda-repo-ubuntu180 100%[===================>]   1.77G   233MB/s    in 7.7s    
    
    2022-06-16 18:28:57 (235 MB/s) - ‘cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb’ saved [1896270068/1896270068]
    
    Selecting previously unselected package cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01.
    (Reading database ... 114088 files and directories currently installed.)
    Preparing to unpack cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01_1.0-1_amd64.deb ...
    Unpacking cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
    Setting up cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 (1.0-1) ...
    OK
    Reading package lists... Done
    Building dependency tree       
    Reading state information... Done
    E: Unable to locate package cuda
    

    The rest of the notebook runs until this cell

    image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
    caption = 'bobble heads on top of the shelf'
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
    imshow(result, caption)
    
    /usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py:813: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
      "The `device` argument is deprecated and will be removed in v5 of Transformers.", FutureWarning
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    [<ipython-input-6-d454bb231030>](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in <module>()
          1 image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
          2 caption = 'bobble heads on top of the shelf'
    ----> 3 result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
          4 imshow(result, caption)
    
    17 frames
    [/content/GLIP/maskrcnn_benchmark/layers/deform_conv.py](https://4wx9ajd7bx7-496ff2e9c6d22116-0-colab.googleusercontent.com/outputframe.html?vrz=colab-20220615-060045-RC00_455067423#) in forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, groups, deformable_groups)
        201             ctx.groups,
        202             ctx.deformable_groups,
    --> 203             ctx.with_bias
        204         )
        205         return output
    
    RuntimeError: Not compiled with GPU support
    

    Thank you very much!

    opened by vade 4
  • Runtime Error in Colab demo

    Hi, many thanks for releasing such amazing work, but I encountered something unexpected in the Colab demo.

    image = load('http://farm4.staticflickr.com/3693/9472793441_b7822c00de_z.jpg')
    caption = 'bobble heads on top of the shelf'
    result, _ = glip_demo.run_on_web_image(image, caption, 0.5)
    imshow(result, caption)

    When I run the code above, "RuntimeError: Not compiled with GPU support" is reported. Can you provide some advice?

    opened by NickChang97 4
  • Results on COCO 2017 dataset

    Hi,

    Could you please clarify what your actual results on the COCO 2017 dataset are?

    I've just found the submission from the GLIP/Microsoft Research team that took 2nd place on COCO test-dev2019 with AP=0.62 and AP50=0.80.

    Did you use the GLIP-L model or another one that is not public so far?

    https://competitions.codalab.org/competitions/20794#results

    opened by aiotko 4
  • Hopefully the authors will make the GLIP V2 environment a little easier to configure, thank you!

        @dongzhiwu Thank you for your interest in our GLIPv2 work. Currently, the v2 paper is still under review, but we will be sure to open source our codes and models as soon as the process finishes. Again, thank you for the patience and we really appreciate it!
    

    Originally posted by @Haotian-Zhang in https://github.com/microsoft/GLIP/issues/17#issuecomment-1159916498

    opened by linhuixiao 0
  • the role of contrastive words in prompt

    For the demo in Colab, if we simplify the prompt to "person . sofa . remote", it fails to detect "remote".

    But if we add a very different word, e.g. "sky", it works.

    Any idea why we need contrastive words in the prompt? Should we always include contrastive words to improve performance?

    opened by TracyYXChen 0
  • Does not build with current PyTorch

    Hi, thanks for the amazing work! I was interested in experimenting with it when I realized there is an error when building with the last few versions of PyTorch. As this issue pointed out, the THC/THC.h include results in an error because it was deprecated and is no longer part of PyTorch. This prevents the method from being integrated into any environment that uses these more recent versions of PyTorch, and if other code depends on the newer versions, downgrading is not a viable solution. Is there any way to get around this?

    opened by sachit-menon 0
  • Get much lower AP on LVIS after directly following the instructions

    Dear authors,

    By following the commands in https://github.com/microsoft/GLIP#lvis-evaluation with the provided config file and pretrained weights, the evaluation results I get on LVIS minival are much lower than the ones reported in the README.

    The command I used:

    CUDA_VISIBLE_DEVICES=3,4,5,6 python -m torch.distributed.launch --nproc_per_node=4 \
        tools/test_grounding_net.py \
        --config-file configs/pretrain/glip_Swin_T_O365_GoldG.yaml \
        --task_config configs/lvis/minival.yaml \
        --weight PRETRAINED/glip_tiny_model_o365_goldg.pth \
        TEST.EVAL_TASK detection OUTPUT_DIR evals/lvis \
        TEST.CHUNKED_EVALUATION 40 TEST.IMS_PER_BATCH 16 SOLVER.IMS_PER_BATCH 16 \
        TEST.MDETR_STYLE_AGGREGATE_CLASS_NUM 3000 \
        MODEL.RETINANET.DETECTIONS_PER_IMG 300 MODEL.FCOS.DETECTIONS_PER_IMG 300 \
        MODEL.ATSS.DETECTIONS_PER_IMG 300 MODEL.ROI_HEADS.DETECTIONS_PER_IMG 300

    The corresponding evaluation used the config and weights of GLIP-T(C), whose APr should be either ~14.3 or ~17.7, but the numbers I obtained are much lower.

    I have also tried GLIP-T(A), but the results are also much lower. Do you have any suggestions about what I might not have done correctly?

    opened by backseason 0