[CVPR2021] Look before you leap: learning landmark features for one-stage visual grounding.

Overview

LBYL-Net

This repo implements the CVPR 2021 paper Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding.


Getting Started

Prerequisites

  • python 3.7
  • pytorch 1.0.0
  • cuda 10.0
  • gcc 4.9.2 or above
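
A quick way to confirm that your environment roughly matches these prerequisites is to print the interpreter, PyTorch, and CUDA versions (a minimal sanity check, assuming PyTorch is already installed; the exact version strings may differ on your machine):

    python -c "import sys, torch; print(sys.version.split()[0], torch.__version__, torch.version.cuda, torch.cuda.is_available())"
    gcc --version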

Installation

  1. Clone the repo and install the dependencies.
    git clone https://github.com/svip-lab/LBYLNet.git
    cd LBYLNet
    pip install -r requirements.txt
  2. Install our landmark feature convolution extension (a quick import check is sketched after these steps):
    cd ext
    git clone https://github.com/hbb1/landmarkconv.git
    cd landmarkconv/lib/layers
    python setup.py install --user
  3. We follow the dataset structure of DMS and FAOA. For convenience, we have packed them together, including ReferitGame, RefCOCO, RefCOCO+, and RefCOCOg.
    bash data/refer/download_data.sh ./data/refer
  4. Download the generated index files and place them in ./data/refer. They are available at [Gdrive], [One Drive].
  5. Download the pretrained YOLOv3 weights:
    wget -P ext https://pjreddie.com/media/files/yolov3.weights
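
After these steps, a minimal sanity check is to confirm that the landmark convolution extension imports and that the YOLOv3 weights were downloaded. The module name below (landmarks) is the one imported by core/models/context/_pconv/conv4.py in this codebase; if your build of landmarkconv exposes a different name, adjust accordingly:

    python -c "import landmarks; print('landmark convolution OK')"
    ls -lh ext/yolov3.weights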

Training and Evaluation

By default, we use 2 GPUs and a batch size of 64 with DDP (distributed data parallel). We provide several configurations and training logs for reproducing our results. If you want to use different hyperparameters or models, you can create your own configs. Here are some examples:

  • For distributed training with multiple GPUs:

    CUDA_VISIBLE_DEVICES=0,1 python train.py lbyl_lstm_referit_batch64  --workers 8 --distributed --world_size 1  --dist_url "tcp://127.0.0.1:60006"
  • If you use a single GPU or do not use distributed training (make sure to adjust the batch size in the corresponding config file to match your devices):

    CUDA_VISIBLE_DEVICES=0, python train.py lbyl_lstm_referit_batch64  --workers 8
  • For evaluation (a loop over multiple splits is sketched after this list):

    CUDA_VISIBLE_DEVICES=0, python evaluate.py lbyl_lstm_referit_batch64 --testiter 100 --split val
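
To evaluate one checkpoint on several splits in a row, a simple shell loop over the same command works. The sketch below assumes a RefCOCO+ config (lbyl_bert_unc+_batch64, as used in the demo) whose dataset defines the val/testA/testB splits listed in the tables below; substitute your own config and split names:

    for split in val testA testB; do
        CUDA_VISIBLE_DEVICES=0 python evaluate.py lbyl_bert_unc+_batch64 --testiter 100 --split $split
    done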

Trained Models

We provide models retrained with this re-organized codebase, together with their checkpoints and logs, for reproducing our results. To use a trained model, download it from the [Gdrive] and save it in the cache directory. The expected file path is <LBYLNet dir>/cache/nnet/<config>/<dataset>/<config>_100.pkl
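
For example, for the lbyl_bert_unc+_batch64 config, the checkpoint would be placed as sketched below (the dataset directory name unc+ is an assumption inferred from the config name; use the dataset identifier your config actually specifies):

    mkdir -p cache/nnet/lbyl_bert_unc+_batch64/unc+
    # expected location of the downloaded checkpoint:
    #   cache/nnet/lbyl_bert_unc+_batch64/unc+/lbyl_bert_unc+_batch64_100.pkl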

Notice: the reproduced performance is occasionally higher or lower (within a reasonable range) than the results reported in the paper.

In this repo, we provide the performance of our LBYL-Nets below. You can also find the details in <LBYLNet dir>/results and <LBYLNet dir>/logs.

  • Performance on ReferitGame ([email protected]).

    | Dataset     | Language | Split | Paper | Reproduce |
    |-------------|----------|-------|-------|-----------|
    | ReferitGame | LSTM     | test  | 65.48 | 65.98     |
    | ReferitGame | BERT     | test  | 67.47 | 68.48     |
  • Performance on RefCOCO ([email protected]).

    | Dataset | Language | Split | Paper | Reproduce |
    |---------|----------|-------|-------|-----------|
    | RefCOCO | LSTM     | testA | 82.18 | 82.48     |
    | RefCOCO | LSTM     | testB | 71.91 | 71.76     |
    | RefCOCO | BERT     | testA | 82.91 | 82.82     |
    | RefCOCO | BERT     | testB | 74.15 | 72.82     |
  • Performance on RefCOCO+ ([email protected]).

    | Dataset  | Language | Split | Paper | Reproduce |
    |----------|----------|-------|-------|-----------|
    | RefCOCO+ | LSTM     | val   | 66.64 | 66.71     |
    | RefCOCO+ | LSTM     | testA | 73.21 | 72.63     |
    | RefCOCO+ | LSTM     | testB | 56.23 | 55.88     |
    | RefCOCO+ | BERT     | val   | 68.64 | 68.76     |
    | RefCOCO+ | BERT     | testA | 73.38 | 73.73     |
    | RefCOCO+ | BERT     | testB | 59.49 | 59.62     |
  • Performance on RefCOCOg ([email protected]).

    | Dataset  | Language | Split | Paper | Reproduce |
    |----------|----------|-------|-------|-----------|
    | RefCOCOg | LSTM     | val   | 58.72 | 60.03     |
    | RefCOCOg | BERT     | val   | 62.70 | 63.20     |

Demo

We also provide a demo script to test whether the repo is correctly installed. After installing the repo and downloading the pretrained weights, you should be able to use LBYL-Net to ground your own images.

python demo.py

You can change the model, image, or phrase in demo.py. The output image will be saved to imgs/demo_out.jpg.

#!/usr/bin/env python
import cv2
import torch
from core.test.test import _visualize
from core.groundors import Net 
# pick one model
cfg_file = "lbyl_bert_unc+_batch64"
detector = Net(cfg_file, iter=100)
# inference
image = cv2.imread('imgs/demo.jpeg')
phrase = 'the green gaint'
bbox = detector(image, phrase)
_visualize(image, pred_bbox=bbox, phrase=phrase, save_path='imgs/demo_out.jpg', color=(1, 174, 245), draw_phrase=True)

Input:

Output:


Acknowledgements

This repo is organized following CornerNet-Lite, and parts of the code are adapted from FAOA (e.g., data preparation) and MAttNet (e.g., LSTM). We thank the authors for their great work.


Citations:

If you use any part of this repo in your research, please cite our paper:

@InProceedings{huang2021look,
      title={Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding}, 
      author={Huang, Binbin and Lian, Dongze and Luo, Weixin and Gao, Shenghua},
      booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      month = {June},
      year={2021},
}
Comments
  • No module named 'landmarks'

    Hi @piaozhx,

    Could you please tell me what I should do to get rid of this error? I am trying to run training of the model using the command: CUDA_VISIBLE_DEVICES=0,1 python train.py lbyl_lstm_referit_batch64 --workers 8 --distributed --world_size 1 --dist_url "tcp://127.0.0.1:60006"

    but I am getting the following error:

    ngpus_per_node 2
    /media/disk/user/abhinav/LBYLNet/core/../data/refer/data/referit/corpus.pth
    /media/disk/user/abhinav/LBYLNet/core/../data/refer/data/referit/corpus.pth
    /media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
      warnings.warn(warning.format(ret))
    /media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/nn/_reduction.py:43: UserWarning: size_average and reduce args will be deprecated, please use reduction='mean' instead.
      warnings.warn(warning.format(ret))
    /media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/nn/modules/rnn.py:50: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
      "num_layers={}".format(dropout, num_layers))
    /media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/nn/modules/rnn.py:50: UserWarning: dropout option adds dropout after all but last recurrent layer, so non-zero dropout expects num_layers greater than 1, but got dropout=0.2 and num_layers=1
      "num_layers={}".format(dropout, num_layers))
    Traceback (most recent call last):
      File "train.py", line 375, in <module>
        mp.spawn(main, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
      File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
        while not spawn_context.join():
      File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception: 
    
    -- Process 1 terminated with the following error:
    Traceback (most recent call last):
      File "/media/disk/user/abhinav/LBYLNet/env/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File "/media/disk/user/abhinav/LBYLNet/train.py", line 300, in main
        model = LBYLNet(system_config, config["db"])
      File "/media/disk/user/abhinav/LBYLNet/core/models/net/lbylnet.py", line 59, in __init__
        self.context_block = context.get(cfg_sys.context)(self.joint_out_dim, mapdim=self.map_dim)
      File "/media/disk/user/abhinav/LBYLNet/core/models/context/module.py", line 120, in __init__
        self._init_layers(dim)
      File "/media/disk/user/abhinav/LBYLNet/core/models/context/module.py", line 137, in _init_layers
        from ._pconv.conv4 import TopLeftPool, TopRightPool, BottomLeftPool, BottomRightPool
      File "/media/disk/user/abhinav/LBYLNet/core/models/context/_pconv/conv4.py", line 3, in <module>
        import landmarks
    ModuleNotFoundError: No module named 'landmarks'
    
    good first issue 
    opened by abhinavkaul95 6
  • Does the pretrained visual backbone exclude val/test images on refcoco/+/g datasets?

    wget -P ext https://pjreddie.com/media/files/yolov3.weights

    It seems that it actually does not exclude val/test images on the refcoco/+/g datasets. This is unfair, since not removing val/test images during pretraining usually brings about a 2-3 point improvement.

    opened by sean-zhuh 4
  • Is BertEncoder fine-tuned on the visual grounding tasks?

    It seems that BertEncoder is fine-tuned and updated during training, which is unfair compared to "Improving one-stage visual grounding by recursive sub-query construction" and "A fast and accurate one-stage approach to visual grounding".

    opened by PlumedSerpent 2
  • Training getting stopped after some time

    Hi @piaozhx,

    While trying to train the model, training stops after some time and I get the following error at the end:

    /usr/lib/python3.6/multiprocessing/semaphore_tracker.py:143: UserWarning: semaphore_tracker: There appear to be 32 leaked semaphores to clean up at shutdown
      len(cache))
    

    There are some more errors in between as well, but this was the last one. The complete training logs are attached to the issue: logs_training.txt

    opened by abhinavkaul95 2
  • Trained models pickle files corrupted?

    Hi,

    I've tried loading a few of your provided model files from https://drive.google.com/drive/folders/1ICLArOUtWAx_W9nfn7uwobdtIkmN_RoA

    but can't get them to load, neither when using your code (demo and evaluation) nor when just trying to pickle.load() the file normally. There's an error saying (when using your code):

    lib/python3.7/site-packages/torch/serialization.py", line 755, in _legacy_load
        magic_number = pickle_module.load(f, **pickle_load_args)
    _pickle.UnpicklingError: invalid load key, '<'.

    Last line is the same when just trying with pickle.load() in regular python. This happens for all your model files I tried. How were those files created? Knowing this might help resolve the issue.

    Thanks!

    opened by dreichCSL 1
  • The forward( ) and backward( ) about TopLeftPool

    I have a question about TopLeftPool(), TopRightPool(), BottomLeftPool(), and BottomRightPool(). The forward() is meant to take the maximum value for dynamic max pooling, and I think this step already updates the input, so why is a backward() function also needed? What does this function update?

    opened by OliverHuang1220 1
  • How can I get the landmarkconv module?

    Thanks for your great work. In LBYL/core/models/context/_pconv/conv4.py line 6, "from landmarkconv import _C". How can I get the landmarkconv module? Thanks

    opened by shydyl 0
  • Download link of the "ReferitGame" dataset is no longer available

    Hello, the download link of the "ReferitGame" dataset is no longer available. Could you please send me a file through other methods? Thank you very much!

    opened by Rose-bud 0
  • The reproduced performances on unc/unc+

    Hi, the AP I reproduced on unc and unc+ is only 30%, but it is normal on referit and gref. I tried to download the dataset and code again, but it still doesn't work. My experimental environment: CUDA 9.2, PyTorch 1.7, batch_size=32, trained on a single 1080Ti. Is there anything else I should pay attention to?

    help wanted 
    opened by OliverHuang1220 19