Vision-Language Transformer and Query Generation for Referring Segmentation (ICCV 2021)


Please consider citing our paper in your publications if the project helps your research.

  title={Vision-Language Transformer and Query Generation for Referring Segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},


  1. Environment:

    • Python 3.6

    • tensorflow 1.15

    • Other dependencies in requirements.txt

    • SpaCy model for embedding:

      python -m spacy download en_vectors_web_lg

  2. Dataset preparation

    • Put the folder of COCO training set ("train2014") under data/images/.

    • Download the RefCOCO dataset from here and extract it to data/. Then run the data preparation script under data/:

      cd data
      python --data_root . --output_dir data_v2 --dataset [refcoco/refcoco+/refcocog] --split [unc/umd/google] --generate_mask
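
After the preparation script finishes, it can help to confirm that the expected folders exist before training. The sketch below is only an illustration: the function names `expected_paths` and `missing_paths` are hypothetical, and the layout (`images/train2014`, `data_v2/anns/<dataset>/`, `data_v2/masks/<dataset>`) is inferred from the paths in the sample configuration printed further down this page, so adjust it if your output differs.

```python
import os


def expected_paths(data_root, dataset):
    """Paths a prepared dataset is assumed to contain (layout is an assumption)."""
    return [
        os.path.join(data_root, "images", "train2014"),
        os.path.join(data_root, "data_v2", "anns", dataset, "train.json"),
        os.path.join(data_root, "data_v2", "anns", dataset, "val.json"),
        os.path.join(data_root, "data_v2", "masks", dataset),
    ]


def missing_paths(data_root, dataset, exists=os.path.exists):
    """Return the expected paths that are not present.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    return [p for p in expected_paths(data_root, dataset) if not exists(p)]


if __name__ == "__main__":
    for path in missing_paths("data", "refcocog"):
        print("missing:", path)
```

An empty result only means the folders and annotation files are present; it does not validate their contents.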


  1. Download pretrained models & config files from here.

  2. In the config file, set:

    • evaluate_model: path to the pretrained weights
    • evaluate_set: path to the dataset for evaluation.
  3. Run

    python test [PATH_TO_CONFIG_FILE]


  1. Pretrained Backbones: We use the backbone weights provided by MCN.

    Note: we use the backbone that excludes all images that appear in the val/test splits of RefCOCO, RefCOCO+ and RefCOCOg.

  2. Specify the hyperparameters, dataset path and pretrained weight path in the configuration file. Please refer to the examples under /config, or to the config files of our pretrained models.

  3. Run

    python train [PATH_TO_CONFIG_FILE]


We borrowed a lot of code from MCN, keras-transformer, RefCOCO API and keras-yolo3. Thanks for their excellent work!

  • About training speed.

    Thanks for the great work! What training speed should I expect? I tried to train the model on 2 × V100 with batch size 256, but training is very slow, about 10 hours per epoch. Is that normal?

    opened by chaoqunwangcs 2
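
To put the reported figure in perspective, the time per training step can be estimated from the epoch time. The issue does not say which dataset was used, so the sketch below assumes the RefCOCOg training set size that appears in the log further down this page (44822 samples); the function name `seconds_per_step` is hypothetical.

```python
import math


def seconds_per_step(num_samples, batch_size, hours_per_epoch):
    """Return (steps per epoch, average seconds per step)."""
    steps = math.ceil(num_samples / batch_size)
    return steps, hours_per_epoch * 3600.0 / steps


# Assumed values: 44822 samples (RefCOCOg train split in the log below),
# batch size 256 and 10 hours per epoch as reported in the issue.
steps, sec = seconds_per_step(44822, 256, 10)
print(steps, round(sec, 1))
```

Under these assumptions an epoch is 176 steps, so 10 hours per epoch corresponds to roughly 205 seconds per step. A different dataset size changes the estimate proportionally.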
  • Two questions about the model training: weight mismatch and yolov3_480000.h5

    Hi, I am running your code for model training, but I have two questions:

    1. How do I generate the weights file yolov3_480000.h5? I tried to generate it by following the Keras implementation of YOLOv3 and converting the Darknet YOLO model to a Keras model: python yolov3.cfg yolov3.weights model_data/yolo.h5. Is that right?
    2. After generating yolov3_480000.h5, I started training the model, but it shows some warning messages. Why?
    opened by jianhua2022 1
  • Corresponding code for the Query Generation Module

    Thanks for sharing the code. However, I'm quite confused by the code of the QGM, as the naming in the code is a little different from the original paper (if I understand it correctly...).

    I think the code for that module is defined in function lang_tf_enc of model/

    def lang_tf_enc(vision_input,
        decoder_embed_lang = TrigPosEmbedding(
        decoder_embed_vis = TrigPosEmbedding(
        q_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_vis)
        k_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
        v_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
        decoded_layer = MultiHeadAttention(head_num=head_num)(
            [q_inp, k_inp, v_inp])
        add_layer = L.Add(name='Fusion-Add')([decoded_layer, vision_input])
        return add_layer

    As Figure 4 suggests, the input vision features should be the raw vision features extracted from the vision backbone network. Yet the input to this function is Fm_query, which is already fused from vision and language features (in function make_multitask_braches of model/):

    def make_multitask_braches(Fv, fq, fq_word, config):
        # fq: bs, 1024
        # fq_word: bs, 15, 1024
        Fm = simple_fusion(Fv[0], fq, config.jemb_dim)  # 13, 13, 1024
        Fm_mid_query = up_proj_cat_proj(Fm, Fv[1], K.int_shape(Fv[1],)[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512
        Fm_query = pool_proj_cat_proj(Fm_mid_query, Fv[2], K.int_shape(Fv[2])[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512
        Fm_mid_tf = proj_cat(Fm_query, Fm_mid_query, K.int_shape(Fm)[-1]//2)  # 26, 26, 1024
        F_tf = up_proj_cat_proj(Fm, Fm_mid_tf, K.int_shape(Fm)[-1] // 2)
        F_tf = V.DarknetConv2D_BN_Leaky(config.hidden_dim, (1, 1))(F_tf)
        # Fm_query:  bs, Hm, Wm, C  (None, 26, 26, 512)
        # Fm_top_tf :  bs, Hc, Wc, C  (None, 26, 26, 512)
        query_out = vlt_querynet(Fm_query, config)
        mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
        mask_out = vlt_postproc(mask_out, Fm_query, config)
        return mask_out

    Can you tell me if I got it wrong? Thanks for your great patience.

    opened by KevinGoodman 0
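
For readers following this question, the attention pattern in the quoted `lang_tf_enc` excerpt can be illustrated in isolation: queries come from the vision features, keys and values come from the language features, and the result is added back to the vision input. The sketch below is a simplified stand-in, not the repository's code. It uses a single head in plain Python and omits the `Dense` projections and `TrigPosEmbedding` layers, and the names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; one output row per query row."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def fuse(vision, language):
    """Vision rows attend over language rows, then a residual add with vision."""
    attended = cross_attention(vision, language, language)
    return [[a + v for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(attended, vision)]


vision = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3 spatial positions
language = [[1.0, 1.0], [1.0, 1.0]]             # 2 word features
fused = fuse(vision, language)
```

The output has one row per vision position, which is what makes the final `Add` with `vision_input` shape-compatible. The sketch says nothing about whether that input should be raw backbone features or already fused features; that remains the open question of the issue.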
  • Problem:   AttributeError: 'tuple' object has no attribute 'layer'

    Whenever I try to train, test, or run anything else, I always get the same error. I have set up the code as you explain, but the error persists:

    (ultimate) @fio:Vision-Language-Transformer> python train config.yaml
    Using TensorFlow backend.
    batch_size: 128
    embed_dim: 300
    epoches: 50
    evaluate_model: ./models/test_map.h5
    evaluate_set: ./data/data_v2/anns/refcocog/val.json
    free_body: 1
    hidden_dim: 256
    image_path: ./data/images/train2014
    input_size: 416
    jemb_dim: 1024
    lang_att: True
    log_images: 0
    log_path: ./log/refcocog
    lr: 0.001
    lr_scheduler: step
    max_queue_size: 10
    multi_thres: False
    num_query: 16
    pretrained_weights: ./data/weights/yolov3_480000.h5
    query_balance: True
    rnn_bidirectional: True
    rnn_drop_out: 0.1
    rnn_hidden_size: 1024
    seed: 10010
    seg_gt_path: ./data/data_v2/masks/refcocog
    seg_out_stride: 2
    segment_thresh: 0.35
    start_epoch: 0
    steps: [40, 45, 50]
    train_set: ./data/data_v2/anns/refcocog/train.json
    transformer_decoder_num: 2
    transformer_encoder_num: 2
    transformer_head_num: 8
    transformer_hidden_dim: 256
    word_embed: en_vectors_web_lg
    word_len: 20
    workers: 32
    1 GPUs detected:
    Dataset Loaded: evaluate_set,  Len: 5000
    Dataset Loaded: train_set,  Len: 44822
    Creating model...
    Traceback (most recent call last):
      File "", line 54, in <module>
        trainer = Trainer(config, log_path, GPUS=GPU_COUNTS, debug=args.debug, verbose=args.verbose)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/", line 117, in __init__
        super(Trainer, self).__init__(config, **kwargs)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/", line 39, in __init__
        self.yolo_model, self.yolo_body, self.yolo_body_single = self.create_model()
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/", line 54, in create_model
        model_body = yolo_body(image_input, q_input, self.config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/", line 162, in yolo_body
        mask_out = make_multitask_braches(Fv, fq, fq_word, config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/", line 79, in make_multitask_braches
        mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/", line 100, in vlt_transformer
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/", line 78, in lang_tf_enc
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/", line 881, in __call__
        inputs, outputs, args, kwargs)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/", line 2043, in _set_connectivity_metadata_
        input_tensors=inputs, output_tensors=outputs, arguments=arguments)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/", line 2059, in _add_inbound_node
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/", line 536, in map_structure
        structure[0], [func(*x) for x in entries],
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/", line 536, in <listcomp>
        structure[0], [func(*x) for x in entries],
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/", line 2058, in <lambda>
        inbound_layers = nest.map_structure(lambda t: t._keras_history.layer,
    AttributeError: 'tuple' object has no attribute 'layer'

    I have installed all the versions you specify with the requirements, and I am able to run the data_process_v2 script without any problem. Do you know what is happening?

    Thank you in advance,


    opened by FioPio 6
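
The page does not record a confirmed cause for this error. One plausible explanation, stated here as an assumption, is that a tensor created by standalone Keras (the log prints "Using TensorFlow backend.") is passed to a layer built on `tensorflow.keras` (the traceback runs through `tensorflow_core/python/keras`). The assumption is that standalone Keras stores `_keras_history` as a plain tuple, while `tf.keras` in TensorFlow 1.15 stores a named tuple with a `layer` field. The toy classes below are hypothetical and only reproduce the failing attribute access from the last traceback frame.

```python
from collections import namedtuple

# Stand-in for the named tuple that tf.keras is assumed to attach to tensors.
KerasHistory = namedtuple("KerasHistory", ["layer", "node_index", "tensor_index"])


class FakeTensor:
    """Hypothetical tensor carrying only the `_keras_history` attribute."""

    def __init__(self, history):
        self._keras_history = history


def inbound_layer(tensor):
    """Mirrors the failing expression in the traceback: t._keras_history.layer."""
    return tensor._keras_history.layer


def history_style(tensor):
    """Report which kind of history object a tensor carries."""
    history = getattr(tensor, "_keras_history", None)
    if history is None:
        return "none"
    return "named" if hasattr(history, "layer") else "plain tuple"


tf_style = FakeTensor(KerasHistory("dense", 0, 0))
keras_style = FakeTensor(("dense", 0, 0))

print(history_style(tf_style), inbound_layer(tf_style))
try:
    inbound_layer(keras_style)
except AttributeError as err:
    print(history_style(keras_style), "->", err)
```

If this is indeed the cause, the usual remedy is to make sure every layer in the model comes from the same Keras implementation; whether that applies to this repository is not established here.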
  • Confusion about data_process_v2

    Hello, I just checked the file 'data/', and I found something confusing.

    In line 98 you check "if dataset == 'refclef'", so apparently the script handles the RefCLEF dataset as well as RefCOCO, RefCOCO+ and RefCOCOg. But should the categories in RefCLEF be processed the same way as those in RefCOCO*, as in the cat_process function? I guess cat_process converts the COCO 91-category ids to the 80-category ids. Does this work for RefCLEF in the same way?

    By the way, still in line 98, why should ['19579.jpg', '17975.jpg', '19575.jpg'] be excluded? Is there any explanation?

    Your reply would be highly appreciated, thanks :)

    opened by huangjy-pku 0
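
The conversion guessed at in this issue can be sketched as follows, assuming `cat_process` maps the sparse COCO category ids (1 to 90, with ten ids unused among the 80 object classes) to contiguous indices. The actual function in the repository may order or offset the indices differently; the names `COCO_VALID_IDS`, `ID_TO_INDEX` and `to_contiguous` are hypothetical.

```python
# Ids in 1..90 that the 80-class COCO detection set does not use.
UNUSED_IDS = {12, 26, 29, 30, 45, 66, 68, 69, 71, 83}

# The 80 category ids that do occur, in ascending order.
COCO_VALID_IDS = [cid for cid in range(1, 91) if cid not in UNUSED_IDS]

# Sparse COCO id -> contiguous 0-based index (the 0-based choice is an assumption).
ID_TO_INDEX = {cid: idx for idx, cid in enumerate(COCO_VALID_IDS)}


def to_contiguous(category_id):
    """Map a COCO category id to its contiguous index, or raise ValueError."""
    if category_id not in ID_TO_INDEX:
        raise ValueError("not a COCO 80-class category id: %r" % (category_id,))
    return ID_TO_INDEX[category_id]


print(len(COCO_VALID_IDS), to_contiguous(1), to_contiguous(13), to_contiguous(90))
```

A table like this is specific to COCO numbering. It would only make sense for RefCLEF if that dataset's category ids followed the same scheme, which the issue does not establish.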
Henghui Ding
