Vision-Language Transformer and Query Generation for Referring Segmentation (ICCV 2021)

Overview

Vision-Language Transformer and Query Generation for Referring Segmentation

Please consider citing our paper in your publications if the project helps your research.

@inproceedings{vision-language-transformer,
  title={Vision-Language Transformer and Query Generation for Referring Segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2021}
}

Installation

  1. Environment:

    • Python 3.6

    • tensorflow 1.15

    • Other dependencies in requirements.txt

    • SpaCy model for embedding:

      python -m spacy download en_vectors_web_lg

  2. Dataset preparation

    • Put the COCO training set folder ("train2014") under data/images/.

    • Download the RefCOCO, RefCOCO+, and RefCOCOg datasets from here and extract them to data/. Then run the data preparation script under data/ (a batch-processing sketch for all three datasets follows this list):

      cd data
      python data_process_v2.py --data_root . --output_dir data_v2 --dataset [refcoco/refcoco+/refcocog] --split [unc/umd/google] --generate_mask
      
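If you want to prepare all three datasets in one go, a minimal sketch is given below. It assumes it is run from the repository root, that en_vectors_web_lg has been installed as above, and a common dataset/split pairing (unc for RefCOCO and RefCOCO+, umd for RefCOCOg); adjust the pairs to the splits you need.

    import subprocess
    import sys

    import spacy

    # Sanity check: the word-embedding model from the installation step
    # must be available (python -m spacy download en_vectors_web_lg).
    try:
        spacy.load("en_vectors_web_lg")
    except OSError:
        sys.exit("en_vectors_web_lg is not installed; run the spacy download command first.")

    # Assumed dataset/split pairing; edit to match the splits you need.
    jobs = [("refcoco", "unc"), ("refcoco+", "unc"), ("refcocog", "umd")]

    for dataset, split in jobs:
        subprocess.run(
            [sys.executable, "data_process_v2.py",
             "--data_root", ".",
             "--output_dir", "data_v2",
             "--dataset", dataset,
             "--split", split,
             "--generate_mask"],
            cwd="data",   # the script is run from inside data/, as above
            check=True,   # stop immediately if any preparation run fails
        )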

Evaluating

  1. Download pretrained models & config files from here.

  2. In the config file, set:

    • evaluate_model: path to the pretrained weights.
    • evaluate_set: path to the dataset for evaluation (a quick check of both paths is sketched after this list).
  3. Run

    python vlt.py test [PATH_TO_CONFIG_FILE]
    
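Before running the test command, it can save time to confirm that both paths point to existing files. A minimal sketch, assuming the config is a plain YAML file (the file name below is only an example):

    import os
    import sys

    import yaml  # PyYAML

    CONFIG_PATH = "config/refcocog_eval.yaml"  # example name; use your own config file

    with open(CONFIG_PATH) as f:
        cfg = yaml.safe_load(f)

    # The two keys set in step 2 must point to files on disk.
    for key in ("evaluate_model", "evaluate_set"):
        path = cfg.get(key)
        if not path or not os.path.exists(path):
            sys.exit("%s points to a missing file: %r" % (key, path))

    print("Config looks sane; now run: python vlt.py test " + CONFIG_PATH)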

Training

  1. Pretrained Backbones: We use the backbone weights provided by MCN.

    Note: we use the backbone that excludes all images appearing in the val/test splits of RefCOCO, RefCOCO+, and RefCOCOg.

  2. Specify the hyperparameters, dataset paths, and pretrained weight path in the configuration file. Please refer to the examples under /config or to the config files of our pretrained models (see the config-generation sketch after this list).

  3. Run

    python vlt.py train [PATH_TO_CONFIG_FILE]
    
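A convenient way to manage per-dataset configs, sketched below under the assumption that the configs are plain YAML (the file names are only examples): start from one of the examples under /config, override the dataset- and machine-specific paths, and write a new file to pass to vlt.py. The keys used here (train_set, evaluate_set, seg_gt_path, image_path, pretrained_weights, log_path) follow the sample config printed in the issue further down.

    import yaml  # PyYAML

    TEMPLATE = "config/example.yaml"      # example name: one of the configs under /config
    OUTPUT = "config/refcoco_train.yaml"  # example name for the generated config

    with open(TEMPLATE) as f:
        cfg = yaml.safe_load(f)

    # Override the dataset- and machine-specific entries.
    cfg.update({
        "train_set": "./data/data_v2/anns/refcoco/train.json",
        "evaluate_set": "./data/data_v2/anns/refcoco/val.json",
        "seg_gt_path": "./data/data_v2/masks/refcoco",
        "image_path": "./data/images/train2014",
        "pretrained_weights": "./data/weights/yolov3_480000.h5",
        "log_path": "./log/refcoco",
    })

    with open(OUTPUT, "w") as f:
        yaml.safe_dump(cfg, f, default_flow_style=False)

    print("Now run: python vlt.py train " + OUTPUT)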

Acknowledgement

We borrowed a lot of code from MCN, keras-transformer, RefCOCO API, and keras-yolo3. Thanks for their excellent work!

Comments
  • About training speed.

    Thanks for the great work! I have a question about the training speed: I tried to train the model on 2×V100 GPUs with batch size 256, but training is very slow, about 10 hours per epoch. Is that normal?

    opened by chaoqunwangcs 2
  • Two questions about the model training: weight mismatch and yolov3_480000.h5

    Hi, I am running your code for model training, but I have two questions:

    1. How should the weights in yolov3_480000.h5 be generated? I tried to produce them following the Keras implementation of YOLOv3, converting the Darknet YOLO model to a Keras model with python convert.py yolov3.cfg yolov3.weights model_data/yolo.h5. Is that right?
    2. After generating yolov3_480000.h5, I started training the model, but it prints some warning messages about mismatched weights. Why is that?
    opened by jianhua2022 1
  • Corresponding code for the Query Generation Module

    Thanks for sharing the code. However, I'm quite confused by the code of the QGM, as the naming in the code is a little different from the original paper (if I understand it correctly...).

    I think the code for that module is defined in the function lang_tf_enc of model/transformer_model.py:

    def lang_tf_enc(vision_input,
                    lang_input,
                    head_num=8,
                    hidden_dim=256):
        decoder_embed_lang = TrigPosEmbedding(
            mode=TrigPosEmbedding.MODE_ADD,
            name='Fusion-Lang-Decoder-Embedding',
        )(lang_input)
        decoder_embed_vis = TrigPosEmbedding(
            mode=TrigPosEmbedding.MODE_ADD,
            name='Fusion-Vis-Decoder-Embedding',
        )(vision_input)
        q_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_vis)
        k_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
        v_inp = L.Dense(hidden_dim, activation='relu')(decoder_embed_lang)
        decoded_layer = MultiHeadAttention(head_num=head_num)(
            [q_inp, k_inp, v_inp])
        add_layer = L.Add(name='Fusion-Add')([decoded_layer, vision_input])
    
        return add_layer
    

    As Figure 4 suggests, the input vision features should be the raw vision features extracted from the vision backbone network. Yet the input to this function is the vision-language fused feature Fm_query (built in the function make_multitask_braches of model/vlt_model.py):

    def make_multitask_braches(Fv, fq, fq_word, config):
        # fq: bs, 1024
        # fq_word: bs, 15, 1024
        Fm = simple_fusion(Fv[0], fq, config.jemb_dim)  # 13, 13, 1024
    
        Fm_mid_query = up_proj_cat_proj(Fm, Fv[1], K.int_shape(Fv[1],)[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512
        Fm_query = pool_proj_cat_proj(Fm_mid_query, Fv[2], K.int_shape(Fv[2])[-1], K.int_shape(Fm)[-1]//2)  # 26, 26, 512
    
        Fm_mid_tf = proj_cat(Fm_query, Fm_mid_query, K.int_shape(Fm)[-1]//2)  # 26, 26, 1024
        F_tf = up_proj_cat_proj(Fm, Fm_mid_tf, K.int_shape(Fm)[-1] // 2)
    
        F_tf = V.DarknetConv2D_BN_Leaky(config.hidden_dim, (1, 1))(F_tf)
    
        # Fm_query:  bs, Hm, Wm, C  (None, 26, 26, 512)
        # Fm_top_tf :  bs, Hc, Wc, C  (None, 26, 26, 512)
        query_out = vlt_querynet(Fm_query, config)
        mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
        mask_out = vlt_postproc(mask_out, Fm_query, config)
    
        return mask_out
    

    Can you tell me if I got it wrong? Thanks for your great patience.

    opened by KevinGoodman 0
  • Problem: AttributeError: 'tuple' object has no attribute 'layer'

    Whenever I try to train or test, I always get the same error. I have set up the code as you explain, but I keep getting:

    (ultimate) @fio:Vision-Language-Transformer> python vlt.py train config.yaml
    Using TensorFlow backend.
    batch_size: 128
    embed_dim: 300
    epoches: 50
    evaluate_model: ./models/test_map.h5
    evaluate_set: ./data/data_v2/anns/refcocog/val.json
    free_body: 1
    hidden_dim: 256
    image_path: ./data/images/train2014
    input_size: 416
    jemb_dim: 1024
    lang_att: True
    log_images: 0
    log_path: ./log/refcocog
    lr: 0.001
    lr_scheduler: step
    max_queue_size: 10
    multi_thres: False
    num_query: 16
    pretrained_weights: ./data/weights/yolov3_480000.h5
    query_balance: True
    rnn_bidirectional: True
    rnn_drop_out: 0.1
    rnn_hidden_size: 1024
    seed: 10010
    seg_gt_path: ./data/data_v2/masks/refcocog
    seg_out_stride: 2
    segment_thresh: 0.35
    start_epoch: 0
    steps: [40, 45, 50]
    train_set: ./data/data_v2/anns/refcocog/train.json
    transformer_decoder_num: 2
    transformer_encoder_num: 2
    transformer_head_num: 8
    transformer_hidden_dim: 256
    word_embed: en_vectors_web_lg
    word_len: 20
    workers: 32
    
    
    --------------------------
    PHASE:train
    
    1 GPUs detected:
    ['/device:GPU:0']
    Dataset Loaded: evaluate_set,  Len: 5000
    Dataset Loaded: train_set,  Len: 44822
    Creating model...
    Traceback (most recent call last):
      File "vlt.py", line 54, in <module>
        trainer = Trainer(config, log_path, GPUS=GPU_COUNTS, debug=args.debug, verbose=args.verbose)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 117, in __init__
        super(Trainer, self).__init__(config, **kwargs)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 39, in __init__
        self.yolo_model, self.yolo_body, self.yolo_body_single = self.create_model()
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/executor.py", line 54, in create_model
        model_body = yolo_body(image_input, q_input, self.config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 162, in yolo_body
        mask_out = make_multitask_braches(Fv, fq, fq_word, config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 79, in make_multitask_braches
        mask_out = vlt_transformer(F_tf, fq_word, query_out, config)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/vlt_model.py", line 100, in vlt_transformer
        head_num=config.transformer_head_num)
      File "/home/fio/Documents/UMU/semestre4TFM/code/demos/LG/test2/Vision-Language-Transformer/model/transfromer_model.py", line 78, in lang_tf_enc
        )(lang_input)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 881, in __call__
        inputs, outputs, args, kwargs)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2043, in _set_connectivity_metadata_
        input_tensors=inputs, output_tensors=outputs, arguments=arguments)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2059, in _add_inbound_node
        input_tensors)
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in map_structure
        structure[0], [func(*x) for x in entries],
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/util/nest.py", line 536, in <listcomp>
        structure[0], [func(*x) for x in entries],
      File "/home/fio/anaconda3/envs/ultimate/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/base_layer.py", line 2058, in <lambda>
        inbound_layers = nest.map_structure(lambda t: t._keras_history.layer,
    AttributeError: 'tuple' object has no attribute 'layer'
    

    I have installed all the versions you specify with the requirements, and I am able to run the data_process_v2 script without any problem. Do you know what is happening?

    Thank you in advance,

    Ferriol

    opened by FioPio 6
  • Confusion about data_process_v2

    Hello, I just checked the file 'data/data_process_v2.py', and I found something confusing.

    Since in line 98 you check if dataset == 'refclef', you apparently take the RefClef dataset into account, not only RefCOCO, RefCOCO+, and RefCOCOg. But should categories in RefClef be processed the same way as RefCOCO*, as in the cat_process function? I guess cat_process converts the 91 COCO categories to the 80-category set. Does this apply to RefClef in the same way?

    By the way, still in line 98, why should ['19579.jpg', '17975.jpg', '19575.jpg'] be excluded? Is there any explanation?

    Your reply would be highly appreciated, thanks :)

    opened by huangjy-pku 0