Vision-Language Transformer and Query Generation for Referring Segmentation
Please consider citing our paper in your publications if the project helps your research.
```
@inproceedings{vision-language-transformer,
  title={Vision-Language Transformer and Query Generation for Referring Segmentation},
  author={Ding, Henghui and Liu, Chang and Wang, Suchen and Jiang, Xudong},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision},
  year={2021}
}
```
Installation
- Environment:
  - Python 3.6
  - TensorFlow 1.15
  - Other dependencies in `requirements.txt`
- SpaCy model for embedding:
  ```
  python -m spacy download en_vectors_web_lg
  ```
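As a concrete example, the full environment setup might look like the following. This is only a sketch assuming a conda-based workflow; the environment name `vlt` is illustrative, and `requirements.txt` may already pin some of these packages.

```bash
# Create and activate a Python 3.6 environment (the name "vlt" is illustrative)
conda create -n vlt python=3.6
conda activate vlt

# Install TensorFlow 1.15 (use tensorflow==1.15 instead for a CPU-only setup)
pip install tensorflow-gpu==1.15

# Install the remaining dependencies
pip install -r requirements.txt

# Download the SpaCy embedding model
python -m spacy download en_vectors_web_lg
```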
Dataset preparation
- Put the folder of the COCO training set ("`train2014`") under `data/images/`.
- Download the RefCOCO dataset from here and extract it to `data/`. Then run the script for data preparation under `data/`:
  ```
  cd data
  python data_process_v2.py --data_root . --output_dir data_v2 --dataset [refcoco/refcoco+/refcocog] --split [unc/umd/google] --generate_mask
  ```
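For example, to prepare RefCOCO with the UNC split (the other dataset/split combinations above work the same way):

```bash
cd data
python data_process_v2.py --data_root . --output_dir data_v2 --dataset refcoco --split unc --generate_mask
```

With these arguments, the processed files are written under `data/data_v2/`.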
Evaluating
- Download pretrained models & config files from here.
- In the config file, set the following (see the sketch after this list):
  - `evaluate_model`: path to the pretrained weights
  - `evaluate_set`: path to the dataset for evaluation
- Run:
  ```
  python vlt.py test [PATH_TO_CONFIG_FILE]
  ```
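For reference, the two evaluation entries might look like this in the config file. This is only a sketch: the exact key layout should follow the downloaded config files, and both paths are hypothetical placeholders.

```
evaluate_model: ./pretrained/vlt_refcoco.h5    # hypothetical path to the downloaded weights
evaluate_set: ./data/data_v2/refcoco_unc_val   # hypothetical path to the prepared evaluation split
```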
Training
- Pretrained backbones: we use the backbone weights provided by MCN.
  Note: we use the backbone that excludes all images that appear in the val/test splits of RefCOCO, RefCOCO+, and RefCOCOg.
- Specify the hyperparameters, dataset path, and pretrained weight path in the configuration file. Please refer to the examples under `/config`, or to the config file of our pretrained models.
- Run:
  ```
  python vlt.py train [PATH_TO_CONFIG_FILE]
  ```
Acknowledgement
We borrowed a lot of code from MCN, keras-transformer, RefCOCO API, and keras-yolo3. Thanks for their excellent work!