SeqTR
This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling for visual grounding tasks under a novel point prediction paradigm.
Installation
Prerequisites
pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz
Then install the SeqTR package in editable mode:
pip install -e .
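After installation, a quick check that the packages can be found may save a failed training run later. The following is a minimal sketch, not part of the repository; the module names `seqtr`, `spacy`, and `en_vectors_web_lg` are assumptions inferred from the install steps above and may need adjusting.

```python
import importlib.util

# Module names are assumptions based on the install steps above;
# adjust them to match what requirements.txt actually installs.
REQUIRED_MODULES = ["seqtr", "spacy", "en_vectors_web_lg"]


def find_missing_modules(names):
    """Return the subset of top-level module `names` that cannot be located."""
    missing = []
    for name in names:
        # find_spec locates a top-level module without executing it.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    print("missing modules:", missing if missing else "none")
```

An empty result only means the modules are discoverable; it does not verify versions.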
Data Preparation
- Download our preprocessed JSON files, including the merged dataset for pre-training, and the DarkNet-53 model weights trained on the MS-COCO object detection task.
- Download the train2014 images from the official MS-COCO website or from Joseph Redmon's MS-COCO mirror, which downloads faster than the official website.
- Download the original Flickr30K images, ReferItGame images, and Visual Genome images.
The project structure should look like the following:
|-- SeqTR
    |-- data
        |-- annotations
            |-- flickr30k
                |-- instances.json
                |-- ix_to_token.pkl
                |-- token_to_ix.pkl
                |-- word_emb.npz
            |-- referitgame-berkeley
            |-- refcoco-unc
            |-- refcocoplus-unc
            |-- refcocog-umd
            |-- refcocog-google
            |-- pretraining-vg
        |-- weights
            |-- darknet.weights
            |-- yolov3.weights
        |-- images
            |-- mscoco
                |-- train2014
                    |-- COCO_train2014_000000000072.jpg
                    |-- ...
            |-- saiaprtc12
                |-- 25.jpg
                |-- ...
            |-- flickr30k
                |-- 36979.jpg
                |-- ...
            |-- visual-genome
                |-- 2412112.jpg
                |-- ...
    |-- configs
    |-- seqtr
    |-- tools
    |-- teaser
Note that darknet.weights excludes the val/test images of the RefCOCO/+/g datasets from its training data, whereas yolov3.weights does not.
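Before training, it can help to confirm that the data layout matches the tree above. The sketch below is an illustrative helper, not part of the repository. It assumes that `annotations`, `weights`, and `images` sit directly under `data`, and that the script is given the path to the SeqTR checkout.

```python
from pathlib import Path

# Paths relative to the SeqTR checkout, read off the directory tree above.
# The nesting under data/ is an assumption; edit the list if yours differs.
EXPECTED_PATHS = [
    "data/annotations/flickr30k/instances.json",
    "data/annotations/flickr30k/ix_to_token.pkl",
    "data/annotations/flickr30k/token_to_ix.pkl",
    "data/annotations/flickr30k/word_emb.npz",
    "data/annotations/referitgame-berkeley",
    "data/annotations/refcoco-unc",
    "data/annotations/refcocoplus-unc",
    "data/annotations/refcocog-umd",
    "data/annotations/refcocog-google",
    "data/annotations/pretraining-vg",
    "data/weights/darknet.weights",
    "data/weights/yolov3.weights",
    "data/images/mscoco/train2014",
    "data/images/saiaprtc12",
    "data/images/flickr30k",
    "data/images/visual-genome",
]


def missing_paths(root, expected=EXPECTED_PATHS):
    """Return the entries of `expected` that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("."):
        print("missing:", rel)
```

Run it from the repository root; no output means every listed path was found.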
Training
Phrase Localization and Referring Expression Comprehension
We train SeqTR to perform grounding at the bounding-box level on a single V100 GPU. The following script performs the training:
python tools/train.py configs/seqtr/detection/seqtr_det_[DATASET_NAME].py --cfg-options ema=True
[DATASET_NAME] is one of "flickr30k", "referitgame-berkeley", "refcoco-unc", "refcocoplus-unc", "refcocog-umd", and "refcocog-google".
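To train on several datasets in turn, the command can be assembled programmatically. This is a small sketch that only reproduces the command line shown above; the helper name `detection_train_command` is hypothetical and not part of the repository.

```python
# Dataset names as listed above for the detection configs.
DETECTION_DATASETS = (
    "flickr30k",
    "referitgame-berkeley",
    "refcoco-unc",
    "refcocoplus-unc",
    "refcocog-umd",
    "refcocog-google",
)


def detection_train_command(dataset_name, ema=True):
    """Build the training command (as an argument list) for one dataset."""
    if dataset_name not in DETECTION_DATASETS:
        raise ValueError(f"unknown dataset name: {dataset_name!r}")
    config = f"configs/seqtr/detection/seqtr_det_{dataset_name}.py"
    cmd = ["python", "tools/train.py", config]
    if ema:
        cmd += ["--cfg-options", "ema=True"]
    return cmd


if __name__ == "__main__":
    for name in DETECTION_DATASETS:
        print(" ".join(detection_train_command(name)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.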
Referring Expression Segmentation
To train SeqTR to generate the target point sequence of the ground-truth mask, which is then assembled into the predicted mask by connecting the points, run the following script:
python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True
Note that the sampling settings differ between datasets. For RefCOCO, we sample 18 points and do not shuffle the sequence. For RefCOCO+ and RefCOCOg, we uniformly sample 12 points on the mask contour and randomly shuffle the sequence with a fraction of 20%. Therefore, to train on the RefCOCO+/g datasets, set num_ray to 12 at line 1 and model.head.shuffle_fraction to 0.2 at line 35 in configs/seqtr/segmentation/seqtr_mask_darknet.py.
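To make the point-sequence idea concrete, here is a minimal sketch of sampling points at equal arc-length spacing along a closed polygon contour and optionally shuffling them. It is not the repository's implementation. Two assumptions are made: the mask contour is already available as a list of polygon vertices, and the shuffle fraction is interpreted as the probability that a given sequence is shuffled.

```python
import math
import random


def sample_contour_points(polygon, num_points):
    """Sample `num_points` points at equal arc-length spacing on a closed polygon."""
    n = len(polygon)
    # Edge i runs from vertex i to vertex (i + 1) % n, which closes the contour.
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    lengths = [math.dist(a, b) for a, b in edges]
    step = sum(lengths) / num_points

    points = []
    edge_idx, edge_start = 0, 0.0  # arc length at which the current edge begins
    for k in range(num_points):
        target = k * step
        # Advance to the edge that contains arc-length position `target`.
        while edge_idx < n - 1 and target >= edge_start + lengths[edge_idx]:
            edge_start += lengths[edge_idx]
            edge_idx += 1
        (x0, y0), (x1, y1) = edges[edge_idx]
        length = lengths[edge_idx]
        t = (target - edge_start) / length if length > 0 else 0.0
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


def maybe_shuffle(points, shuffle_fraction, rng=random):
    """Return a copy of `points`, shuffled with probability `shuffle_fraction`.

    Assumption: the fraction is a per-sequence probability; the repository
    may apply it differently.
    """
    points = list(points)
    if shuffle_fraction > 0 and rng.random() < shuffle_fraction:
        rng.shuffle(points)
    return points


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]
    sequence = maybe_shuffle(sample_contour_points(square, 12), 0.2)
    print(sequence)
```

Connecting the unshuffled points in order, and the last back to the first, gives a polygon that approximates the original contour; more points give a closer fit.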
Evaluation
python tools/test.py [PATH_TO_CONFIG_FILE] --load-from [PATH_TO_CHECKPOINT_FILE]
Pre-training + fine-tuning
We pre-train SeqTR on 8 V100 GPUs with Large Scale Jittering (LSJ) and Exponential Moving Average (EMA) disabled:
bash tools/dist_train.sh configs/seqtr/detection/seqtr_det_pretraining-vg.py 8
Models
Method | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u
---|---|---|---|---|---|---|---|---|---
SeqTR on REC | 81.23 | 85.00 | 76.08 | 68.82 | 75.37 | 58.78 | - | 71.35 | 71.58
SeqTR* on REC | 83.72 | 86.51 | 81.24 | 71.45 | 76.26 | 64.88 | 71.50 | 74.86 | 74.21
SeqTR pre-trained+finetuned on REC | 87.00 | 90.15 | 83.59 | 78.69 | 84.51 | 71.87 | - | 82.69 | 83.37
SeqTR on RES | 67.26 | 69.79 | 64.12 | 54.14 | 58.93 | 48.19 | - | 55.67 | 55.64
Contributing
Our code is highly modular and easy to extend to new architectures. For instance, you can register new components such as heads or fusion modules to try out your research ideas, or register new data augmentation techniques, just as in the mmdetection library. Feel free to play :-).
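For readers unfamiliar with the registry pattern used by mmdetection-style code bases, the following self-contained sketch shows the idea. The `Registry` class here is a toy stand-in written for illustration, not the mmcv class, and `MyPointHead` is a hypothetical component; in the actual project you would use the registries that the code base provides.

```python
class Registry:
    """Toy name-to-class registry (illustration only, not the mmcv API)."""

    def __init__(self, name):
        self.name = name
        self._modules = {}

    def register_module(self):
        """Decorator that records a class under its own name."""
        def decorator(cls):
            if cls.__name__ in self._modules:
                raise KeyError(f"{cls.__name__} is already registered in {self.name}")
            self._modules[cls.__name__] = cls
            return cls
        return decorator

    def build(self, cfg):
        """Instantiate the class named by cfg['type'] with the remaining keys."""
        cfg = dict(cfg)  # copy so the caller's config is not modified
        cls = self._modules[cfg.pop("type")]
        return cls(**cfg)


HEADS = Registry("head")


@HEADS.register_module()
class MyPointHead:
    """Hypothetical head used only to demonstrate registration."""

    def __init__(self, num_points=18):
        self.num_points = num_points


# A config dict selects the component by name, as in mmdetection-style configs.
head = HEADS.build(dict(type="MyPointHead", num_points=12))
print(type(head).__name__, head.num_points)
```

The benefit of this design is that a new component becomes selectable from a config file by name, without editing the code that builds the model.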
Citation
@article{zhu2022seqtr,
title={SeqTR: A Simple yet Universal Network for Visual Grounding},
author={Zhu, ChaoYang and Zhou, YiYi and Shen, YunHang and Luo, Gen and Pan, XingJia and Lin, MingBao and Chen, Chao and Cao, LiuJuan and Sun, XiaoShuai and Ji, RongRong},
journal={arXiv preprint arXiv:2203.16265},
year={2022}
}
Acknowledgement
Our code is built upon the open-sourced mmcv and mmdetection libraries.