SeqTR: A Simple yet Universal Network for Visual Grounding

Overview

This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling for visual grounding tasks under a novel point prediction paradigm.

Installation

Prerequisites

pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz

Then install the SeqTR package in editable mode:

pip install -e .
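
If the vector package installed correctly, it should be loadable from Python. The following is an optional sanity check (the reported vector-table size is approximate):

import spacy

# en_vectors_web_lg ships the GloVe vectors used for the word embeddings;
# it should load without error after the two installs above.
nlp = spacy.load("en_vectors_web_lg")
print(nlp.vocab.vectors.shape)  # roughly 1.1 million 300-d vectors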

Data Preparation

  1. Download our preprocessed json files, including the merged dataset for pre-training, and the DarkNet-53 model weights trained on the MS-COCO object detection task.
  2. Download the train2014 images from MS-COCO or from Joseph Redmon's mirror, which is usually faster to download from than the official website.
  3. Download original Flickr30K images, ReferItGame images, and Visual Genome images.

The project structure should look like the following:

| -- SeqTR
     | -- data
        | -- annotations
            | -- flickr30k
                | -- instances.json
                | -- ix_to_token.pkl
                | -- token_to_ix.pkl
                | -- word_emb.npz
            | -- referitgame-berkeley
            | -- refcoco-unc
            | -- refcocoplus-unc
            | -- refcocog-umd
            | -- refcocog-google
            | -- pretraining-vg 
        | -- weights
            | -- darknet.weights
            | -- yolov3.weights
        | -- images
            | -- mscoco
                | -- train2014
                    | -- COCO_train2014_000000000072.jpg
                    | -- ...
            | -- saiaprtc12
                | -- 25.jpg
                | -- ...
            | -- flickr30k
                | -- 36979.jpg
                | -- ...
            | -- visual-genome
                | -- 2412112.jpg
                | -- ...
     | -- configs
     | -- seqtr
     | -- tools
     | -- teaser

Note that darknet.weights was trained with the val/test images of the RefCOCO/+/g datasets excluded, while yolov3.weights was not.
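
Once the annotations are in place, the snippet below is a hedged way to confirm they load; it assumes instances.json is plain JSON, the .pkl files are pickled token-index dictionaries, and word_emb.npz is a NumPy archive (paths follow the tree above):

import json, pickle
import numpy as np

ann_dir = "data/annotations/refcoco-unc"
with open(f"{ann_dir}/instances.json") as f:
    instances = json.load(f)          # preprocessed referring-expression annotations
with open(f"{ann_dir}/token_to_ix.pkl", "rb") as f:
    token_to_ix = pickle.load(f)      # assumed token -> index vocabulary
word_emb = np.load(f"{ann_dir}/word_emb.npz")  # assumed pre-computed word embeddings
print(len(token_to_ix), "tokens;", {k: word_emb[k].shape for k in word_emb.files})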

Training

Phrase Localization and Referring Expression Comprehension

We train SeqTR to perform grounding at the bounding-box level on a single V100 GPU. The following script performs the training:

python tools/train.py configs/seqtr/detection/seqtr_det_[DATASET_NAME].py --cfg-options ema=True

[DATASET_NAME] is one of "flickr30k", "referitgame-berkeley", "refcoco-unc", "refcocoplus-unc", "refcocog-umd", and "refcocog-google".

Referring Expression Segmentation

To train SeqTR to generate the point sequence of the ground-truth mask, which is then assembled into the predicted mask by connecting the points, run the following script:

python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True

Note that for the RefCOCO dataset we uniformly sample 18 points on the mask contour and do not shuffle the sequence, whereas for RefCOCO+ and RefCOCOg we sample 12 points and randomly shuffle 20% of the sequences. Therefore, to train on the RefCOCO+/g datasets, change num_ray at line 1 to 12 and model.head.shuffle_fraction at line 35 to 0.2 in configs/seqtr/segmentation/seqtr_mask_darknet.py.
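
For reference, the relevant entries in configs/seqtr/segmentation/seqtr_mask_darknet.py would then look roughly like the sketch below (the line numbers and surrounding keys come from the description above and may differ in the actual file):

# RefCOCO: num_ray = 18 and no shuffling; RefCOCO+/g: num_ray = 12 and 20% shuffling.
num_ray = 12
model = dict(
    head=dict(
        shuffle_fraction=0.2,
    ),
)

Since tools/train.py already accepts --cfg-options (as used above for ema=True), the same overrides can presumably also be passed on the command line, e.g. --cfg-options num_ray=12 model.head.shuffle_fraction=0.2.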

Evaluation

python tools/test.py [PATH_TO_CONFIG_FILE] --load-from [PATH_TO_CHECKPOINT_FILE]

Pre-training + fine-tuning

We pre-train SeqTR on 8 V100 GPUs, with Large Scale Jittering (LSJ) and Exponential Moving Average (EMA) disabled:

bash tools/dist_train.sh configs/seqtr/detection/seqtr_det_pretraining-vg.py 8

Models

                                    RefCOCO              RefCOCO+             RefCOCOg
                                    val    testA  testB  val    testA  testB  val-g  val-u  test-u
SeqTR on REC                        81.23  85.00  76.08  68.82  75.37  58.78  -      71.35  71.58
SeqTR* on REC                       83.72  86.51  81.24  71.45  76.26  64.88  71.50  74.86  74.21
SeqTR pre-trained+finetuned on REC  87.00  90.15  83.59  78.69  84.51  71.87  -      82.69  83.37
SeqTR on RES                        67.26  69.79  64.12  54.14  58.93  48.19  -      55.67  55.64
SeqTR* denotes that the visual encoder is initialized with yolov3.weights, while the visual encoders of the other models are initialized with darknet.weights.

Contributing

Our code is highly modular and can easily be extended to new architectures. For instance, you can register new components such as heads or fusion modules to explore your research ideas, or register new data augmentation techniques, just as in the mmdetection library. Feel free to play :-).
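
As a rough illustration of the registry pattern (following mmcv/mmdetection; the import path seqtr.models and the registry name HEADS are assumptions rather than the verified SeqTR API), a new head might be registered like this:

import torch.nn as nn

from seqtr.models import HEADS  # hypothetical import path for the HEADS registry

@HEADS.register_module()
class MyGroundingHead(nn.Module):
    """Toy head mapping fused features to per-step bin logits."""

    def __init__(self, in_channels=256, num_bin=1000):
        super().__init__()
        # one extra logit for the end-of-sequence token
        self.classifier = nn.Linear(in_channels, num_bin + 1)

    def forward(self, x):
        return self.classifier(x)

The new component could then be selected from a config with model = dict(head=dict(type='MyGroundingHead', ...)), assuming the usual mmcv-style builders.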

Citation

@article{zhu2022seqtr,
  title={SeqTR: A Simple yet Universal Network for Visual Grounding},
  author={Zhu, ChaoYang and Zhou, YiYi and Shen, YunHang and Luo, Gen and Pan, XingJia and Lin, MingBao and Chen, Chao and Cao, LiuJuan and Sun, XiaoShuai and Ji, RongRong},
  journal={arXiv preprint arXiv:2203.16265},
  year={2022}
}

Acknowledgement

Our code is built upon the open-sourced mmcv and mmdetection libraries.

Comments
  • size mismatch for head.transformer.seq_positional_encoding.embedding.weight:

    Dear Author, I am trying to use the model for RefCOCOg (pre-trained + fine-tuned SeqTR segmentation), test it on the RefCOCO dataset, and visualize the results.

    The code I run is "python tools/inference.py /home/chch3470/SeqTR/configs/seqtr/segmentation/seqtr_segm_refcoco-unc.py "/home/chch3470/SeqTR/work_dir/segm_best.pth" --output-dir="/home/chch3470/SeqTR/attention_map_output" --with-gt --which-set="testA" "

    I get the error below. Do you have any idea why it happens? Is RefCOCOg (pre-trained + fine-tuned SeqTR segmentation) based on yolo or darknet? If it is based on yolo, which configs should we use? Also, should we change the vis_encs (currently the codebase only provides darknet.py for vis_encs)?

    I can visualize the provided models for detection tasks so I guess I know the basic setups...

    RuntimeError: Error(s) in loading state_dict for SeqTR: size mismatch for lan_enc.embedding.weight: copying a param with shape torch.Size([12692, 300]) from checkpoint, the shape in current model is torch.Size([10344, 300]). size mismatch for head.transformer.seq_positional_encoding.embedding.weight: copying a param with shape torch.Size([25, 256]) from checkpoint, the shape in current model is torch.Size([37, 256]).

    opened by CCYChongyanChen 5
  • Version for packages?

    Dear author, Could you please kindly share your versions for each of the following packages? torch, torchvision, mmdet, and mmcv-full

    Thank you so much!

    opened by CCYChongyanChen 3
  • Meeting a bug in "./seqtr/api/train.py", line 94, the accuracy function

    Thanks to the author for providing clear code. When the model is trained with "python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True", the accuracy function only receives 3 return values, and this causes the training to fail. According to the posted code, "batch_ie" isn't a significant parameter and seems to be a leftover, so I deleted the code related to "batch_iz" in "./seqtr/api/train.py" and it works well. Could the author give a description of "batch_ie"? It would be nice if the author provided trained weights for the model. Thank you!

    good first issue 
    opened by zlj63501 3
  • Customized dataset?

    Hi, thanks for the awesome work. Could I ask how could we obtain the token_to_ix.pkl, ix_to_token.pkl, and the word_emb.npz to generate customized dataset? Thank you so much!

    opened by CCYChongyanChen 2
  • Multi-task configuration files

    Hello author, I am studying your code and currently have two questions.

    1. For multi-task training, are detection and segmentation trained jointly, or do they need to be trained separately?

    2. The multi-task config files, e.g. configs/seqtr/multi-task/seqtr_multi-task_refcocog-google.py, require the config file '../../base/datasets/multi-task/refcocog-google.py', which is not provided in this project. Will you release this part of the configuration, or could you tell me how to modify the configs myself?

    opened by Azong-HQU 2
  • Boundary value issue with seq_in

    Hello author, I have a question: the line seq_in[seq_in != self.end].clamp_(min=0, max=self.end-1) clamps the top-left and bottom-right coordinates of the target bbox, but only for elements where seq_in != self.num_bin (e.g. self.end=1000). What should happen when an element of seq_in is exactly equal to self.end? For example, with seq_in = [806, 59, 1000, 233] and self.end=1000, the value 1000 is filtered out by the code above and left unclamped. Doesn't this conflict with the target label [X1, Y1, X2, Y2, 1000], and how should it be resolved?

    I would appreciate it if you could answer when you have time. Many thanks!

    opened by Azong-HQU 2
  • Memory and BatchSize

    Hi, thanks for the wonderful work. I am curious about why SeqTR is so memory-efficient. As shown in the config file, SeqTR is trained with a batch size of 128 on a single 32GB GPU! However, for object detectors like DETR, the batch size on each GPU is quite limited. Could you please give some insights about this? Thanks in advance.

    opened by MasterBin-IIAU 2
  • mixed datasets

    Hi, thanks for the awesome work. Datasets and most annotations can be normally downloaded following the README. But I did not find mixed in the provided google drive link. Have I missed something? Thanks in advance.

    opened by MasterBin-IIAU 2
  • Visualization

    Hi,

    Congratulations!

    I want to visualize the attention weights of segmentation points similar to Fig. 5.

    According to the paper: "We visualize the cross attention map averaged over decoder layers and attention heads in Fig. 5.", but I am not sure how to incorporate these weights into the original image.

    Would you like to share the script or provide a workable idea?

    Thanks~

    opened by zlj63501 2
  • Source of tokenizer files?

    Thanks for your great work! I am new to VG and want to know the source of several files (ix_to_token.pkl, token_to_ix.pkl and word_emb.npz) under work_dir/data/annotations/dataset_name/. Did you define these vocabs and embeddings yourself, or are they taken from other works? Thanks again!

    opened by DLUT-yyc 1
  • multi-task

    Hi, here is a question about multi-task: I get KeyError: "RefCOCOgUMD: 'GenerateMaskVertices is not in the PIPELINES registry'". Thank you very much for your project, and I look forward to more code and configuration for multi-task.

    opened by maxLWS 1
  • ImportError: cannot import name 'imshow_expr_bbox' from 'seqtr.core' (...../SeqTR/seqtr/core/__init__.py)

    Hi! The following two functions, imshow_expr_bbox and imshow_expr_mask, are called in seqtr/apis/inference.py https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/inference.py#L6 but I can't find them in seqtr.core. Am I missing anything?

    Thanks so much for your help!

    opened by zdxdsw 1
  • Setting "is_crowd = 1" for multiple masks/polygons resulting in inaccurate evaluation?

    Hi, thanks for sharing the great work. I have a question about the is_crowd flag. Why do you need to set it to 1 for multiple masks/polygons when loading the data? https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/datasets/pipelines/loading.py#L126

    It looks like if is_crowd=1, the IoU computation from pycocotools uses a modified criterion in which the denominator is the area of pred_mask alone rather than the union of gt_mask and pred_mask, resulting in a higher number than the standard IoU definition.

    https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/test.py#L19

    (See the note in pycocotool https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/mask.py#L65)

    Do I understand this correctly? Thanks for your help!

    opened by leookami 1
  • Errors in finetuning

    After completing pre-training, I fine-tuned on refcoco-unc and got the following error: File "SeqTR/seqtr/utils/checkpoint.py", line 57, in load_pretrained_checkpoint, state, ema_state = ckpt['state_dict'], ckpt['ema_state_dict'], KeyError: 'ema_state_dict'. Even after fixing this bug, I still found many bugs (e.g. lan_enc.embedding.weight, model.head) in load_pretrained_checkpoint(). Can you please check it?

    opened by pqviet 7