SeqTR: A Simple yet Universal Network for Visual Grounding

seanZhuh

Last update: Dec 24, 2022

Overview

SeqTR

This is the official implementation of SeqTR: A Simple yet Universal Network for Visual Grounding, which simplifies and unifies the modelling for visual grounding tasks under a novel point prediction paradigm.

Installation

Prerequisites

pip install -r requirements.txt
wget https://github.com/explosion/spacy-models/releases/download/en_vectors_web_lg-2.1.0/en_vectors_web_lg-2.1.0.tar.gz -O en_vectors_web_lg-2.1.0.tar.gz
pip install en_vectors_web_lg-2.1.0.tar.gz

Then install the SeqTR package in editable mode:

pip install -e .

Data Preparation

Download our preprocessed json files, including the merged dataset for pre-training, and the DarkNet-53 model weights trained on the MS-COCO object detection task.
Download the train2014 images from mscoco, or from Joseph Redmon's mscoco mirror, which downloads faster than the official website.
Download the original Flickr30K images, ReferItGame images, and Visual Genome images.

The project structure should look like the following:

| -- SeqTR
     | -- data
        | -- annotations
            | -- flickr30k
                | -- instances.json
                | -- ix_to_token.pkl
                | -- token_to_ix.pkl
                | -- word_emb.npz
            | -- referitgame-berkeley
            | -- refcoco-unc
            | -- refcocoplus-unc
            | -- refcocog-umd
            | -- refcocog-google
            | -- pretraining-vg 
        | -- weights
            | -- darknet.weights
            | -- yolov3.weights
        | -- images
            | -- mscoco
                | -- train2014
                    | -- COCO_train2014_000000000072.jpg
                    | -- ...
            | -- saiaprtc12
                | -- 25.jpg
                | -- ...
            | -- flickr30k
                | -- 36979.jpg
                | -- ...
            | -- visual-genome
                | -- 2412112.jpg
                | -- ...
     | -- configs
     | -- seqtr
     | -- tools
     | -- teaser

Note that darknet.weights was trained with the val/test images of the RefCOCO/+/g datasets excluded, while yolov3.weights was not.
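Before launching a run, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the SeqTR codebase: the helper name find_missing and the list EXPECTED_PATHS are hypothetical, and only paths that appear in the tree above are checked.

```python
from pathlib import Path

# Paths copied from the directory tree above, relative to the SeqTR root.
# The list is illustrative; extend it with the other annotation folders as needed.
EXPECTED_PATHS = [
    "data/annotations/flickr30k/instances.json",
    "data/annotations/flickr30k/ix_to_token.pkl",
    "data/annotations/flickr30k/token_to_ix.pkl",
    "data/annotations/flickr30k/word_emb.npz",
    "data/weights/darknet.weights",
    "data/weights/yolov3.weights",
    "data/images/mscoco/train2014",
    "data/images/saiaprtc12",
    "data/images/flickr30k",
    "data/images/visual-genome",
]


def find_missing(root, expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]
```

Calling find_missing(".") from the SeqTR root returns an empty list when every listed file and folder is present.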

Training

Phrase Localization and Referring Expression Comprehension

We train SeqTR to perform grounding at the bounding box level on a single V100 GPU. The following script performs the training:

python tools/train.py configs/seqtr/detection/seqtr_det_[DATASET_NAME].py --cfg-options ema=True

[DATASET_NAME] is one of "flickr30k", "referitgame-berkeley", "refcoco-unc", "refcocoplus-unc", "refcocog-umd", and "refcocog-google".
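To train on several datasets in turn, the command above can be assembled programmatically. This is a sketch under the assumption that the config files follow the seqtr_det_[DATASET_NAME].py naming shown above; build_train_command is a hypothetical helper, not part of the repository.

```python
# Dataset names as listed in the README.
DATASETS = (
    "flickr30k",
    "referitgame-berkeley",
    "refcoco-unc",
    "refcocoplus-unc",
    "refcocog-umd",
    "refcocog-google",
)


def build_train_command(dataset, ema=True):
    """Build the argument list for the detection training command."""
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset!r}")
    config = f"configs/seqtr/detection/seqtr_det_{dataset}.py"
    return ["python", "tools/train.py", config, "--cfg-options", f"ema={ema}"]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the SeqTR root directory.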

Referring Expression Segmentation

To train SeqTR to generate the point sequence of the ground-truth mask, which is then assembled into the predicted mask by connecting the points, run the following script:

python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True

Note that the sampling settings differ between datasets. For RefCOCO, we sample 18 points and do not shuffle the sequence. For RefCOCO+ and RefCOCOg, we uniformly sample 12 points on the mask contour and randomly shuffle the sequence with a shuffle fraction of 20%. Therefore, to train on the RefCOCO+/g datasets, set num_ray (line 1) to 12 and model.head.shuffle_fraction (line 35) to 0.2 in configs/seqtr/segmentation/seqtr_mask_darknet.py.
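The idea of uniformly sampling contour points and optionally shuffling them can be sketched as follows. This is an illustration only, not the repository's implementation: sample_contour and maybe_shuffle are hypothetical names, the contour is assumed to be a closed polygon given as (x, y) vertices, and shuffle_fraction is interpreted here as the probability that a sequence is shuffled, which is only one possible reading of the setting.

```python
import math
import random


def sample_contour(polygon, num_points):
    """Sample `num_points` points at equal arc-length spacing along a closed polygon."""
    n = len(polygon)
    if n == 0 or num_points <= 0:
        return []
    # Edge i connects polygon[i] to polygon[(i + 1) % n].
    lengths = [math.dist(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    step = sum(lengths) / num_points
    points = []
    edge, walked = 0, 0.0  # current edge index, arc length at the start of that edge
    for k in range(num_points):
        target = k * step
        # Advance to the edge that contains the target arc length.
        while edge < n - 1 and walked + lengths[edge] <= target:
            walked += lengths[edge]
            edge += 1
        (x0, y0), (x1, y1) = polygon[edge], polygon[(edge + 1) % n]
        t = (target - walked) / lengths[edge] if lengths[edge] > 0 else 0.0
        points.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return points


def maybe_shuffle(points, shuffle_fraction, rng=random):
    """With probability `shuffle_fraction`, return the points in random order."""
    points = list(points)
    if rng.random() < shuffle_fraction:
        rng.shuffle(points)
    return points
```

With 18 points the flattened sequence holds 36 coordinates, and with 12 points it holds 24, so changing the number of points also changes the sequence length the model has to handle.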

Evaluation

python tools/test.py [PATH_TO_CONFIG_FILE] --load-from [PATH_TO_CHECKPOINT_FILE]

Pre-training + fine-tuning

For pre-training, we train SeqTR on 8 V100 GPUs with Large Scale Jittering (LSJ) and Exponential Moving Average (EMA) disabled:

bash tools/dist_train.sh configs/seqtr/detection/seqtr_det_pretraining-vg.py 8

Models

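| Model | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val-g | RefCOCOg val-u | RefCOCOg test-u |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SeqTR on REC | 81.23 | 85.00 | 76.08 | 68.82 | 75.37 | 58.78 | - | 71.35 | 71.58 |
| SeqTR* on REC | 83.72 | 86.51 | 81.24 | 71.45 | 76.26 | 64.88 | 71.50 | 74.86 | 74.21 |
| SeqTR pre-trained+finetuned on REC | 87.00 | 90.15 | 83.59 | 78.69 | 84.51 | 71.87 | - | 82.69 | 83.37 |
| SeqTR on RES | 67.26 | 69.79 | 64.12 | 54.14 | 58.93 | 48.19 | - | 55.67 | 55.64 |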

SeqTR* denotes that its visual encoder is initialized with yolov3.weights, while the visual encoders of the other models are initialized with darknet.weights.

Contributing

Our code is highly modularized and easy to extend to new architectures. For instance, you can register new components such as a head or a fusion module to try out your research ideas, or register new data augmentation techniques, just as in the mmdetection library. Feel free to play :-).

Citation

@article{zhu2022seqtr,
  title={SeqTR: A Simple yet Universal Network for Visual Grounding},
  author={Zhu, ChaoYang and Zhou, YiYi and Shen, YunHang and Luo, Gen and Pan, XingJia and Lin, MingBao and Chen, Chao and Cao, LiuJuan and Sun, XiaoShuai and Ji, RongRong},
  journal={arXiv preprint arXiv:2203.16265},
  year={2022}
}

Acknowledgement

Our code is built upon the open-sourced mmcv and mmdetection libraries.

Comments

size mismatch for head.transformer.seq_positional_encoding.embedding.weight:

Dear Author, I am trying to use the model for Refcocog (pre-trained + fine-tuned SeqTR segmentation) and test it on Refcoco dataset and visualize the results.

The code I run is "python tools/inference.py /home/chch3470/SeqTR/configs/seqtr/segmentation/seqtr_segm_refcoco-unc.py "/home/chch3470/SeqTR/work_dir/segm_best.pth" --output-dir="/home/chch3470/SeqTR/attention_map_output" --with-gt --which-set="testA" "

I get the error below. Do you have any idea why it happens? Is the Refcocog model (pre-trained + fine-tuned SeqTR segmentation) based on yolo or darknet? If it is based on yolo, which configs should we use? Also, should we change the vis_encs (currently the codebase only provides darknet.py for vis_encs)?

I can visualize the provided models for detection tasks so I guess I know the basic setups...

RuntimeError: Error(s) in loading state_dict for SeqTR:
    size mismatch for lan_enc.embedding.weight: copying a param with shape torch.Size([12692, 300]) from checkpoint, the shape in current model is torch.Size([10344, 300]).
    size mismatch for head.transformer.seq_positional_encoding.embedding.weight: copying a param with shape torch.Size([25, 256]) from checkpoint, the shape in current model is torch.Size([37, 256]).
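A quick way to narrow down errors like the one above is to list every parameter whose shape differs between the checkpoint and the freshly built model. The sketch below works on plain name-to-shape dictionaries so that it stays self-contained; shape_mismatches is a hypothetical helper, and with PyTorch the dictionaries could be built with something like {k: tuple(v.shape) for k, v in state_dict.items()}.

```python
def shape_mismatches(ckpt_shapes, model_shapes):
    """Return {name: (checkpoint_shape, model_shape)} for parameters that differ."""
    return {
        name: (tuple(ckpt_shapes[name]), tuple(model_shapes[name]))
        for name in ckpt_shapes
        if name in model_shapes
        and tuple(ckpt_shapes[name]) != tuple(model_shapes[name])
    }


# Shapes taken from the error message reported above.
checkpoint = {
    "lan_enc.embedding.weight": (12692, 300),
    "head.transformer.seq_positional_encoding.embedding.weight": (25, 256),
}
current_model = {
    "lan_enc.embedding.weight": (10344, 300),
    "head.transformer.seq_positional_encoding.embedding.weight": (37, 256),
}

for name, (ckpt, model) in shape_mismatches(checkpoint, current_model).items():
    print(f"{name}: checkpoint {ckpt} vs model {model}")
```

One possible reading, offered as an assumption rather than a confirmed diagnosis: the differing first dimension of lan_enc.embedding.weight suggests the checkpoint and the config were built with different vocabularies, and 25 versus 37 positions would be consistent with 12 versus 18 sampled points (2N + 1 in both cases), that is, a checkpoint trained with one dataset's settings being loaded under another dataset's config.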

opened by CCYChongyanChen 5
Version for packages?

Dear author, Could you please kindly share your versions for each of the following packages? torch, torchvision, mmdet, and mmcv-full

Thank you so much!

opened by CCYChongyanChen 3
Meeting a bug in "./seqtr/api/train.py" , 94 line, the accuracy function.

Thanks to the author for providing clear code. When the model is trained with "python tools/train.py configs/seqtr/segmentation/seqtr_mask_[DATASET_NAME].py --cfg-options ema=True", the accuracy function only receives 3 return values, which causes the training to fail. Judging from the code that follows, "batch_ie" is not a significant parameter. It seems to be a leftover, so I deleted the code related to "batch_iz" in "./seqtr/api/train.py", after which training works well. Could the author describe what "batch_ie" is for? It would also be nice if the author provided trained weights for the model. Thank you!

opened by zlj63501 3
Customized dataset?

Hi, thanks for the awesome work. Could I ask how could we obtain the token_to_ix.pkl, ix_to_token.pkl, and the word_emb.npz to generate customized dataset? Thank you so much!
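For a custom dataset, vocabulary files of this kind are commonly built by tokenizing all expressions and assigning each token an index. The sketch below shows that general idea only. The actual format the repository expects in token_to_ix.pkl and ix_to_token.pkl is not documented here, so the special tokens, the whitespace tokenization, and the function name build_vocab are all assumptions.

```python
import pickle
from collections import Counter


def build_vocab(expressions, specials=("PAD", "UNK")):
    """Build token<->index mappings from a list of referring expressions.

    Assumption: lowercase whitespace tokenization and two special tokens.
    """
    counts = Counter(tok for expr in expressions for tok in expr.lower().split())
    tokens = list(specials) + sorted(counts)
    token_to_ix = {tok: ix for ix, tok in enumerate(tokens)}
    ix_to_token = {ix: tok for tok, ix in token_to_ix.items()}
    return token_to_ix, ix_to_token


def save_vocab(token_to_ix, ix_to_token, prefix=""):
    """Pickle both mappings using the file names from the README."""
    with open(prefix + "token_to_ix.pkl", "wb") as f:
        pickle.dump(token_to_ix, f)
    with open(prefix + "ix_to_token.pkl", "wb") as f:
        pickle.dump(ix_to_token, f)
```

The word_emb.npz file would additionally need one embedding vector per token, for example looked up from the en_vectors_web_lg vectors installed in the prerequisites. That step is omitted here because the expected array layout is not specified.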

opened by CCYChongyanChen 2
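Config files for multi-task

Hello, I am studying your code and currently have two questions.

1. Does multi-task training train the detection and segmentation tasks jointly, or do they need to be trained separately?

2. The multi-task config files, for example configs/seqtr/multi-task/seqtr_multi-task_refcocog-google.py, depend on the config file '../../base/datasets/multi-task/refcocog-google.py', which is not included in this project. Will you release this part of the configuration? Or could you tell me how to modify the config?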

opened by Azong-HQU 2
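Boundary values of seq_in

Hello, I have a question. The line seq_in[seq_in != self.end].clamp_(min=0, max=self.end-1) constrains the top-left and bottom-right coordinates of the target bbox to a minimum and maximum value, on the condition that seq_in != self.num_bin (e.g. self.end=1000). What happens when seq_in happens to equal self.end? For example, with seq_in = [806, 59, 1000, 233] and self.end=1000, the value 1000 is filtered out when the line above runs and is not clamped. Doesn't this also conflict with the target labels [X1, Y1, X2, Y2, 1000]? How should this be resolved?

I would be grateful if you could answer when you have time. Thank you very much!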
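One way to avoid the collision raised in this question is to clamp every quantized coordinate to the range [0, num_bins - 1] before the end token is appended, so that the value num_bins is reserved for the end token alone. The sketch below illustrates that ordering in plain Python. It is not the repository's code: quantize and build_target are hypothetical names and the image size is made up.

```python
def quantize(coord, size, num_bins):
    """Map a coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(coord / size * num_bins)
    # A coordinate on the far image border would land on bin `num_bins`,
    # which is reserved for the end token, so clamp it to the last bin.
    return max(0, min(bin_index, num_bins - 1))


def build_target(box, width, height, num_bins):
    """Quantize an (x1, y1, x2, y2) box and append the end token."""
    x1, y1, x2, y2 = box
    end_token = num_bins
    return [
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
        quantize(x2, width, num_bins),
        quantize(y2, height, num_bins),
        end_token,
    ]


# A box whose right edge lies exactly on the image border (640 x 480 image).
target = build_target((516, 28.5, 640, 112), width=640, height=480, num_bins=1000)
print(target)
```

Separately, indexing a PyTorch tensor with a boolean mask returns a copy, so calling an in-place clamp_ on the indexed result may leave seq_in itself unchanged. Clamping the coordinates before the end token is added sidesteps both concerns.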

opened by Azong-HQU 2
Memory and BatchSize

Hi, thanks for the wonderful work. I am curious about why SeqTR is so memory-efficient. As shown in the config file, SeqTR is trained with a batch size of 128 on a single 32GB GPU! However, for object detectors like DETR, the batch size on each GPU is quite limited. Could you please give some insights about this? Thanks in advance.

opened by MasterBin-IIAU 2
mixed datasets

Hi, thanks for the awesome work. Datasets and most annotations can be normally downloaded following the README. But I did not find mixed in the provided google drive link. Have I missed something? Thanks in advance.

opened by MasterBin-IIAU 2
Visualization

Hi,

Congratulations!

I want to visualize the attention weights of segmentation points similar to Fig. 5.

According to the paper: "We visualize the cross attention map averaged over decoder layers and attention heads in Fig. 5.", but I am not sure how to overlay these weights on the original image.

Could you share the script or suggest a workable approach?

Thanks~

opened by zlj63501 2
Source of tokenizer files?

Thanks for your great work! I am new to VG and want to know the source of several files (ix_to_token.pkl, token_to_ix.pkl and word_emb.npz) under work_dir/data/annotations/dataset_name/. Did you define these vocabularies and embeddings yourself, or take them from other works? Thanks again!

opened by DLUT-yyc 1
multi-task

Hi, I have a question about multi-task training. I get the following error:

KeyError: "RefCOCOgUMD: 'GenerateMaskVertices is not in the PIPELINES registry'"

Thank you very much for your project. I look forward to more code and configuration for multi-task.

opened by maxLWS 1
ImportError: cannot import name 'imshow_expr_bbox' from 'seqtr.core' (...../SeqTR/seqtr/core/__init__.py)

Hi! The following two functions, imshow_expr_bbox and imshow_expr_mask, are imported in seqtr/apis/inference.py: https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/inference.py#L6 But I can't find them in seqtr.core. Am I missing anything?

Thanks so much for your help!

opened by zdxdsw 1
setting "is_crowd = 1" for multiple masks/ polygons resulting in inaccurate evaluation?

Hi, thanks for sharing the great work. I have a question about the is_crowd flag. Why do you need to set it to 1 for multiple masks/ polygons when loading the data? https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/datasets/pipelines/loading.py#L126

It looks like when is_crowd=1, the IoU computation in pycocotools uses a modified criterion that takes the area of pred_mask alone as the denominator instead of the union of gt_mask and pred_mask, resulting in a higher number than the standard IoU definition.

https://github.com/sean-zhuh/SeqTR/blob/36f74bb9da4bcf81775f9f3bb3e54b170860c536/seqtr/apis/test.py#L19

(See the note in pycocotool https://github.com/cocodataset/cocoapi/blob/8c9bcc3cf640524c4c20a9c40e89cb6a2f2fa0e9/PythonAPI/pycocotools/mask.py#L65)

Do I understand this correctly? Thanks for your help!
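The difference described in this question can be illustrated with masks represented as sets of pixel coordinates. This is a self-contained sketch of the two criteria as documented in the pycocotools note linked above (for crowd regions, the intersection is divided by the area of the detection only). It does not call pycocotools, and mask_iou is a hypothetical helper.

```python
def mask_iou(gt, dt, iscrowd=False):
    """IoU between two masks given as sets of (x, y) pixels.

    With iscrowd=True the denominator is the detection area alone,
    mirroring the modified crowd criterion.
    """
    inter = len(gt & dt)
    denom = len(dt) if iscrowd else len(gt | dt)
    return inter / denom if denom else 0.0


# Ground truth: 4x4 square. Prediction: same size, shifted right by 2 pixels.
gt_mask = {(x, y) for x in range(4) for y in range(4)}
pred_mask = {(x, y) for x in range(2, 6) for y in range(4)}

standard = mask_iou(gt_mask, pred_mask)             # 8 / 24
crowd = mask_iou(gt_mask, pred_mask, iscrowd=True)  # 8 / 16
print(standard, crowd)
```

For a prediction that only partially overlaps the ground truth, the crowd variant can never be lower than the standard IoU, because the detection area is at most the area of the union.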

opened by leookami 1
Errors in finetuning

After completing pre-training, I fine-tuned on refcoco-unc and got the following error:

File "SeqTR/seqtr/utils/checkpoint.py", line 57, in load_pretrained_checkpoint
    state, ema_state = ckpt['state_dict'], ckpt['ema_state_dict']
KeyError: 'ema_state_dict'

Even after fixing this bug, I still found many other bugs (e.g. lan_enc.embedding.weight, model.head) in load_pretrained_checkpoint(). Can you please check it?
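Since pre-training is run with EMA disabled, a pre-trained checkpoint plausibly contains no 'ema_state_dict' entry, which would explain the KeyError above. A defensive way to read such a checkpoint is sketched below. This is an assumption about a possible workaround, not the repository's fix, and split_checkpoint is a hypothetical helper operating on an already loaded checkpoint dictionary.

```python
def split_checkpoint(ckpt):
    """Return (state_dict, ema_state_dict_or_None) from a loaded checkpoint dict."""
    state = ckpt["state_dict"]
    # Checkpoints saved without EMA may lack this key; fall back to None
    # instead of raising KeyError.
    ema_state = ckpt.get("ema_state_dict")
    return state, ema_state


# Toy checkpoint standing in for one saved with EMA disabled.
state, ema_state = split_checkpoint({"state_dict": {"w": 1}})
print(ema_state is None)
```

The caller can then decide whether to skip EMA initialization or to start the EMA weights from the regular state dict.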

opened by pqviet 7

Owner

seanZhuh

what/why then how

GitHub https://arxiv.org/abs/2203.16265

