Code for "Learning the Best Pooling Strategy for Visual Semantic Embedding", CVPR 2021

Jiacheng Chen

Last update: Jan 6, 2023

Related tags

Deep Learning pytorch vse visual-semantic vision-language cross-modal-retrieval image-text-matching

Overview

Learning the Best Pooling Strategy for Visual Semantic Embedding

Official PyTorch implementation of the paper Learning the Best Pooling Strategy for Visual Semantic Embedding (CVPR 2021 Oral).

Please use the following bib entry to cite this paper if you are using any resources from the repo.

@inproceedings{chen2021vseinfty,
     title={Learning the Best Pooling Strategy for Visual Semantic Embedding},
     author={Chen, Jiacheng and Hu, Hexiang and Wu, Hao and Jiang, Yuning and Wang, Changhu},
     booktitle={IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
     year={2021}
}

We referred to the implementations of VSE++ and SCAN to build up our codebase.

Introduction

Illustration of the standard Visual Semantic Embedding (VSE) framework with the proposed pooling-based aggregator, the Generalized Pooling Operator (GPO). GPO is simple and effective: it automatically adapts to the appropriate pooling strategy for each data modality and feature extractor, and it improves VSE models at negligible extra computation cost.
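To make the idea of a generalized pooling operator concrete, here is a minimal sketch in plain Python. It assumes the pooled value for each feature dimension is a weighted sum over that dimension's values sorted in descending order. The function name and the hand-picked weights are illustrative only; in the paper the weights are learned, and the repository's implementation is written in PyTorch.

```python
from typing import List, Sequence


def generalized_pooling(features: Sequence[Sequence[float]],
                        weights: Sequence[float]) -> List[float]:
    """Pool N feature vectors of size D into a single vector of size D.

    For every dimension, the N values are sorted in descending order and
    combined with one weight per sorted position. (Illustrative sketch;
    the weights are passed in by hand here instead of being learned.)
    """
    if len(features) != len(weights):
        raise ValueError("need exactly one weight per feature vector")
    dims = len(features[0])
    pooled = []
    for d in range(dims):
        # Sort the d-th dimension across all feature vectors, largest first.
        column = sorted((vec[d] for vec in features), reverse=True)
        pooled.append(sum(w * v for w, v in zip(weights, column)))
    return pooled


# Three hypothetical 2-d region features.
region_features = [[1.0, 4.0], [3.0, 2.0], [2.0, 0.0]]

# All weight on the first sorted position behaves like max pooling.
max_pooled = generalized_pooling(region_features, [1.0, 0.0, 0.0])

# Uniform weights behave like average pooling.
avg_pooled = generalized_pooling(region_features, [1 / 3, 1 / 3, 1 / 3])

print(max_pooled, avg_pooled)
```

Max and average pooling are two special cases of the same weighted sum, which is why a learned set of weights can move between them, or settle somewhere in between, depending on the features.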

Image-text Matching Results

The following tables show partial results of image-text retrieval on the COCO and Flickr30K datasets. In these experiments, we use BERT-base as the text encoder for our methods. This branch provides our code and pre-trained models that use BERT as the text backbone. Please check out the bigru branch for the code and pre-trained models that use BiGRU as the text backbone.

Note that the VSE++ entries in the following tables are VSE++ models with the specified feature backbones, so the results differ from those in the original VSE++ paper.
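The R1 and R5 columns in the tables that follow are Recall@K values: the percentage of queries whose ground-truth match appears among the top K retrieved items. The sketch below shows that computation on a toy similarity matrix. It assumes one ground-truth item per query to keep it short, and the function and variable names are hypothetical, not taken from the repository's evaluation code.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                gt_index: Sequence[int],
                k: int) -> float:
    """Percentage of queries whose ground-truth item ranks in the top k.

    similarity[q][j] is the score between query q and candidate j.
    Simplifying assumption: every query has exactly one correct candidate.
    """
    hits = 0
    for query, scores in enumerate(similarity):
        # Candidate indices ordered from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if gt_index[query] in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for 3 queries against 3 candidates (made-up numbers).
sims = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
gt = [0, 1, 2]  # query i matches candidate i

r1 = recall_at_k(sims, gt, 1)
r2 = recall_at_k(sims, gt, 2)
print(round(r1, 1), round(r2, 1))
```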

Results of 5-fold evaluation on COCO 1K Test Split

	Visual Backbone	Text Backbone	I2T R1	I2T R5	T2I R1	T2I R5	Link
VSE++	BUTD region	BERT-base	67.9	91.9	54.0	85.6	-
VSEInfty	BUTD region	BERT-base	79.7	96.4	64.8	91.4	Here
VSEInfty	BUTD grid	BERT-base	80.4	96.8	66.4	92.1	Here
VSEInfty	WSL grid	BERT-base	84.5	98.1	72.0	93.9	Here

Results on Flickr30K Test Split

	Visual Backbone	Text Backbone	I2T R1	I2T R5	T2I R1	T2I R5	Link
VSE++	BUTD region	BERT-base	63.4	87.2	45.6	76.4	-
VSEInfty	BUTD region	BERT-base	81.7	95.4	61.4	85.9	Here
VSEInfty	BUTD grid	BERT-base	81.5	97.1	63.7	88.3	Here
VSEInfty	WSL grid	BERT-base	88.4	98.3	74.2	93.7	Here

Results (in R@1) on the Crisscrossed Caption benchmark (trained on COCO)

	Visual Backbone	Text Backbone	I2T	T2I	T2T	I2I
VSRN	BUTD region	BiGRU	52.4	40.1	41.0	44.2
DE	EfficientNet-B4 grid	BERT-base	55.9	41.7	42.6	38.5
VSEInfty	BUTD grid	BERT-base	60.6	46.2	45.9	44.4
VSEInfty	WSL grid	BERT-base	67.9	53.6	46.7	51.3

Preparation

Environment

We trained and evaluated our models with the following key dependencies:

Python 3.7.3
PyTorch 1.2.0
Transformers 2.1.0

Run pip install -r requirements.txt to install exactly the same dependencies as in our experiments. We also verified that PyTorch 1.8.0 and Transformers 4.4.2 produce similar results.

Data

We organize all data used in the experiments in the following manner:

data
├── coco
│   ├── precomp  # pre-computed BUTD region features for COCO, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── images   # raw coco images
│   │      ├── train2014
│   │      └── val2014
│   │
│   ├── cxc_annots # annotations for evaluating COCO-trained models on the CxC benchmark
│   │
│   └── id_mapping.json  # mapping from coco-id to image's file name
│   
│
├── f30k
│   ├── precomp  # pre-computed BUTD region features for Flickr30K, provided by SCAN
│   │      ├── train_ids.txt
│   │      ├── train_caps.txt
│   │      ├── ......
│   │
│   ├── flickr30k-images   # raw Flickr30K images
│   │      ├── xxx.jpg
│   │      └── ...
│   └── id_mapping.json  # mapping from f30k index to image's file name
│   
├── weights
│      └── original_updown_backbone.pth # the BUTD CNN weights
│
└── vocab  # vocab files provided by SCAN (only used when the text backbone is BiGRU)

The download links for the original COCO/F30K images, precomputed BUTD features, and corresponding vocabularies are from the official repo of SCAN. The precomp folders contain pre-computed BUTD region features, data/coco/images contains raw MS-COCO images, and data/f30k/flickr30k-images contains raw Flickr30K images.

The id_mapping.json files map each image index (i.e., the COCO id for COCO images) to the corresponding filename. We generated these mappings to eliminate the need for the pycocotools package.
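As an illustration of how such a mapping can replace pycocotools lookups, here is a minimal sketch. It assumes the JSON file is a flat object from image id to a file name relative to the images folder. That schema, the helper names, and the example file name are all assumptions made for this sketch, so check the actual file before relying on it.

```python
import json
import os
import tempfile


def load_id_mapping(path: str) -> dict:
    """Load an id -> file name mapping (assumed to be a flat JSON object)."""
    with open(path, "r", encoding="utf-8") as f:
        raw = json.load(f)
    # JSON object keys are always strings, so normalize lookups to str.
    return {str(key): value for key, value in raw.items()}


def image_path(images_root: str, mapping: dict, image_id) -> str:
    """Resolve an image id to a path under the raw-images directory."""
    return os.path.join(images_root, mapping[str(image_id)])


# Self-contained demo with a made-up mapping file.
with tempfile.TemporaryDirectory() as tmp:
    mapping_file = os.path.join(tmp, "id_mapping.json")
    with open(mapping_file, "w", encoding="utf-8") as f:
        json.dump({"42": "val2014/example_42.jpg"}, f)  # hypothetical entry

    mapping = load_id_mapping(mapping_file)
    resolved = image_path("/tmp/data/coco/images", mapping, 42)

print(resolved)
```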

weights/original_updown_backbone.pth contains the pre-trained ResNet-101 weights from the Bottom-up Attention Model; we converted the original Caffe weights to PyTorch. Please download it from this link.

The data/coco/cxc_annots directory contains the data files needed to run the Crisscrossed Caption (CxC) evaluation. Since there is no official evaluation protocol in the CxC repo, we processed their raw data files and generated these data files to implement our own evaluation. We verified our implementation by checking that our evaluation results for the official VSRN model match the ones reported in the CxC paper. Please download the data files at this link.

Please download all necessary data files and organize them in the above manner. The path to the data directory is passed as an argument to the training script, as shown below.
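Before launching training, it can help to confirm that the data root matches the layout described above. The following is a small sketch of such a check. The list of expected entries is copied from the directory tree in this section, while the helper name and the temporary-directory demo are illustrative and not part of the repository.

```python
import tempfile
from pathlib import Path

# Entries taken from the directory tree above, relative to the data root.
EXPECTED = [
    "coco/precomp",
    "coco/images/train2014",
    "coco/images/val2014",
    "coco/cxc_annots",
    "coco/id_mapping.json",
    "f30k/precomp",
    "f30k/flickr30k-images",
    "f30k/id_mapping.json",
    "weights/original_updown_backbone.pth",
    "vocab",  # only used when the text backbone is BiGRU
]


def missing_entries(data_root: str) -> list:
    """Return the expected files or directories that do not exist yet."""
    root = Path(data_root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


# Demo on a throwaway directory that contains only coco/precomp.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "coco" / "precomp").mkdir(parents=True)
    missing = missing_entries(tmp)

for rel in missing:
    print("missing:", rel)
```

Pointing missing_entries at the real data root (for example /tmp/data) should return an empty list once everything is downloaded and organized.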

Training

Assuming the data root is /tmp/data, we provide example training scripts for:

Grid feature with BUTD CNN for the image feature, BERT-base for the text feature. See train_grid.sh
BUTD Region feature for the image feature, BERT-base for the text feature. See train_region.sh

To use other CNN initializations for the grid image feature, change the --backbone_source argument to one of the following values:

(1) detector (the default) uses the BUTD ResNet-101; we have converted the original Caffe weights to PyTorch and provided the download link above;
(2) wsl uses the backbones from large-scale weakly supervised learning;
(3) imagenet_res152 uses the ResNet-152 pre-trained on ImageNet.
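The three values can be pictured as a simple choice argument. The sketch below is a stand-alone illustration built with argparse. It is not the repository's training script, and the parser, dictionary, and descriptions are assumptions that only mirror the option names listed above.

```python
import argparse

# Option values from the list above, with short descriptions.
BACKBONE_SOURCES = {
    "detector": "BUTD ResNet-101 (converted from the original Caffe weights)",
    "wsl": "backbone from large-scale weakly supervised learning",
    "imagenet_res152": "ResNet-152 pre-trained on ImageNet",
}


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser; the real training script may define more options."""
    parser = argparse.ArgumentParser(description="backbone selection sketch")
    parser.add_argument(
        "--backbone_source",
        choices=sorted(BACKBONE_SOURCES),
        default="detector",
        help="initialization used for the grid-feature CNN",
    )
    return parser


# Parse an explicit choice and the default, without touching sys.argv.
args = build_parser().parse_args(["--backbone_source", "wsl"])
default_args = build_parser().parse_args([])

print(args.backbone_source, "->", BACKBONE_SOURCES[args.backbone_source])
print(default_args.backbone_source, "->", BACKBONE_SOURCES[default_args.backbone_source])
```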

Evaluation

Run eval.py to evaluate specified models on either COCO or Flickr30K. To evaluate pre-trained models on COCO, use the following command (assuming there are 4 GPUs and the local data path is /tmp/data):

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco

To evaluate pre-trained models on Flickr30K, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset f30k --data_path /tmp/data/f30k

For evaluating pre-trained COCO models on the CxC dataset, use the command:

CUDA_VISIBLE_DEVICES=0,1,2,3 python3 eval.py --dataset coco --data_path /tmp/data/coco --evaluate_cxc

To evaluate a two-model ensemble, first run the single-model evaluation commands above with the --save_results argument, then use eval_ensemble.py to get the results (you need to manually specify the paths to the saved results).
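One common way to ensemble two retrieval models is to average their image-text similarity matrices and then rank candidates on the averaged scores. The sketch below illustrates that idea on toy numbers. Whether eval_ensemble.py combines the saved results in exactly this way is an assumption, and the function name and values are hypothetical.

```python
from typing import List, Sequence


def average_similarities(sims_a: Sequence[Sequence[float]],
                         sims_b: Sequence[Sequence[float]]) -> List[List[float]]:
    """Element-wise mean of two similarity matrices of identical shape."""
    if len(sims_a) != len(sims_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(sims_a, sims_b)
    ):
        raise ValueError("similarity matrices must have the same shape")
    return [
        [(x + y) / 2.0 for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(sims_a, sims_b)
    ]


# Made-up scores for 2 queries x 2 candidates; query i should match candidate i.
model_a = [[0.9, 0.5], [0.6, 0.5]]  # model A ranks query 1 incorrectly
model_b = [[0.5, 0.7], [0.2, 0.7]]  # model B ranks query 0 incorrectly

ensemble = average_similarities(model_a, model_b)

# Index of the best-scoring candidate for each query after averaging.
top1 = [max(range(len(row)), key=lambda j: row[j]) for row in ensemble]
print(top1)
```

In this toy case each model gets one query wrong on its own, while the averaged scores rank both queries correctly.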

Comments

Can't reproduce the results with grid features on single GPU

Hi, congratulations on your paper being accepted.

I'm confused because I can't reproduce the results with the grid features (BUTD). Because of limited GPU memory, I use a single V100 to run the model and use torch.utils.checkpoint to save memory when fine-tuning the ResNet. But the results are not good and fall well short of the results in the paper. The results on f30k are as follows:

INFO:lib.evaluation:calculate similarity time: 0.683558464050293
INFO:lib.evaluation:rsum: 493.9
INFO:lib.evaluation:Average i2t Recall: 88.8
INFO:lib.evaluation:Image to text: 77.1 92.7 96.5 1.0 3.0
INFO:lib.evaluation:Average t2i Recall: 75.9
INFO:lib.evaluation:Text to image: 56.6 82.0 88.9 1.0 8.1

Where is the problem? Multiple GPUs? Or torch.utils.checkpoint?

opened by liuyyy111 4

Effect of finetuning CNN backbones

Hi Jiacheng @woodfrog and Hexiang @hexiang-hu,

Nice work on image-text retrieval! I am really amazed by the performance obtained by this simple model and am trying to dissect what factors lead to this success.

I noticed from the paper that the CNN backbone is fine-tuned along with the whole model when using grid features, but I could not find any discussion in the paper of the effect of such end-to-end fine-tuning. Could you provide some details on the performance gain obtained with this strategy?

Best, Jie

opened by jayleicn 4

Where can I find train_ids.txt, testall_caps.txt .. etc?

Hello, Thank you for sharing your nice work!

I wonder where I can find the text files in /data/coco/precomp/., such as train_ids.txt, train_caps.txt, testall_caps.txt .. and so on. I would really appreciate your help! Thanks!!

opened by pseulki 2

The order for fc layer and pooling operation

Thanks for your great work.

I notice that the paper applies the pooling operation after the fc layer (which maps the original feature dims to the embedding dims): https://github.com/woodfrog/vse_infty/blob/c9943b2327a568bb9d1a628bc53f74f21eb97c75/lib/encoders.py#L99-L104

Why not apply the pooling operation before the fc layer? I think it could reduce computation. Or would it hurt performance? Have you tried it?

opened by darkpromise98 2

Question about dataloader and lengths input

Hi, I have a question about your dataloader for GPO's input "lengths". The TextEncoder's GPO module needs the captions' lengths, but the dataloader returns the "index". Is there a problem in your code, or have I misunderstood something? https://github.com/woodfrog/vse_infty/blob/c9943b2327a568bb9d1a628bc53f74f21eb97c75/lib/datasets/image_caption.py#L95 https://github.com/woodfrog/vse_infty/blob/c9943b2327a568bb9d1a628bc53f74f21eb97c75/lib/encoders.py#L175

opened by Cloveryww 2

About download

Really great work! But I have a question: I cannot download original_updowmn_backbone.pth. Could you provide another download link for it? Thank you very much!

opened by YangYL18 0

Owner

Jiacheng Chen
