Probabilistic Cross-Modal Embedding (PCME) CVPR 2021

NAVER AI

Last update: Dec 21, 2022

Official PyTorch implementation of PCME | Paper

Sanghyuk Chun¹ Seong Joon Oh¹ Rafael Sampaio de Rezende² Yannis Kalantidis² Diane Larlus²

¹ NAVER AI LAB
² NAVER LABS Europe

Updates

23 Jun, 2021: Initial upload.

Installation

Install dependencies using the following command.

pip install cython && pip install -r requirements.txt
python -c 'import nltk; nltk.download("punkt", download_dir="/opt/conda/nltk_data")'
git clone https://github.com/NVIDIA/apex && cd apex && pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

Dockerfile

You can also use my Docker image:

docker pull sanghyukchun/pcme:torch1.2-apex-dali

If you use this image, please add --model__cache_dir /vector_cache when you run the code.

Configuration

All experiments are based on configuration files (see config/coco and config/cub). If you want to change only a few options, you can override them on the command line instead of writing a new configuration file, as follows:

python .py --dataloader__batch_size 32 --dataloader__eval_batch_size 8 --model__eval_method matching_prob

See config/parser.py for details
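The double-underscore options appear to map onto nested configuration keys, so that --dataloader__batch_size 32 sets batch_size under dataloader. A minimal sketch of that idea follows, assuming the configuration is a plain nested dict. The function apply_overrides and the base values are hypothetical, and the actual logic in config/parser.py may differ.

```python
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of config with 'section__key' overrides applied as nested updates."""
    updated = deepcopy(config)
    for flat_key, value in overrides.items():
        # "dataloader__batch_size" -> sections ["dataloader"], leaf "batch_size"
        *sections, leaf = flat_key.split("__")
        node = updated
        for section in sections:
            node = node.setdefault(section, {})
        node[leaf] = value
    return updated


# Hypothetical base values, standing in for a loaded YAML file.
base = {
    "dataloader": {"batch_size": 128, "eval_batch_size": 32},
    "model": {"eval_method": "matmul"},
}
overridden = apply_overrides(base, {
    "dataloader__batch_size": 32,
    "dataloader__eval_batch_size": 8,
    "model__eval_method": "matching_prob",
})
print(overridden)
```

Only the keys named on the command line change; everything else keeps the value from the configuration file, and the original dict is left untouched.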

Dataset preparation

COCO Caption

We followed the same split provided by VSE++. Dataset splits can be found in datasets/annotations.

Note that we also need instances_2014.json for computing the PMRP score.

CUB Caption

Download the images from this link, and download the captions from reedscot/cvpr2016. You can specify the image path and the caption path separately in the code.

Evaluate pretrained models

NOTE: The current implementation of plausible match R-Precision (PMRP) is not efficient:
it first dumps the ranked items for every query to a local file, and then computes R-Precision from that file.
We are planning to re-implement PMRP more efficiently as soon as possible.
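For reference, R-Precision for a single query is the fraction of relevant items among the top R ranked items, where R is the number of relevant items for that query. The sketch below shows only this final step on an already ranked list. The function name and the example IDs are hypothetical, and how PMRP decides which items count as plausible matches is defined in the paper and not reproduced here.

```python
def r_precision(ranked_ids, relevant_ids):
    """Fraction of relevant items among the top-R results, with R = number of relevant items."""
    relevant = set(relevant_ids)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined for a query with no relevant items")
    top_r = ranked_ids[:r]
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r


# Hypothetical ranking for one query: two relevant captions, one of them in the top 2.
ranked = ["c3", "c1", "c7", "c2"]
relevant = {"c1", "c2"}
print(r_precision(ranked, relevant))
```

Averaging this value over all queries gives the score for one retrieval direction.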

COCO Caption

# Compute recall metrics
python evaluate_recall_coco.py ./config/coco/pcme_coco.yaml \
    --dataset_root  \
    --model_path model_last.pth \
    # --model__cache_dir /vector_cache # if you use my docker image

# Compute plausible match R-Precision (PMRP) metric
python extract_rankings_coco.py ./config/coco/pcme_coco.yaml \
    --dataset_root  \
    --model_path model_last.pth \
    --dump_to  \
    # --model__cache_dir /vector_cache # if you use my docker image

python evaluate_pmrp_coco.py --ranking_file

Method	I2T PMRP	I2T R@1	T2I PMRP	T2I R@1	Model file
PCME	45.0	68.8	46.0	54.6	link
PVSE K=1	40.3	66.7	41.8	53.5	-
PVSE K=2	42.8	69.2	43.6	55.2	-
VSRN	41.2	76.2	42.4	62.8	-
VSRN + AOQ	44.7	77.5	45.6	63.5	-

CUB Caption

python evaluate_cub.py ./config/cub/pcme_cub.yaml \
    --dataset_root  \
    --caption_root  \
    --model_path model_last.pth \
    # --model__cache_dir /vector_cache # if you use my docker image

NOTE: If you download the files from reedscot/cvpr2016 without changing them, caption_root will be cvpr2016_cub/text_c10.

If you want to test other probabilistic distances, such as Wasserstein distance or KL-divergence, try the following command:

python evaluate_cub.py ./config/cub/pcme_cub.yaml \
    --dataset_root  \
    --caption_root  \
    --model_path model_last.pth \
    --model__eval_method  \
    # --model__cache_dir /vector_cache # if you use my docker image

Pass the chosen distance_method to --model__eval_method. You can choose distance_method from ['elk', 'l2', 'min', 'max', 'wasserstein', 'kl', 'reverse_kl', 'js', 'bhattacharyya', 'matmul', 'matching_prob'].
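As an illustration of what two of these distances measure, the sketch below uses the standard closed forms for the squared 2-Wasserstein distance and the KL divergence between two Gaussians with diagonal covariance, which is how the PCME paper represents each embedding. The function names are hypothetical, and whether the repository's 'wasserstein' and 'kl' options use exactly these forms is an assumption; check the source before relying on it.

```python
import math


def wasserstein2_sq(mu1, sigma1, mu2, sigma2):
    """Squared 2-Wasserstein distance between diagonal Gaussians (sigma = std. dev.)."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    std_term = sum((s - t) ** 2 for s, t in zip(sigma1, sigma2))
    return mean_term + std_term


def kl_divergence(mu1, sigma1, mu2, sigma2):
    """KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) for diagonal Gaussians."""
    total = 0.0
    for m1, s1, m2, s2 in zip(mu1, sigma1, mu2, sigma2):
        total += math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5
    return total


# Two hypothetical 2-D embeddings that differ only in the first mean coordinate.
image = ([0.0, 0.0], [1.0, 1.0])
caption = ([1.0, 0.0], [1.0, 1.0])
print(wasserstein2_sq(*image, *caption))
print(kl_divergence(*image, *caption))
```

The Wasserstein distance is symmetric, while KL is not, which is why both 'kl' and 'reverse_kl' appear in the list.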

How to train

NOTE: We train each model with mixed-precision training (O2) on a single V100.
Because the current code does not support multi-GPU training, you will need to reduce the batch size if you use different hardware.
Please note that, as a consequence, the results may not be reproducible on a GPU smaller than a V100.

COCO Caption

python train_coco.py ./config/coco/pcme_coco.yaml --dataset_root  \
    # --model__cache_dir /vector_cache # if you use my docker image

Training takes about 46 hours on a single V100 with mixed-precision training.

CUB Caption

We use the CUB Caption dataset (Reed, et al. 2016) as a new cross-modal retrieval benchmark. Here, instead of matching only the sparsely annotated image-caption pairs, we treat all image-caption pairs in the same class as positive. Since our split is based on the zero-shot learning benchmark (Xian, et al. 2017), we hold out 50 of the 200 bird classes for evaluation.

Reed, Scott, et al. "Learning deep representations of fine-grained visual descriptions." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Xian, Yongqin, Bernt Schiele, and Zeynep Akata. "Zero-shot learning-the good, the bad and the ugly." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
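The class-based matching rule described above can be sketched as follows: an image and a caption count as a positive pair whenever they carry the same class label. The function name and the labels are hypothetical and serve only to illustrate the rule; the repository's data pipeline may build this information differently.

```python
def class_match_matrix(image_classes, caption_classes):
    """matrix[i][j] is True when image i and caption j belong to the same class."""
    return [
        [img_cls == cap_cls for cap_cls in caption_classes]
        for img_cls in image_classes
    ]


# Hypothetical labels: three images and four captions drawn from two bird classes.
image_classes = [3, 3, 7]
caption_classes = [3, 7, 7, 3]
for row in class_match_matrix(image_classes, caption_classes):
    print(row)
```

With this rule every image has several positive captions, unlike the one-to-few pairing in the original annotations.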

Hyperparameter search

We additionally use the cross-validation splits of (Xian, et al. 2017), namely 100 classes for training and 50 classes for validation.

python train_cub.py ./config/cub/pcme_cub.yaml \
    --dataset_root  \
    --caption_root  \
    --dataset_name cub_trainval1 \
    # --model__cache_dir /vector_cache # if you use my docker image

Similarly, you can use cub_trainval2 and cub_trainval3 as well.

Training with full training classes

python train_cub.py ./config/cub/pcme_cub.yaml \
    --dataset_root  \
    --caption_root  \
    # --model__cache_dir /vector_cache # if you use my docker image

Training takes about 4 hours on a single V100 with mixed-precision training.

How to cite

@inproceedings{chun2021pcme,
    title={Probabilistic Embeddings for Cross-Modal Retrieval},
    author={Chun, Sanghyuk and Oh, Seong Joon and De Rezende, Rafael Sampaio and Kalantidis, Yannis and Larlus, Diane},
    year={2021},
    booktitle={Conference on Computer Vision and Pattern Recognition (CVPR)},
}

License

MIT License

Copyright (c) 2021-present NAVER Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.

Comments

confused about loss function

Great work! But I am confused about some details in loss function.

-((logit * matched - torch.stack((logit, -logit), dim=2).logsumexp(dim=2, keepdim=False)).logsumexp(dim=1)) + np.log(logit.size(1))

I guess this part is equivalent to y = 1 / (1 + exp(-2x)), which is a compressed sigmoid, not the sigmoid itself. Although this will not affect performance, it may report wrong learned parameters (a, b).

In addition, the reduction is always 'sum', whatever self.reduction is set to.
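The identity behind this observation can be checked numerically. For matched in {+1, -1}, the inner term logit * matched - logsumexp(logit, -logit) of the quoted expression equals log(1 / (1 + exp(-2 * matched * logit))), that is, the log-sigmoid of twice the signed logit. The sketch below uses scalar stand-ins for the tensors with hypothetical helper names, and covers only that inner term, not the outer logsumexp over dim=1.

```python
import math


def log_match_term(logit, matched):
    """logit * matched - logsumexp(logit, -logit), with matched in {+1, -1}."""
    peak = abs(logit)
    # Numerically stable logsumexp over the pair (logit, -logit).
    lse = peak + math.log(math.exp(logit - peak) + math.exp(-logit - peak))
    return logit * matched - lse


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


for logit in (-3.0, -0.5, 0.0, 1.2, 4.0):
    for matched in (1, -1):
        lhs = log_match_term(logit, matched)
        rhs = log_sigmoid(2 * matched * logit)
        assert math.isclose(lhs, rhs, rel_tol=1e-9, abs_tol=1e-12)
print("inner term matches log-sigmoid(2 * matched * logit)")
```

If logit has the form a * d + b, the factor of 2 can be absorbed into a and b, which is consistent with the remark that performance is unaffected while the reported parameters differ from those of a plain sigmoid.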

opened by helson73 8

.t7 caption files

I followed the README to download the CUB dataset, but the captions there are not suitable for this code, because the caption files are .t7 files, not .txt files.

opened by YangYang 6

Codebase/models as baseline: Performances with other losses

Hi,

First of all, great work!

In my opinion, the codebase of this work can serve as a solid baseline for trying new optimization functions for image-text matching.

I was wondering, do you have any results on the performance of other optimization functions, such as the Triplet loss, the Triplet loss with semi-hard negatives, or the InfoNCE loss, using this codebase? In the paper, PCME is compared with several other methods, such as VSRN, PVSE and VSE++. However, these methods are optimized with the Triplet loss with semi-hard negatives. Do you have any insights on how much better PCME is compared to the InfoNCE/Triplet loss (SH) when optimizing with the same method and training hyper-parameters?

Thanks, Maurits

opened by MauritsBleeker 2

loss functions

Hi, I found your work to be very interesting, but I am a bit confused about your loss functions. You computed the i2t_loss and t2i_loss separately, but aren't they the same? Am I getting something wrong?

opened by jinhyunj 2
