Diverse Image Captioning with Context-Object Split Latent Spaces
This repository is the PyTorch implementation of the paper:
Diverse Image Captioning with Context-Object Split Latent Spaces (NeurIPS 2020)
Shweta Mahajan and Stefan Roth
We additionally include evaluation code from Luo et al. in the folder GoogleConceptualCaptioning, which has been patched for compatibility.
Requirements
The code in this repository is written in Python 3.6.10 and uses CUDA 9.0.
Requirements:
- torch 1.1.0
- torchvision 0.3.0
- nltk 3.5
- inflect 4.1.0
- tqdm 4.46.0
- sklearn 0.0
- h5py 2.10.0
To install requirements:
```
conda config --add channels pytorch
conda config --add channels anaconda
conda config --add channels conda-forge
conda config --add channels conda-forge/label/cf202003
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>
```
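As a quick sanity check (not part of the original instructions), you can confirm that the environment exposes the expected PyTorch and CUDA setup:

```python
# Sanity check: verify the installed versions match the requirements above.
import torch
import torchvision

print(torch.__version__)          # expected: 1.1.0
print(torchvision.__version__)    # expected: 0.3.0
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine
```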
Preprocessed data
The dataset used in this project for assessing accuracy and diversity is COCO 2014 (m-RNN split). The full dataset is available here.
We use the Faster R-CNN features for the images, similar to Anderson et al. We additionally require the "classes"/"scores" fields detected for the image regions. The classes correspond to Visual Genome classes.
Download instructions
Preprocessed training data is available here as hdf5 files. The provided hdf5 files contain the following fields:
- image_id: ID of the COCO image
- num_boxes: The number of proposal regions detected by Faster R-CNN
- features: ResNet-101 features of the extracted regions
- classes: Visual Genome classes of the extracted regions
- scores: Scores of the Visual Genome classes of the extracted regions
Note that the ["image_id","num_boxes","features"] fields are identical to those of Anderson et al.
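If you want to inspect the downloaded files, a minimal sketch using h5py is shown below; the exact internal layout of the hdf5 files is an assumption here, so adapt the keys to what the files actually contain:

```python
import h5py

# Sketch: list every dataset stored in one of the preprocessed feature files
# together with its shape and dtype.
with h5py.File("coco/coco_val_2014_adaptive_withclasses.h5", "r") as f:
    def show(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
    f.visititems(show)
```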
Create a folder named coco and download the preprocessed training and test datasets from the coco folder in the drive link above as follows (it is also possible to directly download the entire coco folder from the drive link):
- Download the following files for training on COCO 2014 (m-RNN split):
coco/coco_train_2014_adaptive_withclasses.h5
coco/coco_val_2014_adaptive_withclasses.h5
coco/coco_val_mRNN.txt
coco/coco_test_mRNN.txt
- Download the following files for training on held-out COCO (novel object captioning):
coco/coco_train_2014_noc_adaptive_withclasses.h5
coco/coco_train_extra_2014_noc_adaptive_withclasses.h5
- Download the following files for testing on held-out COCO (novel object captioning):
coco/coco_test_2014_noc_adaptive_withclasses.h5
- Download the (caption) annotation files and place them in a subdirectory coco/annotations (mirroring the Google drive folder structure)
coco/annotations/captions_train2014.json
coco/annotations/captions_val2014.json
- Download the following files from the drive link into a separate folder data (outside coco). These files contain the contextual neighbours for pseudo supervision:
data/nn_final.pkl
data/nn_noc.pkl
For running the train/test scripts (described below), "pathToData"/"nn_dict_path" in params.json and params_noc.json need to be set to the coco and data folders created above (see the sketch below).
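A minimal sketch of setting these paths programmatically (assuming params.json lies in the repository root and the folder layout from the steps above):

```python
import json

# Point the config at the folders created above; adjust paths to your setup.
with open("params.json") as f:
    params = json.load(f)

params["pathToData"] = "./coco"    # preprocessed hdf5 files and annotations
params["nn_dict_path"] = "./data"  # contextual neighbours for pseudo supervision

with open("params.json", "w") as f:
    json.dump(params, f, indent=2)
```

The same two keys need to be updated in params_noc.json for the novel object captioning experiments.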
Verify Folder Structure after Download
The folder structure of coco and data after the data download should be as follows:
coco
- annotations
  - captions_train2014.json
  - captions_val2014.json
- coco_val_mRNN.txt
- coco_test_mRNN.txt
- coco_train_2014_adaptive_withclasses.h5
- coco_val_2014_adaptive_withclasses.h5
- coco_train_2014_noc_adaptive_withclasses.h5
- coco_train_extra_2014_noc_adaptive_withclasses.h5
- coco_test_2014_noc_adaptive_withclasses.h5

data
- coco_classname.txt
- visual_genome_classes.txt
- vocab_coco_full.pkl
- nn_final.pkl
- nn_noc.pkl
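A small helper script (hypothetical, not part of the repository) can be used to confirm that every file listed above is in place before training:

```python
import os

# Check that all files from the listing above exist.
expected = [
    "coco/annotations/captions_train2014.json",
    "coco/annotations/captions_val2014.json",
    "coco/coco_val_mRNN.txt",
    "coco/coco_test_mRNN.txt",
    "coco/coco_train_2014_adaptive_withclasses.h5",
    "coco/coco_val_2014_adaptive_withclasses.h5",
    "coco/coco_train_2014_noc_adaptive_withclasses.h5",
    "coco/coco_train_extra_2014_noc_adaptive_withclasses.h5",
    "coco/coco_test_2014_noc_adaptive_withclasses.h5",
    "data/coco_classname.txt",
    "data/visual_genome_classes.txt",
    "data/vocab_coco_full.pkl",
    "data/nn_final.pkl",
    "data/nn_noc.pkl",
]

missing = [p for p in expected if not os.path.exists(p)]
print("All files present." if not missing else f"Missing files: {missing}")
```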
Training
Please follow these instructions for training:
- Set hyperparameters for training in params.json and params_noc.json.
- Train a model on COCO 2014 for captioning:
python ./scripts/train.py
- Train a model for diverse novel object captioning:
python ./scripts/train_noc.py
Please note that the data folder provides the required vocabulary.
Memory requirements
The models were trained on a single NVIDIA V100 GPU with 32 GB of memory. 16 GB is sufficient for training a single run.
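To check how much memory your GPU provides, the following one-off snippet (not part of the repository) queries the device properties:

```python
import torch

# Report the total memory of the first visible GPU; at least 16 GB is
# sufficient for a single training run according to the note above.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB")
```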
Pre-trained models and evaluation
We provide pre-trained models for both captioning on COCO 2014 (m-RNN split) and novel object captioning. Please follow these steps:
- Download the pre-trained models from here to the ckpts folder.
- For evaluation of oracle scores and diversity, we follow Luo et al. In the folder GoogleConceptualCaptioning, download cider, and in the cococaption folder run the download scripts:
```
./GoogleConceptualCaptioning/cococaption/get_google_word2vec_model.sh
./GoogleConceptualCaptioning/cococaption/get_stanford_models.sh
python ./scripts/eval.py
```
- For diversity evaluation, create the required numpy file for consensus re-ranking using:
python ./scripts/eval_diversity.py
For consensus re-ranking, follow the steps here. To obtain the final diversity scores, follow the instructions of DiversityMetrics: convert the numpy file to the required JSON format and run the script evalscripts.py (see the conversion sketch after this list).
- To evaluate the F1 score for novel object captioning:
python ./scripts/eval_noc.py
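The numpy-to-JSON conversion mentioned above is not spelled out in the repository. A hedged sketch, assuming the numpy file stores a dict mapping image ids to lists of sampled captions (the actual structure produced by eval_diversity.py, as well as the file names used here, may differ):

```python
import json
import numpy as np

# Sketch: convert the saved numpy file of sampled captions into a JSON list.
captions = np.load("diversity_captions.npy", allow_pickle=True).item()

records = [{"image_id": int(img_id), "captions": list(caps)}
           for img_id, caps in captions.items()]

with open("diversity_captions.json", "w") as f:
    json.dump(records, f)
```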
Results
Oracle evaluation on the COCO dataset
| | B4 | B3 | B2 | B1 | CIDEr | METEOR | ROUGE | SPICE |
|---|---|---|---|---|---|---|---|---|
| COS-CVAE | 0.633 | 0.739 | 0.842 | 0.942 | 1.893 | 0.450 | 0.770 | 0.339 |
Diversity evaluation on the COCO dataset
| | Unique | Novel | mBLEU | Div-1 | Div-2 |
|---|---|---|---|---|---|
| COS-CVAE | 96.3 | 4404 | 0.53 | 0.39 | 0.57 |
F1-score evaluation on the held-out COCO dataset
| | bottle | bus | couch | microwave | pizza | racket | suitcase | zebra | average |
|---|---|---|---|---|---|---|---|---|---|
| COS-CVAE | 35.4 | 83.6 | 53.8 | 63.2 | 86.7 | 69.5 | 46.1 | 81.7 | 65.0 |
Bibtex
@inproceedings{coscvae20neurips,
title = {Diverse Image Captioning with Context-Object Split Latent Spaces},
author = {Mahajan, Shweta and Roth, Stefan},
booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
year = {2020}
}