NSGDC
Some code in this repo is copied or modified from open-source implementations made available by UNITER, PyTorch, HuggingFace, OpenNMT, and Nvidia. The image features are extracted using BUTD.
Requirements
The requirements follow UNITER. We provide a Docker image for easier reproduction. Please install the following:
- nvidia driver (418+),
- Docker (19.03+),
- nvidia-container-toolkit.
Our scripts require the user to be a member of the docker group so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs. We test on Ubuntu 18.04 and V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.
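The requirements above can be checked before launching anything. The following is a minimal sketch, not a script shipped with this repo: the function names `version_ge` and `check_prereqs` are hypothetical, and the version comparison assumes GNU `sort -V` is available.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of this repo.

# version_ge A B: succeeds when version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

check_prereqs() {
    local missing=0

    # NVIDIA driver: 418 or newer.
    if command -v nvidia-smi >/dev/null 2>&1; then
        local drv
        drv="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
        version_ge "$drv" "418" || { echo "driver '$drv' is older than 418"; missing=1; }
    else
        echo "nvidia-smi not found"; missing=1
    fi

    # Docker client: 19.03 or newer.
    if command -v docker >/dev/null 2>&1; then
        local dkr
        dkr="$(docker version --format '{{.Client.Version}}' 2>/dev/null)"
        version_ge "$dkr" "19.03" || { echo "docker '$dkr' is older than 19.03"; missing=1; }
    else
        echo "docker not found"; missing=1
    fi

    # Membership in the docker group lets docker run without sudo.
    id -nG | tr ' ' '\n' | grep -qx docker || { echo "user is not in the docker group"; missing=1; }

    return "$missing"
}

check_prereqs && echo "prerequisites look fine" || echo "see the messages above"
```

The sketch only reports problems; it does not check for nvidia-container-toolkit, which is easiest to verify by running a GPU container.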
Image-Text Retrieval
Download Data
bash scripts/download_itm.sh $PATH_TO_STORAGE
Launch the Docker Container
# docker image should be automatically pulled
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
$PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained
In case you would like to reproduce the whole preprocessing pipeline.
The launch script respects the $CUDA_VISIBLE_DEVICES environment variable. Note that the source code is mounted into the container under /src instead of being built into the image, so that user modifications are reflected without rebuilding the image. (Data folders are mounted into the container separately for flexibility in folder structure.)
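As an illustration of the $CUDA_VISIBLE_DEVICES behavior, the sketch below restricts the container to two GPUs before launching. The device indices are examples only, and `visible_gpu_count` is a hypothetical helper added here for a quick sanity check; it is not part of the launch script.

```shell
#!/usr/bin/env bash
# visible_gpu_count: number of device indices listed in CUDA_VISIBLE_DEVICES.
# Prints 0 when the variable is unset or empty (meaning no restriction is set).
visible_gpu_count() {
    printf '%s' "${CUDA_VISIBLE_DEVICES:-}" | tr ',' '\n' | grep -c . || true
}

# Illustrative choice: expose only GPUs 0 and 1 to the container.
export CUDA_VISIBLE_DEVICES=0,1
echo "restricting the container to $(visible_gpu_count) GPU(s): $CUDA_VISIBLE_DEVICES"

# Launch only when run from the repo root, where launch_container.sh lives.
if [ -f launch_container.sh ]; then
    source launch_container.sh "$PATH_TO_STORAGE/txt_db" "$PATH_TO_STORAGE/img_db" \
        "$PATH_TO_STORAGE/finetune" "$PATH_TO_STORAGE/pretrained"
fi
```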
Image-Text Retrieval (Flickr30k)
# Train with the base setting
bash run_cmds/tran_pnsgd_base_flickr.sh
bash run_cmds/tran_pnsgd2_base_flickr.sh
# Train with the large setting
bash run_cmds/tran_pnsgd_large_flickr.sh
bash run_cmds/tran_pnsgd2_large_flickr.sh
Image-Text Retrieval (COCO)
# Train with the base setting
bash run_cmds/tran_pnsgd_base_coco.sh
bash run_cmds/tran_pnsgd2_base_coco.sh
# Train with the large setting
bash run_cmds/tran_pnsgd_large_coco.sh
bash run_cmds/tran_pnsgd2_large_coco.sh
Run Inference
bash run_cmds/inf_nsgd.sh
Results
Our models achieve the following performance.
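For reference, R@K is read here as the standard Recall@K retrieval metric: the percentage of queries for which at least one ground-truth match appears among the top K retrieved items. The notation below is introduced for illustration and is not taken from the repo: $Q$ is the set of queries and $\operatorname{rank}(q)$ is the rank of the best-placed ground-truth item for query $q$.

```latex
\mathrm{R@}K \;=\; \frac{100}{|Q|} \sum_{q \in Q} \mathbb{1}\!\left[\operatorname{rank}(q) \le K\right]
```

Under this reading, higher is better, and R@1 ≤ R@5 ≤ R@10 holds for any model.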
MS-COCO
Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 |
---|---|---|---|---|---|---|
NSGDC-Base | 66.6 | 88.6 | 94.0 | 51.6 | 79.1 | 87.5 |
NSGDC-Large | 67.8 | 89.6 | 94.2 | 53.3 | 80.0 | 88.0 |
Flickr30K
Model | Image-to-Text R@1 | Image-to-Text R@5 | Image-to-Text R@10 | Text-to-Image R@1 | Text-to-Image R@5 | Text-to-Image R@10 |
---|---|---|---|---|---|---|
NSGDC-Base | 87.9 | 98.1 | 99.3 | 74.5 | 93.3 | 96.3 |
NSGDC-Large | 90.6 | 98.8 | 99.1 | 77.3 | 94.3 | 97.3 |