More Grounded Image Captioning by Distilling Image-Text Matching Model
Requirements
- Python 3.7
- PyTorch 1.2
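The repo pins only these two versions. A minimal sketch of a matching environment, assuming conda and a CUDA 10.0 toolchain (the environment name and the torchvision/h5py pins are our assumptions, not from the repo):

```bash
# Hypothetical environment setup; only Python 3.7 and PyTorch 1.2 are required by the repo
conda create -n grounded-captioning python=3.7 -y
conda activate grounded-captioning
# PyTorch 1.2 historically pairs with torchvision 0.4; pick the cudatoolkit matching your driver
conda install pytorch=1.2 torchvision=0.4 cudatoolkit=10.0 -c pytorch -y
pip install h5py   # the label files (e.g. data/flickrtalk_label.h5) are HDF5
```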
Prepare data
- Please use git clone --recurse-submodules to clone this repository, and remember to follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory (a consolidated setup sketch follows this list).
- Download the preprocessed dataset from this link and extract it to data/.
- For Flickr30k-Entities, please download the bottom-up visual features extracted by Anderson's extractor (or Zhou's extractor) from this link (link) and place the uncompressed folders under data/flickrbu/. For MSCOCO, please follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
- Download the pretrained models from here and extract them to log/.
- Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.
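Put together, the data-preparation steps above look roughly like the sketch below. The download URLs for the dataset, features, and pretrained models are elided in the list above, so they stay as placeholders here; the CoreNLP archive name assumes the standard 3.9.1 release:

```bash
# Clone with the coco-caption submodule, then follow the init steps in coco-caption/README.md
git clone --recurse-submodules <this-repo-url>
cd <repo-name>

# Stanford CoreNLP 3.9.1 for grounding evaluation (assumed standard release archive)
wget https://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip -d tools/

# Expected layout for the downloaded archives (links elided above)
mkdir -p data/flickrbu data/mscoco log misc/SCAN/runs
# - Flickr30k reference file     -> coco-caption/annotations/
# - preprocessed dataset         -> data/
# - bottom-up features           -> data/flickrbu/ (Flickr30k) or data/mscoco/ (MSCOCO)
# - pretrained captioning models -> log/
# - pretrained SCAN models       -> misc/SCAN/runs/
```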
Evaluation
To reproduce the results reported in the paper, simply run
bash eval_flickr.sh
for Flickr30k-Entities and
bash eval_coco.sh
for MSCOCO.
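Both scripts assume the pretrained models are already under log/. Since the codebase builds on self-critical.pytorch, evaluating a single checkpoint presumably reduces to a call of roughly this shape (the flag names follow self-critical.pytorch; the checkpoint paths under log/ are illustrative, not confirmed by the repo):

```bash
# Hypothetical direct invocation; paths are assumptions based on the training ids above
python eval.py --model log/sc-ground-CE-scan-sup-0.1kl/model-best.pth \
    --infos_path log/sc-ground-CE-scan-sup-0.1kl/infos_sc-ground-CE-scan-sup-0.1kl-best.pkl \
    --language_eval 1 --num_images -1
```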
Training
- In the first training stage, run a command like
python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1
- In the second training stage, run a command like
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
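To run both stages end to end, the two commands can be chained in one script. This is purely a convenience wrapper replaying the exact commands above; stage 2 warm-starts from the stage-1 checkpoint directory via --start_from:

```bash
#!/usr/bin/env bash
set -e  # abort if stage 1 fails

# Stage 1: cross-entropy training with SCAN attention supervision (KL weight 0.1)
python train.py --id CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc \
    --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box \
    --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 \
    --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
    --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 \
    --val_images_use -1 --max_epochs 30 \
    --att_supervise True --att_supervise_weight 0.1

# Stage 2: self-critical training with CIDEr and grounding rewards,
# resumed from the stage-1 checkpoint
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc \
    --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box \
    --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 \
    --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl \
    --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 \
    --language_eval 1 --val_images_use -1 --self_critical_after 30 \
    --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
```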
Citation
@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}
Acknowledgements
This repository is built upon self-critical.pytorch, SCAN, and grounded-video-description. We thank the authors for releasing their code.