More Grounded Image Captioning by Distilling Image-Text Matching Model

Overview

More Grounded Image Captioning by Distilling Image-Text Matching Model

Requirements

  • Python 3.7
  • Pytorch 1.2
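
A minimal environment sketch (assuming conda; everything beyond the two requirements above, such as the torchvision version and h5py, is an assumption and may need adjusting):

conda create -n grounded-caption python=3.7
conda activate grounded-caption
pip install torch==1.2.0 torchvision==0.4.0 h5py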

Prepare data

  1. Please use git clone --recurse-submodules to clone this repository and remember to follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also, download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory.
  2. Download the preprocessed dataset from this link and extract it to data/.
  3. For Flickr30k-Entities, please download the bottom-up visual features extracted by Anderson's extractor (or by Zhou's extractor) from this link (or this link, respectively) and place the uncompressed folders under data/flickrbu/. For MSCOCO, please follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
  4. Download the pretrained models from here and extract them to log/.
  5. Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.
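
After these steps, the directory layout should look roughly like the sketch below (paths taken from the commands in this README; the contents of each folder are not listed):

data/
  flickrtalk.json
  flickrtalk_label.h5
  cocotalk.json
  cocotalk_label.h5
  flickrbu/
    flickrbu_fc/
    flickrbu_att/
    flickrbu_box/
  mscoco/
log/                        (pretrained captioning models)
misc/SCAN/runs/             (pretrained SCAN models)
coco-caption/annotations/   (Flickr30k reference file)
tools/                      (uncompressed Stanford CoreNLP 3.9.1 folder)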

Evaluation

To reproduce the results reported in the paper, simply run

bash eval_flickr.sh

for Flickr30k-Entities and

bash eval_coco.sh

for MSCOCO.

Training

  1. In the first training stage, run a command like:
python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att  --input_box_dir data/flickrbu/flickrbu_box  --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30  --att_supervise  True   --att_supervise_weight 0.1
  2. In the second training stage, run a command like:
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
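
The two stages above can also be collected into a single script (a sketch that simply mirrors the flags shown above; the hyper-parameters are the Flickr30k-Entities settings and would need to be adapted for MSCOCO):

#!/bin/bash
# Stage 1: cross-entropy training with attention supervision distilled from SCAN.
python train.py --id CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_label_h5 data/flickrtalk_label.h5 \
    --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att \
    --input_box_dir data/flickrbu/flickrbu_box \
    --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 \
    --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl \
    --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 \
    --att_supervise True --att_supervise_weight 0.1

# Stage 2: self-critical fine-tuning with CIDEr and grounding rewards,
# resuming from the stage-1 checkpoint via --start_from.
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_label_h5 data/flickrtalk_label.h5 \
    --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att \
    --input_box_dir data/flickrbu/flickrbu_box \
    --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl \
    --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 \
    --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 \
    --cider_reward_weight 1 --ground_reward_weight 1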

Citation

@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and  Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Acknowledgements

This repository is built upon self-critical.pytorch, SCAN and grounded-video-description. Thanks to the authors for releasing their code.

Comments
  • about reproducing performance

    about reproducing performance

    Thanks for open-sourcing the code. The provided pretrained model achieves the results claimed in your paper, but I fail to reproduce them when I train the model on the Flickr dataset myself.

    opened by wanboyang 7
  • Training Detail

    Training Detail

    Hi, I am trying to reproduce the baseline in your paper (Top-Down on Flickr30k, with CIDEr optimization in the RL phase). I have used several codebases, including yours, but failed: my CIDEr score is around 60 and cannot reach the result in your paper (over 67). Could you share the detailed training settings?

    Thanks!

    opened by SmallYour 3
  • Question about missing infos_CE-gt-sup-0.1-nll.pkl and model.pth files when reproducing the Grounded-Image-Captioning code

    Question about missing infos_CE-gt-sup-0.1-nll.pkl and model.pth files when reproducing the Grounded-Image-Captioning code

    Hello, I am a student at the School of Computer Science, Beijing Institute of Technology. I have recently been studying your CVPR 2020 paper, More Grounded Image Captioning by Distilling Image-Text Matching Model. While reproducing the code, I set up the environment as described in the README and ran the evaluation part successfully. However, during training the file /log/CE-gt-sup-0.1-nll/infos_CE-gt-sup-0.1-nll.pkl is missing; the /log/CE-gt-sup-0.1-nll folder only contains infos_CE-gt-sup-0.1-nll-best.pkl. The error message is as follows: image

    I then assumed the two files were equivalent, so I renamed infos_CE-gt-sup-0.1-nll-best.pkl and model-best.pth to infos_CE-gt-sup-0.1-nll.pkl and model.pth as expected by the code, but got the following error instead: image. So I would like to ask whether log.tar.gz is missing these two files, infos_CE-gt-sup-0.1-nll.pkl and model.pth. Looking forward to your guidance, and thank you very much.

    opened by mrldj 3
  • Infer on new images, dataloaderraw does not return box_feats

    Infer on new images, dataloaderraw does not return box_feats

    Hi, thanks for your work. I would like to use your code on new raw images; how can I do that? Starting from vis_attn_cap.ipynb, I modified this part: opt = parser.parse_args(args=["--image_folder","my_images_folder"]), and it then uses dataloaderraw.

    The problem is dataloaderraw doesn't yield box_feats. How do I get the box_feats? Thanks :)

    opened by vinson2233 0
  • Evaluation results

    Evaluation results

    Hi,

    I ran your evaluation on both F30k and COCO. The F30k results are exactly right, but the COCO results are not quite correct. The only detail I find strange is misc/SCAN/runs/f30k_SCAN_POS1. Could you please provide the corresponding model under misc/SCAN/runs/coco..., or point out what I might have done wrong? Many thanks.

    Results Reported in Table 4. Eval Up-Down+XE
    Constructing SCAN model...
    scan_model_path:misc/SCAN/runs/f30k_SCAN_POS1/checkpoint/model_best.pth.tar
    Done
    DataLoader loading json file: data/cocotalk.json
    vocab size is 9487
    DataLoader loading h5 file: data/mscoco/cocobu_fc data/mscoco/cocobu_att data/mscoco/cocobu_box data/cocotalk_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    loading annotations into memory...
    Done (t=0.59s)
    creating index...
    index created!
    using 5000/5000 predictions
    Loading and preparing results...
    DONE (t=0.05s)
    creating index...
    index created!
    tokenization...
    PTBTokenizer tokenized 307085 tokens at 801422.82 tokens per second.
    PTBTokenizer tokenized 51973 tokens at 244262.23 tokens per second.
    setting up scorers...
    computing Bleu score...
    {'testlen': 46974, 'reflen': 46862, 'guess': [46974, 41974, 36974, 31974], 'correct': [35147, 19145, 9763, 4961]}
    ratio: 1.0023899961589133
    Bleu_1: 0.748
    Bleu_2: 0.584
    Bleu_3: 0.448
    Bleu_4: 0.344
    computing METEOR score...
    METEOR: 0.269
    computing Rouge score...
    ROUGE_L: 0.554
    computing CIDEr score...
    CIDEr: 1.084

    opened by detectiveli 4
  • Key Error during training

    Key Error during training

    Hi,

    I am trying to use your codebase for some experiments and during training I get the following error:

    $ python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1
    Constructing SCAN model...
    scan_model_path:misc/SCAN/runs/f30k_SCAN_POS1/checkpoint/model_best.pth.tar
    Done
    tensorboardX is not installed
    DataLoader loading json file: data/flickrtalk.json
    vocab size is 7000
    DataLoader loading h5 file: data/flickrbu/flickrbu_fc data/flickrbu/flickrbu_att data/flickrbu/flickrbu_box data/flickrtalk_label.h5
    max sequence length in data is 16
    read 31014 image features
    assigned 29000 images to split train
    assigned 1014 images to split val
    assigned 1000 images to split test
    Read data: 0.7768440246582031
    Traceback (most recent call last):
      File "train.py", line 291, in <module>
        train(opt)
      File "train.py", line 180, in train
        model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag, box_inds)
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 83, in parallel_apply
        raise output
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 59, in _worker
        output = module(*input, **kwargs)
      File "/users/vlad/anaconda3/envs/gvd/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
        result = self.forward(*input, **kwargs)
      File "/users/vlad/image_grounding/Grounded-Image-Captioning/misc/loss_wrapper.py", line 45, in forward
        _, grd_weights, noun_mask = get_self_critical_reward(self.model, fc_feats, att_feats, att_masks, gts, labels[:,1:].detach(), vars(self.opt))
      File "/users/vlad/image_grounding/Grounded-Image-Captioning/misc/rewards.py", line 68, in get_self_critical_reward
        gts = {i: gts[i % batch_size // seq_per_img] for i in range(2 * batch_size)}
      File "/users/vlad/image_grounding/Grounded-Image-Captioning/misc/rewards.py", line 68, in <dictcomp>
        gts = {i: gts[i % batch_size // seq_per_img] for i in range(2 * batch_size)}
    KeyError: 29
    Terminating BlobFetcher

    I followed all the steps described and I am able to run the evaluation, but for training I get the above error. After some investigation, I think a reshape is missing somewhere, since the dataloader loads multiple captions per image but this doesn't seem to be reflected here: https://github.com/YuanEZhou/Grounded-Image-Captioning/blob/77295a6e36de817f173435e809effc3396469ee3/misc/rewards.py#L70 Any suggestions on how to overcome this?

    Thanks!

    opened by vladbogo 2