More Grounded Image Captioning by Distilling Image-Text Matching Model
Requirements
- Python 3.7
- PyTorch 1.2
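The repo pins only these two versions. A minimal sketch of a matching environment, assuming conda and a CUDA 10.0 toolchain (the environment name and the torchvision/h5py pins are our assumptions, not from the repo):

```bash
# Hypothetical environment setup; only Python 3.7 and PyTorch 1.2 are required by the repo
conda create -n grounded-captioning python=3.7 -y
conda activate grounded-captioning
# PyTorch 1.2 historically pairs with torchvision 0.4; pick the cudatoolkit matching your driver
conda install pytorch=1.2 torchvision=0.4 cudatoolkit=10.0 -c pytorch -y
pip install h5py   # the label files (e.g. data/flickrtalk_label.h5) are HDF5
```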
Prepare data
- Please use git clone --recurse-submodules to clone this repository, and remember to follow the initialization steps in coco-caption/README.md. Then download the Flickr30k reference file and place it under coco-caption/annotations. Also download Stanford CoreNLP 3.9.1 for grounding evaluation and place the uncompressed folder under the tools/ directory (a consolidated setup sketch follows this list).
- Download the preprocessed dataset from this link and extract it to data/.
- For Flickr30k-Entities, please download the bottom-up visual features extracted by Anderson's extractor (or Zhou's extractor) from this link (link) and place the uncompressed folders under data/flickrbu/. For MSCOCO, please follow this instruction to prepare the bottom-up features and place them under data/mscoco/.
- Download the pretrained models from here and extract them to log/.
- Download the pretrained SCAN models from this link and extract them to misc/SCAN/runs.
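Put together, the data-preparation steps above look roughly like the sketch below. The download URLs for the dataset, features, and pretrained models are elided in the list above, so they stay as placeholders here; the CoreNLP archive name assumes the standard 3.9.1 release:

```bash
# Clone with the coco-caption submodule, then follow the init steps in coco-caption/README.md
git clone --recurse-submodules <this-repo-url>
cd <repo-name>

# Stanford CoreNLP 3.9.1 for grounding evaluation (assumed standard release archive)
wget https://nlp.stanford.edu/software/stanford-corenlp-full-2018-02-27.zip
unzip stanford-corenlp-full-2018-02-27.zip -d tools/

# Expected layout for the downloaded archives (links elided above)
mkdir -p data/flickrbu data/mscoco log misc/SCAN/runs
# - Flickr30k reference file     -> coco-caption/annotations/
# - preprocessed dataset         -> data/
# - bottom-up features           -> data/flickrbu/ (Flickr30k) or data/mscoco/ (MSCOCO)
# - pretrained captioning models -> log/
# - pretrained SCAN models       -> misc/SCAN/runs/
```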
Evaluation
To reproduce the results reported in the paper, simply run
bash eval_flickr.sh
for Flickr30k-Entities and
bash eval_coco.sh
for MSCOCO.
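Both scripts assume the pretrained models are already under log/. Since the codebase builds on self-critical.pytorch, evaluating a single checkpoint presumably reduces to a call of roughly this shape (the flag names follow self-critical.pytorch; the checkpoint paths under log/ are illustrative, not confirmed by the repo):

```bash
# Hypothetical direct invocation; paths are assumptions based on the training ids above
python eval.py --model log/sc-ground-CE-scan-sup-0.1kl/model-best.pth \
    --infos_path log/sc-ground-CE-scan-sup-0.1kl/infos_sc-ground-CE-scan-sup-0.1kl-best.pkl \
    --language_eval 1 --num_images -1
```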
Training
- In the first training stage, run a command like
python train.py --id CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 --val_images_use -1 --max_epochs 30 --att_supervise True --att_supervise_weight 0.1
- In the second training stage, run a command like
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 --language_eval 1 --val_images_use -1 --self_critical_after 30 --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
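To run both stages end to end, the two commands can be chained in one script. This is purely a convenience wrapper replaying the exact commands above; stage 2 warm-starts from the stage-1 checkpoint directory via --start_from:

```bash
#!/usr/bin/env bash
set -e  # abort if stage 1 fails

# Stage 1: cross-entropy training with SCAN attention supervision (KL weight 0.1)
python train.py --id CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc \
    --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box \
    --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 \
    --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 \
    --checkpoint_path log/CE-scan-sup-0.1kl --save_checkpoint_every 1000 \
    --val_images_use -1 --max_epochs 30 \
    --att_supervise True --att_supervise_weight 0.1

# Stage 2: self-critical training with CIDEr and grounding rewards,
# resumed from the stage-1 checkpoint
python train.py --id sc-ground-CE-scan-sup-0.1kl --caption_model topdown \
    --input_json data/flickrtalk.json --input_fc_dir data/flickrbu/flickrbu_fc \
    --input_att_dir data/flickrbu/flickrbu_att --input_box_dir data/flickrbu/flickrbu_box \
    --input_label_h5 data/flickrtalk_label.h5 --batch_size 29 \
    --learning_rate 5e-5 --start_from log/CE-scan-sup-0.1kl \
    --checkpoint_path log/sc-ground-CE-scan-sup-0.1kl --save_checkpoint_every 1000 \
    --language_eval 1 --val_images_use -1 --self_critical_after 30 \
    --max_epochs 110 --cider_reward_weight 1 --ground_reward_weight 1
```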
Citation
@inproceedings{zhou2020grounded,
  title={More Grounded Image Captioning by Distilling Image-Text Matching Model},
  author={Zhou, Yuanen and Wang, Meng and Liu, Daqing and Hu, Zhenzhen and Zhang, Hanwang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  year={2020}
}
Acknowledgements
This repository is built upon self-critical.pytorch, SCAN, and grounded-video-description. We thank the authors for releasing their code.