Official PyTorch implementation of the paper Dual-Level Collaborative Transformer for Image Captioning (AAAI 2021).

Overview

Dual-Level Collaborative Transformer for Image Captioning

This repository contains the reference code for the paper Dual-Level Collaborative Transformer for Image Captioning.

Experiment setup

Please refer to M2 Transformer.

Data preparation

  • Annotation. Download the annotation file annotation.zip. Extract it and put it in the project root directory.
  • Feature. You can download our ResNeXt-101 features (hdf5 file) here. Access code: jcj6.
  • Evaluation. Download the evaluation tools here. Access code: jcj6. Extract them and put them in the project root directory.

There are five kinds of keys in our .hdf5 file:

  • ['%d_features' % image_id]: region features (N_regions, feature_dim)
  • ['%d_boxes' % image_id]: bounding box of region features (N_regions, 4)
  • ['%d_size' % image_id]: size of original image (for normalizing bounding box), (2,)
  • ['%d_grids' % image_id]: grid features (N_grids, feature_dim)
  • ['%d_mask' % image_id]: geometric alignment graph, (N_regions, N_grids)

We extract features with the code in grid-feats-vqa.

The first three keys can be obtained when extracting region features with extract_region_feature.py. The fourth key can be obtained when extracting grid features with the code in grid-feats-vqa. The last key can be obtained with align.ipynb.
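
For reference, the keys for one image can be read with h5py roughly as follows (a minimal sketch; the image id below is a placeholder):

import h5py

image_id = 391895  # placeholder COCO image id
with h5py.File('./data/coco_all_align.hdf5', 'r') as f:
    regions = f['%d_features' % image_id][()]  # (N_regions, feature_dim)
    boxes = f['%d_boxes' % image_id][()]       # (N_regions, 4)
    size = f['%d_size' % image_id][()]         # (2,), original image size
    grids = f['%d_grids' % image_id][()]       # (N_grids, feature_dim)
    mask = f['%d_mask' % image_id][()]         # (N_regions, N_grids), alignment graph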

Training

python train.py --exp_name dlct --batch_size 50 --head 8 --features_path ./data/coco_all_align.hdf5 --annotation annotation --workers 8 --rl_batch_size 100 --image_field ImageAllFieldWithMask --model DLCT --rl_at 17 --seed 118

Evaluation

python eval.py --annotation annotation --workers 4 --features_path ./data/coco_all_align.hdf5 --model_path path_of_model_to_eval --model DLCT --image_field ImageAllFieldWithMask --grid_embed --box_embed --dump_json gen_res.json --beam_size 5

Important args:

  • --features_path: path to the hdf5 feature file
  • --model_path: path to the model checkpoint to evaluate
  • --dump_json: path of the JSON file to dump generated captions to

A pretrained model is available here. Access code: jcj6. By evaluating the pretrained model, you will get

{'BLEU': [0.8136727001615207, 0.6606095421082421, 0.5167535314080227, 0.39790755018790197], 'METEOR': 0.29522868252436046, 'ROUGE': 0.5914367650104326, 'CIDEr': 1.3382047139781112, 'SPICE': 0.22953477359195887}

References

[1] M2

[2] grid-feats-vqa

[3] butd

Acknowledgements

Thanks to the original M2 and the amazing work of grid-feats-vqa.

Comments
  • Error reported during the evaluation stage

    Error reported during the evaluation stage

    Hello author, sorry to bother you. When running your code on a server, I got the following error during the evaluation stage. Have you encountered it? Looking forward to your reply!

    Traceback (most recent call last):
      File "train.py", line 353, in <module>
        scores = evaluate_metrics(model, dict_dataloader_val, text_field)
      File "train.py", line 62, in evaluate_metrics
        out, _ = model.beam_search(images, 20, text_field.vocab.stoi['<eos>'], 5, out_size=1,
      File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/captioning_model.py", line 70, in beam_search
        return bs.apply(visual, out_size, return_probs, **kwargs)
      File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 71, in apply
        visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
      File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 121, in iter
        self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
      File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/containers.py", line 30, in apply_to_states
        self._buffers[name] = fn(self._buffers[name])
      File "/mnt/hdd1/alluser/yanjialuo/image-captioning-DLCT/models/beam_search/beam_search.py", line 26, in fn
        s = torch.gather(s.view(*([self.b_s, cur_beam_size] + shape[1:])), 1,
    RuntimeError: gather_out_cuda(): Expected dtype int64 for index

    opened by yanjialele 6
  • Can you upload resources to another cloud like Gdrive,  Onedrive or Dropbox?

    Can you upload resources to another cloud like Gdrive, Onedrive or Dropbox?

    In my country, the download speed from Baidu is very slow and I can't download the needed resources. Can you please upload them to GDrive, OneDrive, or Dropbox? Thank you!

    opened by khiemledev 5
  • RuntimeError: stack expects each tensor to be equal size

    RuntimeError: stack expects each tensor to be equal size

    Hello, are the feature dimensions extracted with https://github.com/facebookresearch/grid-feats-vqa fixed? The dimensions of the features I extracted with [extract_grid_feature.py] differ from image to image, so when training with a batch size I get: RuntimeError: stack expects each tensor to be equal size, but got [1, 2048, 18, 32] at entry 0 and [1, 2048, 19, 29] at entry 1
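
    A possible workaround (a hedged sketch, not necessarily the authors' exact pipeline) is to pool each variable-size feature map to a fixed spatial grid before flattening, so all samples share the same N_grids and can be stacked into a batch:

    import torch
    import torch.nn.functional as F

    feat = torch.randn(1, 2048, 18, 32)            # one variable-size extractor output
    pooled = F.adaptive_avg_pool2d(feat, (7, 7))   # fixed (1, 2048, 7, 7)
    grids = pooled.flatten(2).transpose(1, 2)      # (1, 49, 2048): N_grids x feature_dim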

    opened by competent-s 4
  • TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

    TypeError: __init__() got an unexpected keyword argument 'train_on_pred_boxes'

    /usr/local/lib/python3.7/dist-packages/detectron2/config/config.py in wrapped(self, *args, **kwargs)
        188     if _called_with_cfg(*args, **kwargs):
        189         explicit_args = _get_args_from_config(from_config_func, *args, **kwargs)
    --> 190         init_func(self, **explicit_args)
        191     else:
        192         init_func(self, *args, **kwargs)

    TypeError: init() got an unexpected keyword argument 'train_on_pred_boxes'

    Hello author, while reproducing your paper I hit this error when running the feature-extraction code. I searched for related information but could not solve it.

    opened by hwbhwbgao 4
  • About test performance

    About test performance

    Hello, thanks for open-sourcing your code. When I train this model from scratch, the training loss goes down, but the test performance does not change during training.

    opened by wanboyang 4
  • About features on Baiduyun disk

    About features on Baiduyun disk

    Hello, could you explain the features shared on the Baidu net disk? Is coco_all_align.hdf5 inside the zip file? And what are the files ending with z01, z02, z03? I have tried to extract the zip file, but it fails.

    opened by john2019-warwick 3
  • Region feature extraction

    Region feature extraction

    I tried to use your code to build the feature set myself, but I still have some questions. region_after and region_before are taken from proposal_box_features and proposal_box_features1 (mean), respectively; what is the difference between them? Also, I could not find the corresponding region yaml files under https://github.com/facebookresearch/grid-feats-vqa. Could you share your X-101-region.yaml and X-152-region.yaml files?

    opened by HN123-123 2
  • A code bug

    A code bug

    Traceback (most recent call last):
      File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 335, in <module>
        scores = evaluate_metrics(model, dict_dataloader_val, text_field)
      File "/mnt/Pycharm_Remote/DLCT_test/train.py", line 61, in evaluate_metrics
        **{'boxes': boxes, 'grids': grids, 'masks': masks})
      File "/mnt/Pycharm_Remote/DLCT_test/models/captioning_model.py", line 70, in beam_search
        return bs.apply(visual, out_size, return_probs, **kwargs)
      File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 71, in apply
        visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
      File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 121, in iter
        self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
      File "/mnt/Pycharm_Remote/DLCT_test/models/containers.py", line 30, in apply_to_states
        self._buffers[name] = fn(self._buffers[name])
      File "/mnt/Pycharm_Remote/DLCT_test/models/beam_search/beam_search.py", line 27, in fn
        beam.expand(*([self.b_s, self.beam_size] + shape[1:])))
    RuntimeError: gather_out_cuda(): Expected dtype int64 for index

    The beam index comes from "selected_beam = selected_idx / candidate_logprob.shape[-1]", so it is a float tensor, but torch.gather requires an int64 index.
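
    A minimal sketch of a possible patch for models/beam_search/beam_search.py (an assumption, not the authors' official fix): floor division keeps the index as int64, which torch.gather accepts.

    import torch

    vocab_size = 10201                          # placeholder vocabulary size
    selected_idx = torch.tensor([[17, 10245]])  # flat indices over beam_size * vocab_size
    # '//' (or torch.div with rounding_mode='floor') keeps the beam index int64
    selected_beam = torch.div(selected_idx, vocab_size, rounding_mode='floor')
    selected_words = selected_idx - selected_beam * vocab_size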

    opened by GX77 2
  • how are alignment graph obtained for new datasets

    how are alignment graph obtained for new datasets

    Hi, in your code the h5py features file has keys like ['%d_features' % image_id], ['%d_grids' % image_id], ['%d_boxes' % image_id], ['%d_size' % image_id], and ['%d_mask' % image_id]. If I have a new dataset, can I just use align.py to get the geometric alignment graph after obtaining the grid and region features with extract_region_feature.py and grid-feats-vqa?
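
    For illustration only, a hedged sketch of how a region-grid geometric alignment mask could be built from bounding boxes and a uniform G x G grid; align.ipynb may differ in details such as the overlap test or thresholds.

    import numpy as np

    def alignment_mask(boxes, image_w, image_h, G=7):
        # boxes: (N_regions, 4) array of (x1, y1, x2, y2) in pixel coordinates
        mask = np.zeros((len(boxes), G * G), dtype=np.float32)
        cell_w, cell_h = image_w / G, image_h / G
        for i, (x1, y1, x2, y2) in enumerate(boxes):
            for gy in range(G):
                for gx in range(G):
                    cx1, cy1 = gx * cell_w, gy * cell_h
                    cx2, cy2 = cx1 + cell_w, cy1 + cell_h
                    # mark grid cell (gx, gy) if it intersects the region box
                    if x1 < cx2 and cx1 < x2 and y1 < cy2 and cy1 < y2:
                        mask[i, gy * G + gx] = 1.0
        return mask  # (N_regions, N_grids)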

    opened by Davidwdq 1
  • Can you share the pretrained model?

    Can you share the pretrained model?

    It seems that I have managed to combine the code to generate all the needed features. I want to test it on my custom images without re-training the network. Could you please provide the pretrained checkpoint?

    opened by hcl14 1
  • What is the relation between X-101 and X-152 extractors in the code?

    What is the relation between X-101 and X-152 extractors in the code?

    Your script extract_region_feature.py has weights for X-101 hard-coded, but the file with features is named "region_before_X152.hdf5". Also, there is no information on which checkpoint to use for extracting grid features. In the paper you mention both X-101 and X-152 as extractors.

    Which extractor checkpoint should I use for grid-feats-vqa: X-101 or X-152? Can I use X-152 for region features as well?

    opened by hcl14 1
  • Unable to download the pretrained model

    Unable to download the pretrained model

    It is not possible to download the trained pth file from pan.baidu.com. It seems we need to install some kind of download software (the Baidu page is in Chinese, so it is hard to understand how to proceed).

    Could you please host the files in a more user-friendly repository? Like osf.io or Dropbox?

    Thx

    opened by luileito 0
  • The Code to generate the caption of a figure.

    The Code to generate the caption of a figure.

    Thank you so much for sharing your brilliant code with us. Could you also share the code to generate a caption for a single image, like Figure 1 in your article? I would appreciate it if you could share it with me! ^_^

    opened by z972778371 1