Meshed-Memory Transformer for Image Captioning. CVPR 2020

Overview

M²: Meshed-Memory Transformer

This repository contains the reference code for the paper Meshed-Memory Transformer for Image Captioning (CVPR 2020).

Please cite with the following BibTeX:

@inproceedings{cornia2020m2,
  title={{Meshed-Memory Transformer for Image Captioning}},
  author={Cornia, Marcella and Stefanini, Matteo and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Meshed-Memory Transformer

Environment setup

Clone the repository and create the m2release conda environment using the environment.yml file:

conda env create -f environment.yml
conda activate m2release

Then download spacy data by executing the following command:

python -m spacy download en

Note: Python 3.6 is required to run our code.

Data preparation

To run the code, annotations and detection features for the COCO dataset are needed. Please download the annotations file annotations.zip and extract it.

Detection features are computed with the code provided by [1]. To reproduce our results, please download the COCO features file coco_detections.hdf5 (~53.5 GB), in which the detections of each image are stored under the <image_id>_features key. <image_id> is the id of each COCO image, without leading zeros (e.g. the <image_id> for COCO_val2014_000000037209.jpg is 37209), and each value should be a (N, 2048) tensor, where N is the number of detections.
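
As a quick sanity check, the file can be inspected with h5py (a minimal sketch assuming the layout described above; image id 37209 is just the example mentioned before):

import h5py

# Open the precomputed detection features and look up one image by its COCO id
# (here 37209, i.e. COCO_val2014_000000037209.jpg, as in the example above).
with h5py.File('coco_detections.hdf5', 'r') as f:
    feats = f['37209_features'][()]   # numpy array of shape (N, 2048), dtype float32
    print(feats.shape, feats.dtype)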

Evaluation

To reproduce the results reported in our paper, download the pretrained model file meshed_memory_transformer.pth and place it in the code folder.

Run python test.py using the following arguments:

| Argument | Possible values |
|----------|-----------------|
| --batch_size | Batch size (default: 10) |
| --workers | Number of workers (default: 0) |
| --features_path | Path to detection features file |
| --annotation_folder | Path to folder with COCO annotations |
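
For example (an illustrative command; replace the paths with your local copies of the detection features file and the annotation folder):

python test.py --batch_size 10 --features_path /path/to/coco_detections.hdf5 --annotation_folder /path/to/annotations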

Expected output

Under output_logs/, you may also find the expected output of the evaluation code.

Training procedure

Run python train.py using the following arguments:

| Argument | Possible values |
|----------|-----------------|
| --exp_name | Experiment name |
| --batch_size | Batch size (default: 10) |
| --workers | Number of workers (default: 0) |
| --m | Number of memory vectors (default: 40) |
| --head | Number of heads (default: 8) |
| --warmup | Warmup value for learning rate scheduling (default: 10000) |
| --resume_last | If used, the training will be resumed from the last checkpoint |
| --resume_best | If used, the training will be resumed from the best checkpoint |
| --features_path | Path to detection features file |
| --annotation_folder | Path to folder with COCO annotations |
| --logs_folder | Path to folder for tensorboard logs (default: "tensorboard_logs") |

For example, to train our model with the parameters used in our experiments, use

python train.py --exp_name m2_transformer --batch_size 50 --m 40 --head 8 --warmup 10000 --features_path /path/to/features --annotation_folder /path/to/annotations

Sample Results

(See results.png in the images folder of the repository for qualitative examples.)

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

Comments
  • Generating HDF5 detections from custom dataset or bottom-up-attention TSV


    I have a custom dataset, and I have generated the detection TSV using https://github.com/airsplay/py-bottom-up-attention, but the model requires HDF5.

    The TSV has these fields for each example:

    {
       'image_id': image_id,
       'image_h': np.size(im, 0),
       'image_w': np.size(im, 1),
       'num_boxes' : len(keep_boxes),
       'boxes': base64.b64encode(cls_boxes[keep_boxes]),
       'features': base64.b64encode(pool5[keep_boxes])
    }  
    

    When examining the coco dataset examples I see the following for example:

    >>> dts["35368_boxes"]
    <HDF5 dataset "35368_boxes": shape (37, 4), type "<f4">
    >>> dts["35368_features"]
    <HDF5 dataset "35368_features": shape (37, 2048), type "<f4">
    >>> dts["35368_cls_prob"]
    <HDF5 dataset "35368_cls_prob": shape (37, 1601), type "<f4">
    
    >>> dts["35368_boxes"][36]
    array([349.57147, 154.07967, 420.0327 , 408.64462], dtype=float32)
    

    I'll try to figure out from the code how to convert my TSV to the required HDF5 myself, but a guide would be appreciated.

    Thank you.
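
    In case it helps, here is a minimal conversion sketch (an illustration only, not the authors' tooling; it assumes the TSV columns listed above, with boxes and features base64-encoding float32 arrays of shapes (num_boxes, 4) and (num_boxes, 2048)):

    import base64
    import csv
    import sys

    import h5py
    import numpy as np

    FIELDNAMES = ['image_id', 'image_h', 'image_w', 'num_boxes', 'boxes', 'features']
    csv.field_size_limit(sys.maxsize)  # the base64-encoded feature column is large

    with open('detections.tsv') as tsv, h5py.File('detections.hdf5', 'w') as out:
        for row in csv.DictReader(tsv, delimiter='\t', fieldnames=FIELDNAMES):
            image_id = int(row['image_id'])
            n = int(row['num_boxes'])
            # Decode the base64 blobs back into float32 arrays and store them under the
            # <image_id>_boxes / <image_id>_features keys expected by the data loader.
            boxes = np.frombuffer(base64.b64decode(row['boxes']), dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row['features']), dtype=np.float32).reshape(n, 2048)
            out['%d_boxes' % image_id] = boxes
            out['%d_features' % image_id] = feats

    Note that the provided coco_detections.hdf5 also contains a <image_id>_cls_prob dataset, which this sketch does not produce; if your pipeline needs it, store the detector's class probabilities in the same way.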

    opened by SandroJijavadze 10
  • Feature for COCO on-line Test images


    Thanks for sharing the code of this brilliant work!

    I'm wondering whether it would be possible to make the detection file for the COCO online test images available, like the train/val HDF5 file. Or is there an available online resource that I did not spot?

    Thanks in advance!

    opened by Jian-Xi 5
  • Unable to replicate results after retraining


    Hello and thank you for this fantastic repo!

    I am trying to retrain your model using COCO features I have extracted myself using the bottom-up attention repo as you have suggested in #2. I am currently on epoch 15 and the highest CIDEr score on the test set has been 1.13. This is much less than the 1.31 that I get when using your pretrained model. Other than the new features, I am using your default values for all hyperparameters.

    Could you give me some guidance in order to better replicate your results?

    opened by aemrey 5
  • visualize the attention part like results.png


    I really appreciate that you offer such good work to everyone. I am very interested in how to visualize the attention over an image, like results.png in the images folder. Thanks a lot!

    opened by jkllbn2563 4
  • Flag for CPU-only evaluation


    I tried running the evaluation script test.py with the required arguments, but it failed with the following error:

    Meshed-Memory Transformer Evaluation
    Traceback (most recent call last):
      File "test.py", line 69, in <module>
        model = Transformer(text_field.vocab.stoi['<bos>'], encoder, decoder).to(device)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 386, in to
        return self._apply(convert)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
        module._apply(fn)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
        module._apply(fn)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 193, in _apply
        module._apply(fn)
      [Previous line repeated 3 more times]
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 199, in _apply
        param.data = fn(param.data)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/nn/modules/module.py", line 384, in convert
        return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/cuda/__init__.py", line 162, in _lazy_init
        _check_driver()
      File "/home/tarl-group/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/cuda/__init__.py", line 82, in _check_driver
        http://www.nvidia.com/Download/index.aspx""")
    AssertionError: 
    Found no NVIDIA driver on your system. Please check that you
    have an NVIDIA GPU and installed a driver from
    http://www.nvidia.com/Download/index.aspx
    

    It looks like the code uses CUDA somewhere. Does the code support CPU-only execution?
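
    A possible workaround (not an official flag of this repository) is to select the device dynamically instead of hard-coding 'cuda', so that evaluation can at least start on CPU-only machines:

    import torch

    # Fall back to CPU when no CUDA device/driver is available.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # then, as in the traceback above:
    # model = Transformer(text_field.vocab.stoi['<bos>'], encoder, decoder).to(device)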

    opened by Akashtyagi 4
  • evaluation error


    When I run the evaluation script test.py, it reports the following error:

    RuntimeError: Expected object of scalar type Byte but got scalar type Bool for sequence element 1 in sequence argument at position #1 'tensors'

    Is there any mistake in evaluation?
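
    For context, this error typically appears when running the code with a newer PyTorch than the one pinned in environment.yml: uint8 and bool mask tensors end up mixed in the same torch.cat/torch.stack call. A minimal illustration of the mismatch and of one workaround (casting every mask to a single dtype):

    import torch

    a = torch.zeros(1, 3, dtype=torch.uint8)
    b = torch.zeros(1, 3, dtype=torch.bool)
    # torch.cat([a, b], dim=0)        # raises the Byte/Bool error on affected versions
    torch.cat([a.bool(), b], dim=0)   # using one dtype for all masks avoids it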

    opened by lixiangpengcs 4
  • From features of new images to M2 transformer


    First of all, congrats on your work and thanks for releasing the code! 😄

    Following #2 and #5, I'm trying to run the network on a new set of images. To get the image features I went to the bottom-up attention repo you suggested here, using the Faster-R-CNN-ResNet101 model with these weights.

    My problem is the following: how to transform the outputs of this feature extractor into the format you require?

    Following the README and code, I understand that you need to express the features as an Nx2048 tensor. Following this line, I understand that you also need a cls_prob vector to sort the feature vectors.

    Now, I took the blob res5c for the features and cls_prob for the probabilities, but the dimensions are not quite as I expected. res5c has dimension Nx2048x14x14, so I guess the 14x14 grid should be reduced to a single value per channel. And cls_prob has dimension Nx1061, which is not consistent with the rest.

    Am I missing something?

    Thanks!
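
    One guess (not confirmed by the authors): the 2048-d region features are obtained by average-pooling the per-region res5c map over its 14x14 spatial grid, and the cls_prob size depends on the detector checkpoint's class vocabulary (1601 for the Visual Genome model of [1], so 1061 suggests a different checkpoint). A sketch of the pooling, with illustrative shapes:

    import numpy as np

    res5c = np.random.rand(10, 2048, 14, 14).astype(np.float32)  # hypothetical per-region blob
    features = res5c.mean(axis=(2, 3))                           # (N, 2048) pooled features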

    opened by alesolano 3
  • questions regarding the paper


    Hello. Congratulations on your brilliant work! I'd like to ask some questions regarding the paper: In section 4.3, you mentioned that

    we firstly introduce a reduced version of our approach in which the i-th decoder layer is only connected to the corresponding i-th encoder layer (1-to-1), instead of being connected to all encoders.

    Is that step included in your code? As far as I can see, the query is the same for all the visual attention operations in the decoder, since you are doing:

            enc_att1 = self.enc_att(self_att, enc_output[:, 0], enc_output[:, 0], mask_enc_att) * mask_pad
            enc_att2 = self.enc_att(self_att, enc_output[:, 1], enc_output[:, 1], mask_enc_att) * mask_pad
            enc_att3 = self.enc_att(self_att, enc_output[:, 2], enc_output[:, 2], mask_enc_att) * mask_pad
    

    Do you mean that you set the alphas to 1 (i.e. simply take the sum of all encoder layers)? Because if the i-th decoder layer were connected to the i-th encoder layer, the queries would be different. And may I also ask whether you have examined the importance of taking a weighted sum rather than a plain sum of the encoder layers?
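
    For context, a schematic (not the repository's exact code) of how the full meshed connectivity described in the paper combines the three cross-attention results through learned sigmoid gates, rather than a plain sum:

    import torch
    import torch.nn as nn

    d_model, n_enc = 512, 3

    # One gating layer per encoder level, fed with the concatenation of the decoder
    # self-attention output and that level's cross-attention result (names illustrative).
    fc_alphas = nn.ModuleList([nn.Linear(2 * d_model, d_model) for _ in range(n_enc)])

    def meshed_cross_attention(self_att, enc_atts):
        # self_att: (b, seq, d_model); enc_atts: list of n_enc cross-attention outputs
        out = 0
        for enc_att, fc_alpha in zip(enc_atts, fc_alphas):
            alpha = torch.sigmoid(fc_alpha(torch.cat([self_att, enc_att], dim=-1)))
            out = out + alpha * enc_att           # element-wise gated contribution
        return out / (len(enc_atts) ** 0.5)       # scaled sum over the encoder levels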

    opened by fawazsammani 3
  • Learned a priori knowledge & New dataset which is very different from MSCOCO


    Hi, in the paper you mention that the encoder "encodes relationships between image regions exploiting learned a priori knowledge". I am confused about this. Does the learned a priori knowledge exist before you train the model? In which part of the code is the learned a priori knowledge provided as input? And how would one obtain it for a new dataset that is very different from MSCOCO?
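
    For context: the "learned a priori knowledge" refers to the memory slots of the memory-augmented attention. They are ordinary learnable parameters (the --m memory vectors), initialized randomly and trained end-to-end with the rest of the model, so nothing external has to be supplied for a new dataset. A schematic sketch (illustrative shapes and names, not the repository's exact module):

    import torch
    import torch.nn as nn

    class MemoryAugmentedAttentionSketch(nn.Module):
        """Keys/values are extended with m learnable memory slots: the 'a priori knowledge'."""
        def __init__(self, d_model=512, m=40, n_heads=8):
            super().__init__()
            self.mem_k = nn.Parameter(torch.randn(1, m, d_model) / d_model ** 0.5)
            self.mem_v = nn.Parameter(torch.randn(1, m, d_model) / d_model ** 0.5)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

        def forward(self, x):                           # x: (batch, regions, d_model)
            b = x.size(0)
            k = torch.cat([x, self.mem_k.expand(b, -1, -1)], dim=1)
            v = torch.cat([x, self.mem_v.expand(b, -1, -1)], dim=1)
            out, _ = self.attn(x, k, v)
            return out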

    opened by XuMengyaAmy 2
  • Beamsearch


    Hello, thanks for your work. (1) Why is the input at timestep x only the array of words generated at that step, with shape [5, 1] (beam size 5), rather than the whole generated sequence [5, x], from which the log-probability of the last word is then taken? (2) With the beam search code in your work, decoding only stops after running all timesteps; I think this is not reasonable, since some of the generated sentences have already finished.

    These problems came up in our own work when using your beam search code.

    Please help me, thanks!

    opened by y78h11b09 2
  • Confirming Training Time/Memory Information


    Hello, thank you for your great work!

    I just wanted to confirm some simple information about training time and memory usage, as I didn't see them in the paper/repo, and I wanted to make sure that the code is running correctly on my machine.

    I am running on a single V100 with your parameters: --batch_size 50 --m 40 --head 8. I find that this consumes around 6GB of GPU memory and that each epoch takes around 3 hours (so around 30 epochs should take around ~90h = ~4 days). Does this match your training time/memory usage?

    I see from the paper that you are training with a (single?) 2080TI, and I see in the code that you stop training dynamically using the patience variable. Do you know how many epochs it took for you to stop training on your final run (around 130 CIDEr) and how long this took?

    Thank you again for your work!

    opened by lukemelas 2
  • Vocabulary of the test split


    Hi! Thanks for the paper and the available code.

    I have what may be a stupid question, but I didn't find a straight answer to it anywhere:

    When evaluating the model with the Karpathy test split, some words might not be present in the vocabulary built from the train split. What do you do? Simply remove these words from the captions of the test split?

    opened by gondimjoaom 0
  • Ensemble problem


    Thanks for this amazing work! My confusion when reading the paper is about how the ensemble trick is used. In my opinion, an ensemble of multiple models means training multiple independent M2 Transformers with different seeds and averaging the final predictions at inference time.
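
    For what it's worth, that reading matches common practice: train several independently seeded models and average their per-step word probabilities during decoding. A minimal sketch of the averaging step (the method step() and its arguments are hypothetical placeholders, not this repository's API):

    import torch

    def ensemble_word_logprobs(models, *step_args):
        # Average the per-timestep word probabilities of independently trained models,
        # then return to log-space for the beam search scoring.
        probs = torch.stack([model.step(*step_args).exp() for model in models])
        return probs.mean(dim=0).log()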

    opened by DeidaraYang 0
  • no file found


    File "train.py", line 317, in }, 'saved_models/%s_last.pth' % args.exp_name) File "/media/lianjunliang/liansang/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/serialization.py", line 376, in save with _open_file_like(f, 'wb') as opened_file: File "/media/lianjunliang/liansang/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/serialization.py", line 230, in _open_file_like return _open_file(name_or_buffer, mode) File "/media/lianjunliang/liansang/anaconda3/envs/m2release/lib/python3.6/site-packages/torch/serialization.py", line 211, in init super(_open_file, self).init(open(name, mode)) FileNotFoundError: [Errno 2] No such file ordirectory:'saved_models/m2_transformer_last.pth'

    opened by MrLianSYSU 0
  • FileNotFoundError: [Errno 2] No such file or directory: 'java'


    Meshed-Memory Transformer Evaluation
    Evaluation:   0%|          | 0/500 [00:00<?, ?it/s]
    /home/mmc_xiaolinhui/mmc_15_exp_202206/meshed-memory-transformer/models/transformer/attention.py:62: UserWarning: masked_fill_ received a mask with dtype torch.uint8, this behavior is now deprecated, please use a mask with dtype torch.bool instead. (Triggered internally at ../aten/src/ATen/native/cuda/Indexing.cu:967.)
      att = att.masked_fill(attention_mask, -np.inf)
    Evaluation: 100%|██████████| 500/500 [01:25<00:00, 5.84it/s]
    Traceback (most recent call last):
      File "test.py", line 79, in <module>
        scores = predict_captions(model, dict_dataloader_test, text_field)
      File "test.py", line 34, in predict_captions
        gts = evaluation.PTBTokenizer.tokenize(gts)
      File "/home/mmc_xiaolinhui/mmc_15_exp_202206/meshed-memory-transformer/evaluation/tokenizer.py", line 51, in tokenize
        p_tokenizer = subprocess.Popen(cmd, cwd=path_to_jar_dirname,
      File "/home/mmc_xiaolinhui/anaconda3/envs/pytorch_clip/lib/python3.8/subprocess.py", line 858, in __init__
        self._execute_child(args, executable, preexec_fn, close_fds,
      File "/home/mmc_xiaolinhui/anaconda3/envs/pytorch_clip/lib/python3.8/subprocess.py", line 1704, in _execute_child
        raise child_exception_type(errno_num, err_msg, err_filename)
    FileNotFoundError: [Errno 2] No such file or directory: 'java'

    opened by linhuixiao 2
  • RuntimeError: gather(): Expected dtype int64 for index, in beam_search/beam_search.py, line 26, in fn


    Meshed-Memory Transformer Evaluation
    Evaluation:   0%| | 0/500 [00:00<?, ?it/s]

    Traceback (most recent call last):
      File "test.py", line 78, in <module>
        scores = predict_captions(model, dict_dataloader_test, text_field)
      File "test.py", line 26, in predict_captions
        out, _ = model.beam_search(images, 20, text_field.vocab.stoi['<eos>'], 5, out_size=1)
      File "/home/lhxiao/pcl_experiment_202203/meshed-memory-transformer/models/captioning_model.py", line 70, in beam_search
        return bs.apply(visual, out_size, return_probs, **kwargs)
      File "/home/lhxiao/pcl_experiment_202203/meshed-memory-transformer/models/beam_search/beam_search.py", line 71, in apply
        visual, outputs = self.iter(t, visual, outputs, return_probs, **kwargs)
      File "/home/lhxiao/pcl_experiment_202203/meshed-memory-transformer/models/beam_search/beam_search.py", line 121, in iter
        self.model.apply_to_states(self._expand_state(selected_beam, cur_beam_size))
      File "/home/lhxiao/pcl_experiment_202203/meshed-memory-transformer/models/containers.py", line 30, in apply_to_states
        self._buffers[name] = fn(self._buffers[name])
      File "/home/lhxiao/pcl_experiment_202203/meshed-memory-transformer/models/beam_search/beam_search.py", line 26, in fn
        s = torch.gather(s.view(*([self.b_s, cur_beam_size] + shape[1:])), 1,
    RuntimeError: gather(): Expected dtype int64 for index
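
    A minimal illustration of the error and of one workaround reported for newer PyTorch versions (casting the gather index to int64); the exact tensors in beam_search.py may differ:

    import torch

    s = torch.randn(2, 5, 8)
    idx = torch.zeros(2, 1, 8, dtype=torch.int32)
    # torch.gather(s, 1, idx)             # RuntimeError: gather(): Expected dtype int64 for index
    out = torch.gather(s, 1, idx.long())  # casting the index resolves it

    In the repository this likely means adding a .long() cast to the beam index built in models/beam_search/beam_search.py around the line shown in the traceback.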

    opened by linhuixiao 1