Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]

Introduction

This repository is for X-Linear Attention Networks for Image Captioning (CVPR 2020). The original paper can be found here.

Please cite with the following BibTeX:

@inproceedings{xlinear2020cvpr,
  title={X-Linear Attention Networks for Image Captioning},
  author={Pan, Yingwei and Yao, Ting and Li, Yehao and Mei, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Requirements

Data preparation

  1. Download the bottom-up features and convert them to npz files (a Python 3 conversion sketch follows this list):

python2 tools/create_feats.py --infeats bottom_up_tsv --outfolder ./mscoco/feature/up_down_10_100

  2. Download the annotations into the mscoco folder. For more details about data preparation, refer to self-critical.pytorch.

  3. Download coco-caption and set up the path __C.INFERENCE.COCO_PATH in lib/config.py.

  4. The pretrained models and results can be downloaded here.

  5. The pretrained SENet-154 model can be downloaded here.
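
The conversion script targets Python 2; under Python 3 it raises _csv.Error: iterator should return strings, not bytes (see the first comment below). As a rough guide, here is a minimal Python 3 sketch of the conversion, assuming the standard bottom-up-attention TSV layout (image_id, image_w, image_h, num_boxes, boxes, features, with base64-encoded float32 arrays); the input file name and the npz key are assumptions, not the repository's exact conventions:

    import base64
    import csv
    import sys

    import numpy as np

    csv.field_size_limit(sys.maxsize)  # feature columns exceed csv's default field limit
    FIELDS = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

    # open in text mode ('r'), which is what Python 3's csv module expects
    with open('bottom_up.tsv', 'r') as f:  # hypothetical file name
        for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDS):
            num_boxes = int(item['num_boxes'])
            # each row stores a base64-encoded (num_boxes x 2048) float32 matrix
            feats = np.frombuffer(base64.b64decode(item['features']),
                                  dtype=np.float32).reshape(num_boxes, -1)
            np.savez_compressed(
                './mscoco/feature/up_down_10_100/%s.npz' % item['image_id'],
                feat=feats)  # the 'feat' key name is an assumption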

Training

Train X-LAN model

bash experiments/xlan/train.sh

Train X-LAN model using self-critical training

Copy the pretrained model into experiments/xlan_rl/snapshot and run the script

bash experiments/xlan_rl/train.sh

Train X-LAN transformer model

bash experiments/xtransformer/train.sh

Train X-LAN transformer model using self-critical training

Copy the pretrained model into experiments/xtransformer_rl/snapshot and run the script

bash experiments/xtransformer_rl/train.sh
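
The train.sh scripts launch multi-GPU distributed training (four GPUs in the provided configuration, per the comments below). To restrict training to specific GPUs, setting CUDA_VISIBLE_DEVICES before the script should work, assuming the launcher's process count is adjusted to match (the tracebacks in the comments suggest torch.distributed.launch is used, whose --nproc_per_node must equal the number of visible GPUs). When running several jobs on one machine, give each its own rendezvous port (e.g. --master_port) to avoid the "Address already in use" error. A hypothetical invocation for GPUs 6 and 7:

CUDA_VISIBLE_DEVICES=6,7 bash experiments/xlan/train.sh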

Evaluation

CUDA_VISIBLE_DEVICES=0 python3 main_test.py --folder experiments/model_folder --resume model_epoch
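
For example, to evaluate the released X-LAN checkpoint from epoch 47 on GPU 0 (the same invocation that appears in the test log in the comments below):

CUDA_VISIBLE_DEVICES=0 python3 main_test.py --folder experiments/xlan --resume 47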

Acknowledgements

Thanks to self-critical.pytorch and the awesome PyTorch team for their contributions.

Comments
  • _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Hello, thank you for your work and the code. When I run python3 tools/create_feats.py --infeats bottom_up_tsv --outfolder ./mscoco/feature/up_down_10_100, I get this error: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?). I am running Python 3.7. Please help me out.

    opened by Tushar-Faroque 4
  • How to generate the visualization of attended image regions along the caption generation processes.

I have a question about how to generate the visualization of attended image regions along the caption generation process. Would you mind releasing some code?

    opened by Archer-Fang 3
  • Problems about the provided annotations file

Hello, thanks for this good work. I think I downloaded and placed the annotations file correctly, because I can read the captions and image_ids out of the data. The problem I met is a KeyError in coco-caption/pycocotools/coco.py when running if self.dataset['type'] == 'instances':. This suggests that the dict self.dataset, read from ./mscoco/misc/captions_val5k.json, should have the key "type", but it doesn't. Please help!

    opened by jlxy 3
  • Reg. Training time

    Hi,

    Thanks for sharing your code here.

Can you please tell us what type of GPUs you trained your model on, how long one epoch took, and how many epochs you ran in total?

Regards, Deepak Mittal

    opened by deepak242424 3
  • transformer results

Hello. Thanks for your work and for sharing the code. Could you please tell me the details of the pure Transformer model you implemented that achieves 128.3 CIDEr? To the best of my knowledge, all implementations achieve a maximum of around 126.6, according to the papers that use the Transformer model. Your paper does not provide details on the Transformer, and there is no supplementary material. So may I kindly ask for the details of your re-implementation of the pure Transformer that achieves 128.3?

    opened by homelifes 3
  • Test results are all 0 when using the author's checkpoint

1. I used the provided caption_model_47.pth from the xlan experiment and the following command to run the test. However, all test metrics are 0 when I use the decode_beam which is inspired by meshed-memory-transformer. https://github.com/JDAI-CV/image-captioning/blob/master/models/att_basic_model.py https://github.com/JDAI-CV/image-captioning/blob/master/models/xtransformer.py

And it's the same case for the xlan+transformer test. I feel the decode_beam (from m2transformer) has problems. Do you have the same issue? Can you help solve it? Thanks.

What I changed in decode_beam(self, **kwargs) is adding .long() after some variables because of type errors. 2) What's inside coco_train_cider.pkl? How can I generate this file for a custom dataset? 3) I also noticed that you use input_seq and target_seq. From my understanding, both of them are the ground-truth caption represented by word indices, correct? The only difference is the 0 and -1 at the end. Can you provide the preprocessing code that converts the ground-truth captions to word indices? (A hypothetical sketch of such an encoding appears after the comments list below.)

Test results:

    python3.8 main_test.py --folder experiments/xlan --resume 47
    Called with args: Namespace(folder='experiments/xlan', resume=47)
    /media/mmlab/data_2/mengya/Code/ImageCaption/image-captioning/lib/config.py:381: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      yaml_cfg = edict(yaml.load(f))
    loading annotations into memory...
    Done (t=0.03s)
    creating index...
    index created!
    139it [00:11, 12.16it/s]
    Loading and preparing results...
    DONE (t=0.01s)
    creating index...
    index created!
    tokenization...
    PTBTokenizer tokenized 307085 tokens at 1576551.76 tokens per second.
    PTBTokenizer tokenized 4999 tokens at 130967.27 tokens per second.
    setting up scorers...
    computing Bleu score...
    {'testlen': 0, 'reflen': 42485, 'guess': [0, 0, 0, 0], 'correct': [0, 0, 0, 0]}
    ratio: 2.3537719195009454e-20
    Bleu_1: 0.000
    Bleu_2: 0.000
    Bleu_3: 0.000
    Bleu_4: 0.000
    computing METEOR score...
    METEOR: 0.000
    computing Rouge score...
    ROUGE_L: 0.000
    computing CIDEr score...
    CIDEr: 0.000
    computing SPICE score...
    Parsing reference captions
    Parsing test captions
    SPICE evaluation took: 3.045 s
    SPICE: 0.000

    opened by XuMengyaAmy 2
  • what to set for input_seq and target_seq for train-data -- asked by a beginner in NLP domain :-)

I want to use your sentence decoder to generate captions for my 2D images. For each image, I have a ground-truth caption.

I will encode each word of my captions to an integer using coco_vocabulary.txt; if I am not mistaken, this is what you did with your captions, and the encoded captions were saved as target_seq for the train data in the pkl files: https://github.com/JDAI-CV/image-captioning/blob/master/datasets/coco_dataset.py#L27

My understanding is that target_seq for the train data is a dictionary whose keys are image ids and whose values are encoded captions.

Am I right?

Looking forward to hearing from you!

    Thanks!

    opened by Hengameh400 2
  • How to run multiple training processes

Hi, I am very interested in your code. I successfully ran it on four GPUs as the given example suggests. However, when I tried to run two training processes (each taking two GPUs), I got errors in distributed parallel training.

    File "main.py", line 354, in <module>
       trainer = Trainer(args)
     File "main.py", line 48, in __init__
       self.setup_network()
     File "main.py", line 98, in setup_network
       find_unused_parameters=True
     File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
       self.broadcast_bucket_size)
     File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
       dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
    RuntimeError: NCCL error in: /tmp/pip-req-build-ocx5vxk7/torch/lib/c10d/../c10d/NCCLUtils.hpp:48, invalid argument
    

    and

    Traceback (most recent call last):
      File "main.py", line 352, in <module>
        trainer = Trainer(args)
      File "main.py", line 40, in __init__
        backend="nccl", init_method="env://"
      File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
        store, rank, world_size = next(rendezvous(url))
      File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
        store = TCPStore(master_addr, master_port, world_size, start_daemon)
    RuntimeError: Address already in use
    

    I wonder how I can solve these problems. Thanks! Looking forward to your reply ^_^

    opened by RubickH 2
  • X-transformer model batch_size setting

Hello, did you train all the models on a single GPU? If so: when training the X-transformer model, the default batch_size setting of 40 runs out of memory (1080 Ti, 12 GB), but if we set it to 10, the CIDEr score comes out lower than yours.

    opened by yaopengzero 2
  • How to configure the number of GPUs being used?

    I have a system with 10 GPUs, and when I start training, multiple (4) GPUs are used for processing.

    1. Where can I configure the number of GPUs being used? e.g. if I only have 2 free GPUs of the 10
    2. How can I specify which GPUs are being used? e.g. GPUs 0-5 are occupied, but I'd like to train on GPUs 6 and above.
    opened by nathanielhobbs 1
  • Diversity augmented beam_search for XLAN with group_size set as 1??

Hello, thanks for this great work. When I was reading the code, I found that in models/att_basic_model.py, DBS (diversity-augmented beam search) is recommended for the xlan model.

        # For the experiments of X-LAN, we use the following beam search code, 
        # which achieves slightly better results but much slower.
        
        #def decode_beam(self, **kwargs):
        #    beam_size = kwargs['BEAM_SIZE']
        #    gv_feat, att_feats, att_mask, p_att_feats = self.preprocess(**kwargs)
        #    batch_size = gv_feat.size(0)
        #
        #    sents = Variable(torch.zeros((cfg.MODEL.SEQ_LEN, batch_size), dtype=torch.long).cuda())
        #    logprobs = Variable(torch.zeros(cfg.MODEL.SEQ_LEN, batch_size).cuda())   
        #    self.done_beams = [[] for _ in range(batch_size)]
        #    for n in range(batch_size):
        #        state = self.init_hidden(beam_size)
        #        gv_feat_beam = gv_feat[n:n+1].expand(beam_size, gv_feat.size(1)).contiguous()
        #        att_feats_beam = att_feats[n:n+1].expand(*((beam_size,)+att_feats.size()[1:])).contiguous()
        #        att_mask_beam = att_mask[n:n+1].expand(*((beam_size,)+att_mask.size()[1:]))
        #        p_att_feats_beam = p_att_feats[n:n+1].expand(*((beam_size,)+p_att_feats.size()[1:])).contiguous() if p_att_feats is not None else None
        #
        #        wt = Variable(torch.zeros(beam_size, dtype=torch.long).cuda())
        #        kwargs = self.make_kwargs(wt, gv_feat_beam, att_feats_beam, att_mask_beam, p_att_feats_beam, state, **kwargs)
        #        logprobs_t, state = self.get_logprobs_state(**kwargs)
        #
        #        self.done_beams[n] = self.beam_search(state, logprobs_t, **kwargs)
        #        sents[:, n] = self.done_beams[n][0]['seq'] 
        #        logprobs[:, n] = self.done_beams[n][0]['logps']
        #    return sents.transpose(0, 1), logprobs.transpose(0, 1)
    

However, in models/basic_model.py, where DBS is implemented, the group size is forced to be 1, which means that diversity-augmented beam search degrades to standard beam search.

            beam_size = kwargs['BEAM_SIZE']
            group_size = 1 #kwargs['GROUP_SIZE']
            diversity_lambda = 0.5 #kwargs['DIVERSITY_LAMBDA']
    

So how can DBS with group_size=1 slightly outperform standard BS for xlan, as the comment above mentions? Thanks a bunch!

    opened by ChenYutongTHU 1
  • Vocabulary from test split

Hi! Thanks for the paper and the available code.

I have what may be a stupid question, but I didn't find a straight answer to it anywhere:

When evaluating the model on the Karpathy test split, some words might not be present in the vocabulary built from the train split. What do you do? Simply remove these words from the captions of the test split?

    opened by gondimjoaom 0
  • About the config of the updown model

Thank you very much for making the code open source. I can find 4 configs for xlan and the transformer in the experiments folder. Can you provide the config for the basic (UpDown) model? Thanks so much.

    opened by WangLanxiao 0
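
For readers with the same question as the input_seq/target_seq comments above, here is a hypothetical Python sketch of the encoding those comments describe (targets shifted one step relative to the decoder input, 0 doubling as the start/end token, -1 marking padding that the loss ignores). The sequence length and token conventions are assumptions, not code from this repository:

    import numpy as np

    def encode_caption(words, word2idx, seq_len=17):
        # Hypothetical sketch: turn a tokenized caption into the aligned
        # input_seq / target_seq pair described in the comments above.
        ids = [word2idx[w] for w in words if w in word2idx]  # drop out-of-vocabulary words
        ids = ids[:seq_len - 1]                              # leave room for the end token
        input_seq = np.zeros(seq_len, dtype=np.int64)        # 0 doubles as BOS and padding
        target_seq = np.full(seq_len, -1, dtype=np.int64)    # -1 positions are ignored by the loss
        input_seq[1:1 + len(ids)] = ids                      # decoder input, shifted right
        target_seq[:len(ids)] = ids                          # prediction targets
        target_seq[len(ids)] = 0                             # 0 as the end-of-sentence token
        return input_seq, target_seq

A per-image dictionary mapping image ids to such arrays would then match the structure the second of those comments guesses at.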
Owner: JDAI-CV (JDAI Computer Vision)