Implementation of 'X-Linear Attention Networks for Image Captioning' [CVPR 2020]

Introduction

This repository is for X-Linear Attention Networks for Image Captioning (CVPR 2020). The original paper can be found here.

Please cite with the following BibTeX:

@inproceedings{xlinear2020cvpr,
  title={X-Linear Attention Networks for Image Captioning},
  author={Pan, Yingwei and Yao, Ting and Li, Yehao and Mei, Tao},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2020}
}

Requirements

Data preparation

  1. Download the bottom-up features and convert them to npz files (a Python 3 conversion sketch follows this list):

python2 tools/create_feats.py --infeats bottom_up_tsv --outfolder ./mscoco/feature/up_down_10_100

  2. Download the annotations into the mscoco folder. For more details about data preparation, refer to self-critical.pytorch.

  3. Download coco-caption and set up the path __C.INFERENCE.COCO_PATH in lib/config.py.

  4. The pretrained models and results can be downloaded here.

  5. The pretrained SENet-154 model can be downloaded here.
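
The conversion script targets Python 2; under Python 3 it raises _csv.Error: iterator should return strings, not bytes (see the first comment below). As a rough guide, here is a minimal Python 3 sketch of the conversion, assuming the standard bottom-up-attention TSV layout (image_id, image_w, image_h, num_boxes, boxes, features, with base64-encoded float32 arrays); the input file name and the npz key are assumptions, not the repository's exact conventions:

    import base64
    import csv
    import sys

    import numpy as np

    csv.field_size_limit(sys.maxsize)  # feature columns exceed csv's default field limit
    FIELDS = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

    # open in text mode ('r'), which is what Python 3's csv module expects
    with open('bottom_up.tsv', 'r') as f:  # hypothetical file name
        for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDS):
            num_boxes = int(item['num_boxes'])
            # each row stores a base64-encoded (num_boxes x 2048) float32 matrix
            feats = np.frombuffer(base64.b64decode(item['features']),
                                  dtype=np.float32).reshape(num_boxes, -1)
            np.savez_compressed(
                './mscoco/feature/up_down_10_100/%s.npz' % item['image_id'],
                feat=feats)  # the 'feat' key name is an assumption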

Training

Train X-LAN model

bash experiments/xlan/train.sh

Train X-LAN model using self-critical training

Copy the pretrained model into experiments/xlan_rl/snapshot and run the script

bash experiments/xlan_rl/train.sh

Train X-LAN transformer model

bash experiments/xtransformer/train.sh

Train X-LAN transformer model using self-critical training

Copy the pretrained model into experiments/xtransformer_rl/snapshot and run the script

bash experiments/xtransformer_rl/train.sh
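
The train.sh scripts launch multi-GPU distributed training (four GPUs in the provided configuration, per the comments below). To restrict training to specific GPUs, setting CUDA_VISIBLE_DEVICES before the script should work, assuming the launcher's process count is adjusted to match (the tracebacks in the comments suggest torch.distributed.launch is used, whose --nproc_per_node must equal the number of visible GPUs). When running several jobs on one machine, give each its own rendezvous port (e.g. --master_port) to avoid the "Address already in use" error. A hypothetical invocation for GPUs 6 and 7:

CUDA_VISIBLE_DEVICES=6,7 bash experiments/xlan/train.sh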

Evaluation

CUDA_VISIBLE_DEVICES=0 python3 main_test.py --folder experiments/model_folder --resume model_epoch
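
For example, to evaluate the released X-LAN checkpoint from epoch 47 on GPU 0 (the same invocation that appears in the test log in the comments below):

CUDA_VISIBLE_DEVICES=0 python3 main_test.py --folder experiments/xlan --resume 47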

Acknowledgements

Thanks to self-critical.pytorch and the awesome PyTorch team for their contributions.

Comments
  • _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?)

Hello, thank you for your work and the code. When I run python3 tools/create_feats.py --infeats bottom_up_tsv --outfolder ./mscoco/feature/up_down_10_100, I get this error: _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?). I am running Python 3.7. Please help me out.

    opened by Tushar-Faroque 4
  • How to generate the visualization of attended image regions along the caption generation processes.

I have a question about how to generate the visualization of attended image regions along the caption generation process. Would you mind releasing some code?

    opened by Archer-Fang 3
  • Problems about the provided annotations file

Hello, thanks for this good work. I think I downloaded and placed the annotations file correctly, because I can read the captions and image_ids out of the data. The problem I met is a KeyError in coco-caption/pycocotools/coco.py when running if self.dataset['type'] == 'instances':. This suggests that the dict self.dataset, read from ./mscoco/misc/captions_val5k.json, should have the key "type", but it doesn't. Please help!

    opened by jlxy 3
  • Reg. Training time

    Hi,

    Thanks for sharing your code here.

Can you please tell us what type of GPUs you trained your model on, how long one epoch took, and how many epochs you ran in total?

Regards, Deepak Mittal

    opened by deepak242424 3
  • transformer results

Hello. Thanks for your work and for sharing the code. Could you please tell me the details of the pure Transformer model you implemented that achieves 128.3 CIDEr? To the best of my knowledge, all implementations achieve a maximum of around 126.6, according to the papers that use the Transformer model. Your paper does not provide details on the Transformer, and there is no supplementary material. So may I kindly ask for the details of your re-implementation of the pure Transformer that achieves 128.3?

    opened by homelifes 3
  • Test results are all 0 when using the author's checkpoint

1. I used the provided caption_model_47.pth from the xlan experiment and the following command to run the test. However, all test metrics are 0 when I use the decode_beam which is inspired by meshed-memory-transformer. https://github.com/JDAI-CV/image-captioning/blob/master/models/att_basic_model.py https://github.com/JDAI-CV/image-captioning/blob/master/models/xtransformer.py

And it's the same case for the xlan+transformer test. I feel the decode_beam (from m2transformer) has problems. Do you have the same issue? Can you help solve it? Thanks.

What I changed in decode_beam(self, **kwargs) is adding .long() after some variables because of type errors. 2) What's inside coco_train_cider.pkl? How can I generate this file for a custom dataset? 3) I also noticed that you use input_seq and target_seq. From my understanding, both of them are the ground-truth caption represented by word indices, correct? The only difference is the 0 and -1 at the end. Can you provide the preprocessing code that converts the ground-truth captions to word indices? (A hypothetical sketch of such an encoding appears after the comments list below.)

Test results:

    python3.8 main_test.py --folder experiments/xlan --resume 47
    Called with args: Namespace(folder='experiments/xlan', resume=47)
    /media/mmlab/data_2/mengya/Code/ImageCaption/image-captioning/lib/config.py:381: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      yaml_cfg = edict(yaml.load(f))
    loading annotations into memory...
    Done (t=0.03s)
    creating index...
    index created!
    139it [00:11, 12.16it/s]
    Loading and preparing results...
    DONE (t=0.01s)
    creating index...
    index created!
    tokenization...
    PTBTokenizer tokenized 307085 tokens at 1576551.76 tokens per second.
    PTBTokenizer tokenized 4999 tokens at 130967.27 tokens per second.
    setting up scorers...
    computing Bleu score...
    {'testlen': 0, 'reflen': 42485, 'guess': [0, 0, 0, 0], 'correct': [0, 0, 0, 0]}
    ratio: 2.3537719195009454e-20
    Bleu_1: 0.000
    Bleu_2: 0.000
    Bleu_3: 0.000
    Bleu_4: 0.000
    computing METEOR score...
    METEOR: 0.000
    computing Rouge score...
    ROUGE_L: 0.000
    computing CIDEr score...
    CIDEr: 0.000
    computing SPICE score...
    Parsing reference captions
    Parsing test captions
    SPICE evaluation took: 3.045 s
    SPICE: 0.000

    opened by XuMengyaAmy 2
  • what to set for input_seq and target_seq for train-data -- asked by a beginner in NLP domain :-)

I want to use your sentence decoder to generate captions for my 2D images. For each image, I have a ground-truth caption.

I will encode each word of my captions to an integer using coco_vocabulary.txt; if I am not mistaken, this is what you did with your captions, and the encoded captions were saved as target_seq for the train data in the pkl files: https://github.com/JDAI-CV/image-captioning/blob/master/datasets/coco_dataset.py#L27

My understanding is that target_seq for the train data is a dictionary whose keys are image ids and whose values are encoded captions.

Am I right?

Looking forward to hearing from you!

    Thanks!

    opened by Hengameh400 2
  • How to run multiple training processes

Hi, I am very interested in your code. I successfully ran it on four GPUs as the given example suggests. However, when I tried to run two training processes (each taking two GPUs), I got errors in distributed parallel training.

    File "main.py", line 354, in <module>
       trainer = Trainer(args)
     File "main.py", line 48, in __init__
       self.setup_network()
     File "main.py", line 98, in setup_network
       find_unused_parameters=True
     File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 298, in __init__
       self.broadcast_bucket_size)
     File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 480, in _distributed_broadcast_coalesced
       dist._broadcast_coalesced(self.process_group, tensors, buffer_size)
    RuntimeError: NCCL error in: /tmp/pip-req-build-ocx5vxk7/torch/lib/c10d/../c10d/NCCLUtils.hpp:48, invalid argument
    

    and

    Traceback (most recent call last):
      File "main.py", line 352, in <module>
        trainer = Trainer(args)
      File "main.py", line 40, in __init__
        backend="nccl", init_method="env://"
      File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 400, in init_process_group
        store, rank, world_size = next(rendezvous(url))
      File "/home/huangyq/anaconda3/envs/py36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 143, in _env_rendezvous_handler
        store = TCPStore(master_addr, master_port, world_size, start_daemon)
    RuntimeError: Address already in use
    

    I wonder how I can solve these problems. Thanks! Looking forward to your reply ^_^

    opened by RubickH 2
  • X-transformer model batch_size setting

Hello, did you train all the models on a single GPU? If so: when training the X-transformer model, the default batch_size setting of 40 runs out of memory (1080 Ti, 12 GB), but if we set it to 10, the CIDEr score comes out lower than yours.

    opened by yaopengzero 2
  • How to configure the number of GPUs being used?

    I have a system with 10 GPUs, and when I start training, multiple (4) GPUs are used for processing.

    1. Where can I configure the number of GPUs being used? e.g. if I only have 2 free GPUs of the 10
    2. How can I specify which GPUs are being used? e.g. GPUs 0-5 are occupied, but I'd like to train on GPUs 6 and above.
    opened by nathanielhobbs 1
  • Diversity augmented beam_search for XLAN with group_size set as 1??

Hello, thanks for this great work. When I was reading the code, I found that in models/att_basic_model.py, DBS (diversity-augmented beam search) is recommended for the xlan model.

        # For the experiments of X-LAN, we use the following beam search code, 
        # which achieves slightly better results but much slower.
        
        #def decode_beam(self, **kwargs):
        #    beam_size = kwargs['BEAM_SIZE']
        #    gv_feat, att_feats, att_mask, p_att_feats = self.preprocess(**kwargs)
        #    batch_size = gv_feat.size(0)
        #
        #    sents = Variable(torch.zeros((cfg.MODEL.SEQ_LEN, batch_size), dtype=torch.long).cuda())
        #    logprobs = Variable(torch.zeros(cfg.MODEL.SEQ_LEN, batch_size).cuda())   
        #    self.done_beams = [[] for _ in range(batch_size)]
        #    for n in range(batch_size):
        #        state = self.init_hidden(beam_size)
        #        gv_feat_beam = gv_feat[n:n+1].expand(beam_size, gv_feat.size(1)).contiguous()
        #        att_feats_beam = att_feats[n:n+1].expand(*((beam_size,)+att_feats.size()[1:])).contiguous()
        #        att_mask_beam = att_mask[n:n+1].expand(*((beam_size,)+att_mask.size()[1:]))
        #        p_att_feats_beam = p_att_feats[n:n+1].expand(*((beam_size,)+p_att_feats.size()[1:])).contiguous() if p_att_feats is not None else None
        #
        #        wt = Variable(torch.zeros(beam_size, dtype=torch.long).cuda())
        #        kwargs = self.make_kwargs(wt, gv_feat_beam, att_feats_beam, att_mask_beam, p_att_feats_beam, state, **kwargs)
        #        logprobs_t, state = self.get_logprobs_state(**kwargs)
        #
        #        self.done_beams[n] = self.beam_search(state, logprobs_t, **kwargs)
        #        sents[:, n] = self.done_beams[n][0]['seq'] 
        #        logprobs[:, n] = self.done_beams[n][0]['logps']
        #    return sents.transpose(0, 1), logprobs.transpose(0, 1)
    

However, in models/basic_model.py, where DBS is implemented, the group size is forced to be 1, which means that diversity-augmented beam search degrades to standard beam search.

            beam_size = kwargs['BEAM_SIZE']
            group_size = 1 #kwargs['GROUP_SIZE']
            diversity_lambda = 0.5 #kwargs['DIVERSITY_LAMBDA']
    

So how can DBS with group_size=1 slightly outperform standard BS for xlan, as the comment above mentions? Thanks a bunch!

    opened by ChenYutongTHU 1
  • Vocabulary from test split

Hi! Thanks for the paper and the available code.

I have what may be a stupid question, but I didn't find a straight answer to it anywhere:

When evaluating the model on the Karpathy test split, some words might not be present in the vocabulary built from the train split. What do you do? Simply remove these words from the captions of the test split?

    opened by gondimjoaom 0
  • About the config of the updown model

Thank you very much for making the code open source. I can find 4 configs for xlan and the transformer in the experiments folder. Can you provide the config for the basic (UpDown) model? Thanks so much.

    opened by WangLanxiao 0
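
For readers with the same question as the input_seq/target_seq comments above, here is a hypothetical Python sketch of the encoding those comments describe (targets shifted one step relative to the decoder input, 0 doubling as the start/end token, -1 marking padding that the loss ignores). The sequence length and token conventions are assumptions, not code from this repository:

    import numpy as np

    def encode_caption(words, word2idx, seq_len=17):
        # Hypothetical sketch: turn a tokenized caption into the aligned
        # input_seq / target_seq pair described in the comments above.
        ids = [word2idx[w] for w in words if w in word2idx]  # drop out-of-vocabulary words
        ids = ids[:seq_len - 1]                              # leave room for the end token
        input_seq = np.zeros(seq_len, dtype=np.int64)        # 0 doubles as BOS and padding
        target_seq = np.full(seq_len, -1, dtype=np.int64)    # -1 positions are ignored by the loss
        input_seq[1:1 + len(ids)] = ids                      # decoder input, shifted right
        target_seq[:len(ids)] = ids                          # prediction targets
        target_seq[len(ids)] = 0                             # 0 as the end-of-sentence token
        return input_seq, target_seq

A per-image dictionary mapping image ids to such arrays would then match the structure the second of those comments guesses at.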
Owner: JDAI-CV (JDAI Computer Vision)