Semi-Autoregressive Transformer for Image Captioning

Overview

This repository contains the code for the paper Semi-Autoregressive Transformer for Image Captioning (SATIC), which keeps the autoregressive property across groups of words but generates the K words within each group in parallel.
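
For intuition about the K flag used throughout this README, here is a toy sketch of semi-autoregressive decoding (illustrative only; the repository's actual decoder lives in models/SAT.py):

def semi_autoregressive_decode(predict_group, max_len, K):
    # Groups are generated left-to-right (autoregressive across groups),
    # while the K words inside a group come from one parallel decoder step.
    caption = []
    while len(caption) < max_len:
        group = predict_group(tuple(caption))  # K words in one step
        caption.extend(group[:K])
    return caption[:max_len]

# Dummy usage with a fake predictor that always emits K placeholder tokens:
print(semi_autoregressive_decode(lambda prefix: ["w"] * 2, max_len=6, K=2))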

Requirements

  • Python 3.6
  • PyTorch 1.6

Prepare data

  1. Use git clone --recurse-submodules to clone this repository, and remember to follow the initialization steps in coco-caption/README.md.
  2. Download the preprocessed dataset from this link and extract it to data/.
  3. Follow this instruction to prepare the adaptive bottom-up features and place them under data/mscoco/. For online test evaluation, follow this instruction to prepare the test features and place them under data/cocotest/.
  4. Download the released checkpoints from here and extract them to save/.
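
After step 4, the layout can be sanity-checked with a few lines of Python (paths inferred from the commands in this README; adjust if your extraction differs):

import os

# Expected locations used by the training/evaluation commands below.
for path in ["data/mscoco", "data/cocotest", "data/cocotalk_label.h5", "save"]:
    print(path, "OK" if os.path.exists(path) else "MISSING")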

Offline Evaluation

To reproduce the reported results, e.g. for SATIC(K=2, bw=1) after self-critical training, run

python3 eval.py --model save/nsc-sat-2-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-2-from-nsc-seqkd/infos_nsc-sat-2-from-nsc-seqkd-best.pkl --batch_size 1 --beam_size 1 --id nsc-sat-2-from-nsc-seqkd

Online Evaluation

Please first run

python3 eval_cocotest.py --input_json data/cocotest.json --input_fc_dir data/cocotest/cocotest_bu_fc --input_att_dir data/cocotest/cocotest_bu_att --input_label_h5 data/cocotalk_label.h5 --num_images -1 --language_eval 0 \
--model save/nsc-sat-4-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-4-from-nsc-seqkd/infos_nsc-sat-4-from-nsc-seqkd-best.pkl --batch_size 32 --beam_size 3 --id captions_test2014_alg_results

and then follow the instructions to upload the results.

Training

  1. In the first training stage, e.g. to train the SATIC(K=2) model with sequence-level distillation and weight initialization, run (the learning-rate schedule behind --noamopt is sketched after step 2)
python3 train.py --noamopt --noamopt_warmup 20000 --label_smoothing 0.0 --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --max_epochs 15 --input_label_h5 data/cocotalk_seq-kd-from-nsc-transformer-baseline-b5_label.h5 --checkpoint_path save/sat-2-from-nsc-seqkd --id sat-2-from-nsc-seqkd --K 2
  2. Then, in the second training stage, first copy the pretrained model from stage one
cd save
./copy_model.sh  sat-2-from-nsc-seqkd    nsc-sat-2-from-nsc-seqkd
cd ..

and then run

python3 train.py --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 1e-5 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --self_critical_after 10 --max_epochs 40 --input_label_h5 data/cocotalk_label.h5 --start_from save/nsc-sat-2-from-nsc-seqkd --checkpoint_path save/nsc-sat-2-from-nsc-seqkd --id nsc-sat-2-from-nsc-seqkd --K 2
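
For reference, the --noamopt flag in the stage-one command selects the transformer warmup schedule from "Attention Is All You Need". A minimal sketch (the factor default is illustrative and d_model mirrors --input_encoding_size 512; these are not necessarily the repo's exact values):

def noam_lr(step, d_model=512, warmup=20000, factor=1.0):
    # Linear warmup for `warmup` steps, then inverse square-root decay.
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)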

Citation

@article{zhou2021semi,
  title={Semi-Autoregressive Transformer for Image Captioning},
  author={Zhou, Yuanen and Zhang, Yong and Hu, Zhenzhen and Wang, Meng},
  journal={arXiv preprint arXiv:2106.09436},
  year={2021}
}

Acknowledgements

This repository is built upon self-critical.pytorch. Thanks for releasing the code.


Comments
  • beam size problem

    Hi,

    I trained the model with a beam size of 1 and it worked well. Now I want to try other values, but when I set beam size to 3 in the train script I get this error:

    iter 2999 (epoch 0), train_loss = 0.770, time/batch = 0.202
    250.90925693511963 ms needed to decode one sentece under batch size 10 and beam size 3
    Traceback (most recent call last):
      File "train.py", line 325, in <module>
        train(opt)
      File "train.py", line 273, in train
        dp_model, lw_model.crit, loader, eval_kwargs)
      File "/mnt/f/satic/eval_utils.py", line 138, in eval_split
        sents_list = [utils.decode_sequence(loader.get_vocab(), _['seq'].unsqueeze(0))[0] for _ in model.done_beams[i]]
      File "/home/maryam/anaconda3/envs/satic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
        type(self).__name__, name))
    torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'done_beams'

    Can you help me fix this? (You report results with different beam sizes in your paper, so I guess the code should be OK.)
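
    A plausible workaround (an untested sketch, not a fix from the authors): nn.DataParallel only forwards registered nn.Module attributes, so a custom attribute such as done_beams has to be read from the wrapped model via .module:

    import torch.nn as nn

    def unwrap(model):
        # DataParallel stores the real model under `.module`; custom attributes
        # like `done_beams` live there, not on the wrapper.
        return model.module if isinstance(model, nn.DataParallel) else model

    # e.g. in eval_utils.py, iterate over unwrap(model).done_beams[i]
    # instead of model.done_beams[i].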

    opened by maryawwm 5
  • the code does not convert IntTensor to LongTensor

    I am training this code with the same environment requirements: Python 3.6, PyTorch 1.6,

    but when I run the first training stage I get this error:

    DataLoader loading json file: data/cocotalk.json
    vocab size is 9487
    DataLoader loading h5 file: data/mscoco/cocobu_fc data/mscoco/cocobu_att data/mscoco/cocobu_box data/cocotalk_seq-kd-from-nsc-transformer-baseline-b5_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    Read data: 0.046845197677612305
    Save ckpt on exception ...
    model saved to save/sat-2-from-nsc-seqkd\model.pth
    Save ckpt done.
    Traceback (most recent call last):
      File "train.py", line 213, in train
        model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag).to(device).long()
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\misc\loss_wrapper.py", line 30, in forward
        student_output = self.model(fc_feats, att_feats, labels, att_masks)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\CaptionModel.py", line 33, in forward
        return getattr(self, '_'+mode)(*args, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 347, in _forward
        out = self.model(att_feats, seq, att_masks, seq_mask)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 42, in forward
        tgt, tgt_mask)
      File "C:\Users\vision\satic\models\SAT.py", line 48, in decode
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
        input = module(input)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 228, in forward
        return self.lut(x) * math.sqrt(self.d_model)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\sparse.py", line 126, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\functional.py", line 1814, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

    opened by maryawwm 4