Semi-Autoregressive Transformer for Image Captioning

Overview

This repository contains the code for the paper Semi-Autoregressive Transformer for Image Captioning (SATIC), which keeps the autoregressive property across groups of words but generates the K words within each group in parallel.
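
For intuition about the K flag used throughout this README, here is a toy sketch of semi-autoregressive decoding (illustrative only; the repository's actual decoder lives in models/SAT.py):

def semi_autoregressive_decode(predict_group, max_len, K):
    # Groups are generated left-to-right (autoregressive across groups),
    # while the K words inside a group come from one parallel decoder step.
    caption = []
    while len(caption) < max_len:
        group = predict_group(tuple(caption))  # K words in one step
        caption.extend(group[:K])
    return caption[:max_len]

# Dummy usage with a fake predictor that always emits K placeholder tokens:
print(semi_autoregressive_decode(lambda prefix: ["w"] * 2, max_len=6, K=2))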

Requirements

  • Python 3.6
  • PyTorch 1.6

Prepare data

  1. Use git clone --recurse-submodules to clone this repository, and remember to follow the initialization steps in coco-caption/README.md.
  2. Download the preprocessed dataset from this link and extract it to data/.
  3. Follow this instruction to prepare the adaptive bottom-up features and place them under data/mscoco/. For online test evaluation, follow this instruction to prepare the test features and place them under data/cocotest/.
  4. Download the released checkpoints from here and extract them to save/.
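
After step 4, the layout can be sanity-checked with a few lines of Python (paths inferred from the commands in this README; adjust if your extraction differs):

import os

# Expected locations used by the training/evaluation commands below.
for path in ["data/mscoco", "data/cocotest", "data/cocotalk_label.h5", "save"]:
    print(path, "OK" if os.path.exists(path) else "MISSING")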

Offline Evaluation

To reproduce the reported results, e.g. for SATIC(K=2, bw=1) after self-critical training, run

python3 eval.py --model save/nsc-sat-2-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-2-from-nsc-seqkd/infos_nsc-sat-2-from-nsc-seqkd-best.pkl --batch_size 1 --beam_size 1 --id nsc-sat-2-from-nsc-seqkd

Online Evaluation

Please first run

python3 eval_cocotest.py --input_json data/cocotest.json --input_fc_dir data/cocotest/cocotest_bu_fc --input_att_dir data/cocotest/cocotest_bu_att --input_label_h5 data/cocotalk_label.h5 --num_images -1 --language_eval 0 \
--model save/nsc-sat-4-from-nsc-seqkd/model-best.pth --infos_path save/nsc-sat-4-from-nsc-seqkd/infos_nsc-sat-4-from-nsc-seqkd-best.pkl --batch_size 32 --beam_size 3 --id captions_test2014_alg_results

and then follow the instructions to upload the results.

Training

  1. In the first training stage, e.g. to train the SATIC(K=2) model with sequence-level distillation and weight initialization, run (the learning-rate schedule behind --noamopt is sketched after step 2)
python3 train.py --noamopt --noamopt_warmup 20000 --label_smoothing 0.0 --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --max_epochs 15 --input_label_h5 data/cocotalk_seq-kd-from-nsc-transformer-baseline-b5_label.h5 --checkpoint_path save/sat-2-from-nsc-seqkd --id sat-2-from-nsc-seqkd --K 2
  2. Then, in the second training stage, first copy the pretrained model from stage one
cd save
./copy_model.sh  sat-2-from-nsc-seqkd    nsc-sat-2-from-nsc-seqkd
cd ..

and then run

python3 train.py --seq_per_img 5 --batch_size 10 --beam_size 1 --learning_rate 1e-5 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --save_checkpoint_every 3000 --language_eval 1 --val_images_use 5000 --self_critical_after 10 --max_epochs 40 --input_label_h5 data/cocotalk_label.h5 --start_from save/nsc-sat-2-from-nsc-seqkd --checkpoint_path save/nsc-sat-2-from-nsc-seqkd --id nsc-sat-2-from-nsc-seqkd --K 2
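
For reference, the --noamopt flag in the stage-one command selects the transformer warmup schedule from "Attention Is All You Need". A minimal sketch (the factor default is illustrative and d_model mirrors --input_encoding_size 512; these are not necessarily the repo's exact values):

def noam_lr(step, d_model=512, warmup=20000, factor=1.0):
    # Linear warmup for `warmup` steps, then inverse square-root decay.
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)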

Citation

@article{zhou2021semi,
  title={Semi-Autoregressive Transformer for Image Captioning},
  author={Zhou, Yuanen and Zhang, Yong and Hu, Zhenzhen and Wang, Meng},
  journal={arXiv preprint arXiv:2106.09436},
  year={2021}
}

Acknowledgements

This repository is built upon self-critical.pytorch. Thanks for releasing the code.


Comments
  • beam size problem

    Hi,

    I trained the model with a beam size of 1 and it worked well. Now I want to try other values, but when I set beam size to 3 in the train script I get this error:

    iter 2999 (epoch 0), train_loss = 0.770, time/batch = 0.202
    250.90925693511963 ms needed to decode one sentece under batch size 10 and beam size 3
    Traceback (most recent call last):
      File "train.py", line 325, in <module>
        train(opt)
      File "train.py", line 273, in train
        dp_model, lw_model.crit, loader, eval_kwargs)
      File "/mnt/f/satic/eval_utils.py", line 138, in eval_split
        sents_list = [utils.decode_sequence(loader.get_vocab(), _['seq'].unsqueeze(0))[0] for _ in model.done_beams[i]]
      File "/home/maryam/anaconda3/envs/satic/lib/python3.7/site-packages/torch/nn/modules/module.py", line 772, in __getattr__
        type(self).__name__, name))
    torch.nn.modules.module.ModuleAttributeError: 'DataParallel' object has no attribute 'done_beams'

    Can you help me fix this? (You report results with different beam sizes in your paper, so I guess the code should be OK.)
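
    A plausible workaround (an untested sketch, not a fix from the authors): nn.DataParallel only forwards registered nn.Module attributes, so a custom attribute such as done_beams has to be read from the wrapped model via .module:

    import torch.nn as nn

    def unwrap(model):
        # DataParallel stores the real model under `.module`; custom attributes
        # like `done_beams` live there, not on the wrapper.
        return model.module if isinstance(model, nn.DataParallel) else model

    # e.g. in eval_utils.py, iterate over unwrap(model).done_beams[i]
    # instead of model.done_beams[i].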

    opened by maryawwm 5
  • the code does not convert IntTensor to LongTensor

    I am training this code with the same environment requirements: Python 3.6, PyTorch 1.6,

    but when I run the first training stage I get this error:

    DataLoader loading json file: data/cocotalk.json
    vocab size is 9487
    DataLoader loading h5 file: data/mscoco/cocobu_fc data/mscoco/cocobu_att data/mscoco/cocobu_box data/cocotalk_seq-kd-from-nsc-transformer-baseline-b5_label.h5
    max sequence length in data is 16
    read 123287 image features
    assigned 113287 images to split train
    assigned 5000 images to split val
    assigned 5000 images to split test
    Read data: 0.046845197677612305
    Save ckpt on exception ...
    model saved to save/sat-2-from-nsc-seqkd\model.pth
    Save ckpt done.
    Traceback (most recent call last):
      File "train.py", line 213, in train
        model_out = dp_lw_model(fc_feats, att_feats, labels, masks, att_masks, data['gts'], torch.arange(0, len(data['gts'])), sc_flag).to(device).long()
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\data_parallel.py", line 155, in forward
        outputs = self.parallel_apply(replicas, inputs, kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\data_parallel.py", line 165, in parallel_apply
        return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 85, in parallel_apply
        output.reraise()
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\_utils.py", line 395, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in replica 0 on device 0.
    Original Traceback (most recent call last):
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\parallel\parallel_apply.py", line 60, in _worker
        output = module(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\misc\loss_wrapper.py", line 30, in forward
        student_output = self.model(fc_feats, att_feats, labels, att_masks)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\CaptionModel.py", line 33, in forward
        return getattr(self, '_'+mode)(*args, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 347, in _forward
        out = self.model(att_feats, seq, att_masks, seq_mask)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 42, in forward
        tgt, tgt_mask)
      File "C:\Users\vision\satic\models\SAT.py", line 48, in decode
        return self.decoder(self.tgt_embed(tgt), memory, src_mask, tgt_mask)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\container.py", line 117, in forward
        input = module(input)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\satic\models\SAT.py", line 228, in forward
        return self.lut(x) * math.sqrt(self.d_model)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\modules\sparse.py", line 126, in forward
        self.norm_type, self.scale_grad_by_freq, self.sparse)
      File "C:\Users\vision\.conda\envs\caption\lib\site-packages\torch\nn\functional.py", line 1814, in embedding
        return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.cuda.IntTensor instead (while checking arguments for embedding)

    opened by maryawwm 4