VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Overview

VisualGPT

Our paper: VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Main Architecture of Our VisualGPT

[architecture figure]

Download the GPT-2 pretrained weights

curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
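If the S3 link above is unavailable, an alternative (a sketch, not part of the original instructions) is to pull the same small GPT-2 checkpoint through the transformers library and copy it to the filename the training script expects. Note that newer transformers releases may save the weights as safetensors instead of a .bin file:

# alternative to the curl download; assumes the transformers package is installed
import shutil
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")     # small 124M GPT-2
model.save_pretrained("gpt2_hf")                    # writes gpt2_hf/pytorch_model.bin
shutil.copy("gpt2_hf/pytorch_model.bin", "gpt2-pytorch_model.bin")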

Environment setup

Clone the repository and create the visualgpt conda environment

conda env create -f environment.yml
conda activate visualgpt

Then download spacy data

python -m spacy download en
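Note: on recent spaCy releases (3.x) the en shortcut has been removed; if the command above fails, the equivalent model name is:

python -m spacy download en_core_web_sm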

Data preparation

We provide the COCO dataset for download. Please download the annotation file annotations.zip and extract it, as well as coco_detections.hdf5, which stores the detection features as key-value pairs: each key is an image id and each value is a tensor of shape (N, 2048), where N is the number of detections.
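As a quick sanity check, the layout described above can be inspected with h5py. This is a minimal sketch assuming the keys are plain image ids; the exact key naming inside coco_detections.hdf5 may differ:

import h5py

with h5py.File("coco_detections.hdf5", "r") as f:
    key = next(iter(f.keys()))          # e.g. an image id
    print(key, f[key].shape)            # expected shape (N, 2048), N = number of detections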

Code structure

Create the log folder with mkdir logs and start the training.

Train the model

python train_visualGPT.py --batch_size 50 --head 12 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data

Acknowledgement

This code uses resources from Meshed Memory Transformer and Transformers.

Please cite our paper using the following BibTeX:

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

Comments
  • Explaining the cross attention

    Hi,

    thank you for answering my last question!

    I am currently trying to explain part of the caption generation process, and I am interested in Figure 5, where you highlight the visual scores on the generated captions.

    However, if my understanding is correct, you have not included any code for your explanation method in the repo. It would be really appreciated if you could share a code example of the visualization for better understanding!

    Cheers!

    opened by TCBpenta8 4
  • Trying to run code on IU X-ray database

    Hi, I've been interested in image captioning, specifically automatic medical report generation, and I stumbled across your VisualGPT, which seemed to take a promising approach. I've been trying to get it to work with other datasets, specifically IU X-ray as mentioned in your article.

    I can't figure out how you have set up the COCO dataset and how I should structure IU X-ray to fit into your code. Is it still supposed to use coco_detections.hdf5? Or am I supposed to create an hdf5 file for IU?

    opened by PurpleDish 4
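    For anyone with the same question: the README describes coco_detections.hdf5 as a mapping from image id to an (N, 2048) tensor of detection features, so one plausible approach (a sketch under that assumption, not the authors' pipeline) is to extract region features for each IU X-ray image with your own detector and write them to an hdf5 file in the same format:

    # hypothetical sketch for building a feature file for a custom dataset such as IU X-ray
    import h5py
    import numpy as np

    def write_features(feature_dict, out_path="iu_xray_detections.hdf5"):
        # feature_dict maps image id (str) -> (N, 2048) array of region features
        with h5py.File(out_path, "w") as f:
            for image_id, feats in feature_dict.items():
                f.create_dataset(str(image_id), data=feats.astype(np.float32))

    # toy example with random features standing in for a real detector's output
    write_features({"CXR1_1_IM-0001": np.random.randn(36, 2048)})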
  • About total data performance

    Hi, I am glad to have read this article. This paper is the first work that focuses on efficiently adapting large pretrained language models for image captioning, which inspires me a lot! The results section mainly shows results from training on subsets of the data at different sampling rates. Therefore, I would like to ask: have you tested the results on the full dataset without sampling? How does the performance compare to M2Transformer?

    opened by bugczw 4
  • How to do inference?

    There don't appear to be any examples of how to do inference with this model.

    I'm pretty confused: during training you feed in the entire label along with the image and then use the label for the loss. What am I supposed to feed in if I have a new image?

    Additionally, do I just use gpt2-large tokenizer to decode the model outputs?

    opened by bth5032 2
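    The repository does not document an inference entry point, but conceptually the model is autoregressive: at test time you feed only the image features plus the tokens generated so far, not the ground-truth caption. Below is a rough, generic illustration of greedy decoding with the standard GPT-2 tokenizer; the model call is hypothetical, and the repo's own decoding utilities (e.g. beam search) may differ:

    import torch
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def greedy_caption(model, detections, max_len=20):
        # start from the beginning-of-sequence token and extend one token at a time
        tokens = [tokenizer.bos_token_id]
        for _ in range(max_len):
            input_ids = torch.tensor(tokens).unsqueeze(0)
            logits = model(detections, input_ids)        # hypothetical interface
            next_token = logits[0, -1].argmax().item()   # most likely next token
            if next_token == tokenizer.eos_token_id:
                break
            tokens.append(next_token)
        return tokenizer.decode(tokens[1:])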
  • Running code on IU X-Ray dataset

    Hello, I am interested in running your VisualGPT model on the IU X-ray dataset. Can you please explain how I can use this model to train on that dataset? I saw issue #4 but I was not able to understand how to create a .h5 file for the IU X-ray dataset. Could you please walk me through how to set up the .h5 file for IU X-ray?

    opened by dwij2212 2
  • Batch size in Evaluation Strategy

    Hi, Good to see a new idea in image captioning!

    Here is my question. I noticed that in your evaluation stage you input the samples one by one, as in 'train_visualGPT.py' line 62. I also found that you set the batch size in validation and evaluation to be 5 times smaller than the one used in training. Is there a specific reason for these choices? The evaluation actually supports a mini-batch strategy, and the current approach will take a very long time if the evaluation set is huge.

    I am kind of a freshman in this area so my question might be silly. Feel free to let me know what you think.

    Regards,

    opened by TCBpenta8 2
  • About memory overflow error during training

    Hi! Thanks for the code. When training reaches batch 42, I get the following error on 4x GTX 2080 Ti: "CUDA: out of memory, tried to allocate...". I set the batch size to 10 and it still occurs. Is it purely a hardware problem? What device did you use to train the model? I noticed that you said a year ago that your code doesn't support multi-GPU training; is it still not supported? Thank you!

    opened by Wangyf1998 1
  • Using pip/requirements.txt instead of conda

    Hi, the requirements.txt doesn't work in this repo because some packages are not available on pypi (or at least not for python 3.8).

    I just wanted to dump the steps I had to take to make this work.

    First I needed to find a set of libraries that could work together in requirements.txt. This config seems to work for me:

    absl-py==0.8.1
    asn1crypto==1.2.0
    cachetools==4.1.1
    certifi==2019.9.11
    cffi==1.13.2
    chardet==3.0.4
    click==7.1.2
    cryptography==2.8
    cycler==0.10.0
    cymem==2.0.2
    Cython==0.29.14
    cytoolz==0.9.0.1
    #dataclasses==0.7
    dill==0.3.2
    #en-core-web-sm==2.0.0
    filelock==3.0.12
    future==0.17.1
    google-auth==1.21.1
    google-auth-oauthlib==0.4.1
    grpcio==1.25.0
    h5py==2.8.0
    idna==2.8
    joblib==0.16.0
    kiwisolver==1.1.0
    Markdown==3.1.1
    matplotlib==2.2.3
    mkl-fft==1.3.0
    mkl-random==1.2.2
    mkl-service==2.4.0
    msgpack==0.6.2
    msgpack-numpy==0.4.4.3
    multiprocess==0.70.9
    murmurhash==0.28.0
    numpy==1.16.4
    oauthlib==3.1.0
    packaging==20.4
    pathlib==1.0.1
    pathos==0.2.3
    Pillow==6.2.1
    plac==0.9.6
    pox==0.2.7
    ppft==1.6.6.1
    preshed==2.0.1
    protobuf==3.10.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pycocotools==2.0.3
    pycparser==2.19
    pyOpenSSL==19.1.0
    pyparsing==2.4.5
    PySocks==1.7.1
    python-dateutil==2.8.1
    pytz==2019.3
    regex==2017.4.5
    requests==2.22.0
    requests-oauthlib==1.3.0
    rsa==4.6
    sacremoses==0.0.43
    sentencepiece==0.1.91
    six==1.13.0
    spacy==2.1.0
    tensorboard==2.3.0
    tensorboard-plugin-wit==1.7.0
    termcolor==1.1.0
    thinc==7.0.2
    tokenizers==0.8.1rc2
    toolz==0.10.0
    torch==1.6.0
    torchtext==0.7.0
    tqdm==4.32.2
    transformers==3.1.0
    ujson==1.35
    urllib3==1.24.2
    Werkzeug==0.16.0
    wrapt==1.10.11
    

    The issues with the normal requirements are thinc, spacy and the mkl libs. Afterwards, I needed to upgrade numpy to the latest version (numpy==1.22.0) in order to fix some runtime errors.

    I also had to update torch after the fact to get CUDA 11 working; torch 1.8 seems to work. Installed with pip install -U --force-reinstall --no-cache-dir torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

    After that it seems to be training.

    opened by bth5032 1
  • Variable mention before initialization

    https://github.com/Vision-CAIR/VisualGPT/blob/72d6b741da6ecdd626e486596bb5057f0ffb8511/train_visualGPT.py#L150

    In line 150 of train_visualGPT.py, a variable named fp16 is referenced. This variable was never initialized in the arguments.

    opened by KaratzasBasil 1
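    A minimal workaround for this (an assumption about the intended flag, not the authors' fix) is to declare the missing argument so the parsed arguments carry an fp16 value:

    # sketch: add to the argparse section of train_visualGPT.py
    parser.add_argument('--fp16', action='store_true',
                        help='enable mixed-precision training (defaults to False)')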
  • experiment_log_last.pth not found after Epoch 0 evaluation step has been completed

    First of all, thanks for sharing your code.

    After the epoch 0 evaluation step completed, I got the following error: "FileNotFoundError: [Errno 2] No such file or directory: 'saved_models/experiment_log_last.pth'"

    Could you please help me solve it?

    opened by YasmineHamdy 1
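    A likely cause (an assumption, not confirmed by the authors) is that the saved_models directory does not exist when the script tries to write or reload the checkpoint. Creating it up front, in the same way the README creates the logs folder, is a simple workaround:

    mkdir saved_models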
  • About checkpoints on 0.1, 0.5, 1% on MS-COCO and IU X-ray?

    Dear friend,

    Thank you for your novel work on low-resource image captioning. However, I wonder why you do not provide checkpoints for all the baselines you used and for your proposed method on MS-COCO and IU X-Ray, as well as for your 0.1%, 0.5% and 1% MS-COCO training splits.

    It seems that this repo is only set up to train on the MS-COCO dataset; what about IU X-Ray? Did you modify https://github.com/cuhksz-nlp/R2Gen or use this repo directly for those experiments?

    I think the above points should be made clear. Thank you very much.

    opened by caodoanh2001 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file can perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
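    For reference, the class of fix described above amounts to validating member paths before extraction. A minimal sketch of that check (not the exact patch submitted in the pull request):

    # reject tar members that would escape the destination directory (CVE-2007-4559)
    import os
    import tarfile

    def safe_extractall(tar_path, dest="."):
        with tarfile.open(tar_path) as tar:
            dest_root = os.path.realpath(dest)
            for member in tar.getmembers():
                target = os.path.realpath(os.path.join(dest, member.name))
                if target != dest_root and not target.startswith(dest_root + os.sep):
                    raise RuntimeError("blocked path traversal in tar member: " + member.name)
            tar.extractall(dest)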
  • ModuleNotFoundError: No module named 'transformers'

    I found the above error while running the shell command in Colab.

    !python /content/VisualGPT/train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path /content/drive/MyDrive/coco_detections.hdf5 --annotation_folder /content/annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --lr 1e-4 --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data

    opened by Zah-Ram 0
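    This usually just means the Python environment was not set up (see the conda environment instructions above, or the pip discussion in the earlier issue). Installing the package, e.g. the version pinned in that requirements list, should resolve it:

    pip install transformers==3.1.0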
Owner
Vision CAIR Research Group, KAUST, supported by Mohamed Elhoseiny