VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Overview

VisualGPT

Our paper: VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning

Main Architecture of Our VisualGPT

[architecture figure]

Download the GPT-2 pretrained weights

curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin
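If the S3 link above is unavailable, an alternative (a sketch, not part of the original instructions) is to pull the same small GPT-2 checkpoint through the transformers library and copy it to the filename the training script expects. Note that newer transformers releases may save the weights as safetensors instead of a .bin file:

# alternative to the curl download; assumes the transformers package is installed
import shutil
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")     # small 124M GPT-2
model.save_pretrained("gpt2_hf")                    # writes gpt2_hf/pytorch_model.bin
shutil.copy("gpt2_hf/pytorch_model.bin", "gpt2-pytorch_model.bin")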

Environment setup

Clone the repository and create the visualgpt conda environment

conda env create -f environment.yml
conda activate visualgpt

Then download spacy data

python -m spacy download en
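Note: on recent spaCy releases (3.x) the en shortcut has been removed; if the command above fails, the equivalent model name is:

python -m spacy download en_core_web_sm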

Data preparation

We provide the COCO dataset for download. Please download the annotation file annotations.zip and extract it, as well as coco_detections.hdf5, which stores the detection features as key-value pairs: each key is an image id and each value is a tensor of shape (N, 2048), where N is the number of detections.
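As a quick sanity check, the layout described above can be inspected with h5py. This is a minimal sketch assuming the keys are plain image ids; the exact key naming inside coco_detections.hdf5 may differ:

import h5py

with h5py.File("coco_detections.hdf5", "r") as f:
    key = next(iter(f.keys()))          # e.g. an image id
    print(key, f[key].shape)            # expected shape (N, 2048), N = number of detections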

Code structure

Create the log folder with mkdir logs and start the training.

Train the model

python train_visualGPT.py --batch_size 50 --head 12 --features_path coco_detections.hdf5 --annotation_folder annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data

Acknowledgement

This code uses resources from Meshed Memory Transformer and Transformers.

Please cite our paper using the following BibTeX:

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Image Captioning by Balancing Visual Input and Linguistic Knowledge from Pretraining},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

@article{chen2021visualgpt,
  title={VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning},
  author={Chen, Jun and Guo, Han and Yi, Kai and Li, Boyang and Elhoseiny, Mohamed},
  journal={arXiv preprint arXiv:2102.10407},
  year={2021}
}

Comments
  • Explaining the cross attention

    Hi,

    thank you for answering my last question!

    I am currently trying to explain part of the caption generation process, and I am interested in Figure 5, where you highlight the visual scores on the generated captions.

    However, if my understanding is correct, you have not included any code for your explanation method in the repo. It would be really appreciated if you could share a code example of the visualization for better understanding!

    Cheers!

    opened by TCBpenta8 4
  • Trying to run code on IU X-ray database

    Hi, I've been interested in image captioning, specifically automatic medical report generation, and I stumbled across your VisualGPT, which seemed to take a promising approach. I've been trying to get it to work with other datasets, specifically IU X-ray as mentioned in your article.

    I can't figure out how you have set up the COCO dataset and how I should structure IU X-ray to fit into your code. Is it still supposed to use coco_detections.hdf5? Or am I supposed to create an hdf5 file for IU?

    opened by PurpleDish 4
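    For anyone with the same question: the README describes coco_detections.hdf5 as a mapping from image id to an (N, 2048) tensor of detection features, so one plausible approach (a sketch under that assumption, not the authors' pipeline) is to extract region features for each IU X-ray image with your own detector and write them to an hdf5 file in the same format:

    # hypothetical sketch for building a feature file for a custom dataset such as IU X-ray
    import h5py
    import numpy as np

    def write_features(feature_dict, out_path="iu_xray_detections.hdf5"):
        # feature_dict maps image id (str) -> (N, 2048) array of region features
        with h5py.File(out_path, "w") as f:
            for image_id, feats in feature_dict.items():
                f.create_dataset(str(image_id), data=feats.astype(np.float32))

    # toy example with random features standing in for a real detector's output
    write_features({"CXR1_1_IM-0001": np.random.randn(36, 2048)})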
  • About total data performance

    Hi, I am glad to have read this article. This paper is the first work that focuses on efficiently adapting large pretrained language models for image captioning, which inspires me a lot! The results section mainly shows results from training on subsets of the data at different sampling rates. Therefore, I would like to ask: have you tested the results on the full dataset without sampling? How does the performance compare to M2Transformer?

    opened by bugczw 4
  • How to do inference?

    There don't appear to be any examples of how to do inference with this model.

    I'm pretty confused: during training you feed in the entire label along with the image and then use the label for the loss. What am I supposed to feed in if I have a new image?

    Additionally, do I just use gpt2-large tokenizer to decode the model outputs?

    opened by bth5032 2
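    The repository does not document an inference entry point, but conceptually the model is autoregressive: at test time you feed only the image features plus the tokens generated so far, not the ground-truth caption. Below is a rough, generic illustration of greedy decoding with the standard GPT-2 tokenizer; the model call is hypothetical, and the repo's own decoding utilities (e.g. beam search) may differ:

    import torch
    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def greedy_caption(model, detections, max_len=20):
        # start from the beginning-of-sequence token and extend one token at a time
        tokens = [tokenizer.bos_token_id]
        for _ in range(max_len):
            input_ids = torch.tensor(tokens).unsqueeze(0)
            logits = model(detections, input_ids)        # hypothetical interface
            next_token = logits[0, -1].argmax().item()   # most likely next token
            if next_token == tokenizer.eos_token_id:
                break
            tokens.append(next_token)
        return tokenizer.decode(tokens[1:])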
  • Running code on IU X-Ray dataset

    Hello, I am interested in running your VisualGPT model on the IU X-ray dataset. Can you please explain how I can use this model to train on that dataset? I saw issue #4 but I was not able to understand how to create a .h5 file for the IU X-ray dataset. Could you please walk me through how to set up the .h5 file for IU X-ray?

    opened by dwij2212 2
  • Batch size in Evaluation Strategy

    Hi, Good to see a new idea in image captioning!

    Here is my question. I noticed that in your evaluation stage you input the samples one by one, as in 'train_visualGPT.py' line 62. I also found that you set the batch size in validation and evaluation to be 5 times smaller than the one used in training. Is there a specific reason for these choices? The evaluation actually supports a mini-batch strategy, and the current approach will take a very long time if the evaluation set is huge.

    I am kind of a freshman in this area so my question might be silly. Feel free to let me know what you think.

    Regards,

    opened by TCBpenta8 2
  • About memory overflow error during training

    Hi! Thanks for the code. When training reaches batch 42, I get the following error on 4x GTX 2080 Ti: "CUDA: out of memory, tried to allocate...". I set the batch size to 10 and it still occurs. Is it purely a hardware problem? What device did you use to train the model? I noticed that you said a year ago that your code doesn't support multi-GPU training; is it still not supported? Thank you!

    opened by Wangyf1998 1
  • Using pip/requirements.txt instead of conda

    Hi, the requirements.txt doesn't work in this repo because some packages are not available on pypi (or at least not for python 3.8).

    I just wanted to dump the steps I had to take to make this work.

    First I needed to find a set of libraries that could work together in requirements.txt. This config seems to work for me:

    absl-py==0.8.1
    asn1crypto==1.2.0
    cachetools==4.1.1
    certifi==2019.9.11
    cffi==1.13.2
    chardet==3.0.4
    click==7.1.2
    cryptography==2.8
    cycler==0.10.0
    cymem==2.0.2
    Cython==0.29.14
    cytoolz==0.9.0.1
    #dataclasses==0.7
    dill==0.3.2
    #en-core-web-sm==2.0.0
    filelock==3.0.12
    future==0.17.1
    google-auth==1.21.1
    google-auth-oauthlib==0.4.1
    grpcio==1.25.0
    h5py==2.8.0
    idna==2.8
    joblib==0.16.0
    kiwisolver==1.1.0
    Markdown==3.1.1
    matplotlib==2.2.3
    mkl-fft==1.3.0
    mkl-random==1.2.2
    mkl-service==2.4.0
    msgpack==0.6.2
    msgpack-numpy==0.4.4.3
    multiprocess==0.70.9
    murmurhash==0.28.0
    numpy==1.16.4
    oauthlib==3.1.0
    packaging==20.4
    pathlib==1.0.1
    pathos==0.2.3
    Pillow==6.2.1
    plac==0.9.6
    pox==0.2.7
    ppft==1.6.6.1
    preshed==2.0.1
    protobuf==3.10.0
    pyasn1==0.4.8
    pyasn1-modules==0.2.8
    pycocotools==2.0.3
    pycparser==2.19
    pyOpenSSL==19.1.0
    pyparsing==2.4.5
    PySocks==1.7.1
    python-dateutil==2.8.1
    pytz==2019.3
    regex==2017.4.5
    requests==2.22.0
    requests-oauthlib==1.3.0
    rsa==4.6
    sacremoses==0.0.43
    sentencepiece==0.1.91
    six==1.13.0
    spacy==2.1.0
    tensorboard==2.3.0
    tensorboard-plugin-wit==1.7.0
    termcolor==1.1.0
    thinc==7.0.2
    tokenizers==0.8.1rc2
    toolz==0.10.0
    torch==1.6.0
    torchtext==0.7.0
    tqdm==4.32.2
    transformers==3.1.0
    ujson==1.35
    urllib3==1.24.2
    Werkzeug==0.16.0
    wrapt==1.10.11
    

    The issues with the normal requirements are thinc, spacy and the mkl libs. Afterwards, I needed to upgrade numpy to the latest version (numpy==1.22.0) in order to fix some runtime errors.

    I also had to update torch after the fact to get CUDA 11 working; torch 1.8 seems to work. Installed with pip install -U --force-reinstall --no-cache-dir torch==1.8.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html

    After that it seems to be training.

    opened by bth5032 1
  • Variable mention before initialization

    https://github.com/Vision-CAIR/VisualGPT/blob/72d6b741da6ecdd626e486596bb5057f0ffb8511/train_visualGPT.py#L150

    In line 150 of train_visualGPT.py, a variable named fp16 is referenced. This variable was never initialized in the arguments.

    opened by KaratzasBasil 1
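    A minimal workaround for this (an assumption about the intended flag, not the authors' fix) is to declare the missing argument so the parsed arguments carry an fp16 value:

    # sketch: add to the argparse section of train_visualGPT.py
    parser.add_argument('--fp16', action='store_true',
                        help='enable mixed-precision training (defaults to False)')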
  • experiment_log_last.pth not found after Epoch 0 evaluation step has been completed

    First of all, thanks for sharing your code.

    After the epoch 0 evaluation step completed, I got the following error: "FileNotFoundError: [Errno 2] No such file or directory: 'saved_models/experiment_log_last.pth'"

    Could you please help me solve it?

    opened by YasmineHamdy 1
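    A likely cause (an assumption, not confirmed by the authors) is that the saved_models directory does not exist when the script tries to write or reload the checkpoint. Creating it up front, in the same way the README creates the logs folder, is a simple workaround:

    mkdir saved_models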
  • About checkpoints on 0.1, 0.5, 1% on MS-COCO and IU X-ray?

    Dear friend,

    Thank you for your novel work on low-resource image captioning. However, I wonder why you do not provide checkpoints for all the baselines you used and for your proposed method on MS-COCO and IU X-Ray, as well as for your 0.1%, 0.5% and 1% MS-COCO training splits.

    It seems that this repo is only set up to train on the MS-COCO dataset; what about IU X-Ray? Did you modify https://github.com/cuhksz-nlp/R2Gen or use this repo directly for those experiments?

    I think the above points should be made clear. Thank you very much.

    opened by caodoanh2001 0
  • CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file can perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks whether all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions, you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
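    For reference, the class of fix described above amounts to validating member paths before extraction. A minimal sketch of that check (not the exact patch submitted in the pull request):

    # reject tar members that would escape the destination directory (CVE-2007-4559)
    import os
    import tarfile

    def safe_extractall(tar_path, dest="."):
        with tarfile.open(tar_path) as tar:
            dest_root = os.path.realpath(dest)
            for member in tar.getmembers():
                target = os.path.realpath(os.path.join(dest, member.name))
                if target != dest_root and not target.startswith(dest_root + os.sep):
                    raise RuntimeError("blocked path traversal in tar member: " + member.name)
            tar.extractall(dest)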
  • ModuleNotFoundError: No module named 'transformers'

    I found the above error while running the shell command in Colab.

    !python /content/VisualGPT/train_visualGPT.py --batch_size 50 --head 12 --tau 0.2 --features_path /content/drive/MyDrive/coco_detections.hdf5 --annotation_folder /content/annotations --lr 1e-4 --gpt_model_type gpt --random_seed 42 --log_file logs/log --exp_name experiment_log --lr 1e-4 --decoder_layer 12 --optimizer_type adamw --gradient_accumulation_steps 2 --train_percentage 0.001 --split_train_data

    opened by Zah-Ram 0
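    This usually just means the Python environment was not set up (see the conda environment instructions above, or the pip discussion in the earlier issue). Installing the package, e.g. the version pinned in that requirements list, should resolve it:

    pip install transformers==3.1.0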
Owner
Vision CAIR Research Group, KAUST, supported by Mohamed Elhoseiny