VLP
This repo hosts the source code for our AAAI 2020 work Vision-Language Pre-training (VLP). We have released the model pre-trained on the Conceptual Captions dataset, as well as models fine-tuned on COCO Captions and Flickr30k for image captioning and on VQA 2.0 for VQA.
Installation
Conda Environment (Option I, Recommended)
- Recursively clone the repo over SSH to include the coco and pythia submodules (if you forget --recursive, see the note after this list):
git clone --recursive [email protected]:LuoweiZhou/VLP.git
or clone with https:
git clone --recursive https://github.com/LuoweiZhou/VLP.git
- Install CUDA (e.g., 10.0), CUDNN (e.g., v7.5), and Miniconda (either Miniconda2 or 3, version 4.6+).
- Run the following commands to set up the conda env and install Python packages:
MINICONDA_ROOT=[to your Miniconda root directory] # e.g., /home/[usrname]/miniconda3
cd VLP
conda env create -f misc/vlp.yml --prefix $MINICONDA_ROOT/envs/vlp
conda activate vlp
- Finally, cd to the repo root directory and install other dependencies by running:
./setup.sh
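As noted in the first step, the coco and pythia submodules are required. If the repo was cloned without --recursive, they can be fetched afterwards with standard git commands (nothing repo-specific here):
cd VLP
git submodule update --init --recursive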
To support language evaluation (SPICE), run
cd coco-caption
./get_stanford_models.sh
Docker Image (Option II)
First, install or upgrade to the latest Docker (e.g., set <VERSION_STRING> to 5:19.03.2~3-0~ubuntu-xenial). Then pull our docker image:
docker pull luzhou/vlp
Before running the container, you need to declare an environment variable pointing to your data root ($DATA_ROOT, see Data Preparation); it will be attached as a volume to the container. Finally, install nvidia-container-toolkit and run the docker image in a fresh container:
docker run --gpus all --name vlp_container -it \
-v $DATA_ROOT:/mnt/dat \
--shm-size 8G -p 8888:8888 vlp /bin/bash
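Once inside the container, a quick sanity check that the GPUs are visible (assuming the NVIDIA driver and nvidia-container-toolkit are correctly installed on the host) is:
# run inside the container; should list all host GPUs
nvidia-smi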
You can learn more about Docker commands and usage here.
(Optional) To build the image on your own:
docker build -t vlp .
Data Preparation
Download links for dataset annotations and features: COCO Captions + VQA 2.0 (Part I (95GB), Part II (79GB); download both and run cat COCO0* > COCO.tar.gz), Flickr30k Captions (27GB). If you prefer to download with wget, we attach the commands in the Misc section below. Then, uncompress the downloaded files and place them under your data root (denoted as DATA_ROOT).
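As a concrete sketch (our own, assuming the parts were downloaded into the current directory and that each archive unpacks into its own top-level directory), the reassembly and extraction steps could look like:
# reassemble the two COCO parts into a single archive
cat COCO0* > COCO.tar.gz
# uncompress everything under the data root
mkdir -p $DATA_ROOT
tar xzf COCO.tar.gz -C $DATA_ROOT
tar xzf flickr30k.tar.gz -C $DATA_ROOT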
To prepare for pre-training, first download and uncompress our pre-processed Conceptual Captions (CC) data (6GB) and place it under your data root. Then, download and uncompress the region features from Google Drive (feat (509GB), cls (468GB)) under the CC/region_feat_gvd_wo_bgd/feat_cls_1000_float16 dir. To evaluate CC on caption generation, download the reference file and place it under coco-caption/annotations.
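After extraction, the commands in the rest of this README expect a layout under $DATA_ROOT roughly like the following (a sketch inferred from the paths used below, not an exhaustive listing):
$DATA_ROOT/
  COCO/
    annotations/                  # dataset_coco.json, coco_valid_jpgs.json, ...
    region_feat_gvd_wo_bgd/       # region features for COCO/VQA 2.0
  flickr30k/
    annotations/                  # dataset_flickr30k.json, flickr30k_valid_jpgs.json
    region_feat_gvd_wo_bgd/       # flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5
  CC/
    annotations/                  # dataset_cc.json, cc_valid_jpgs.json
    region_feat_gvd_wo_bgd/
      bbox/                       # cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5
      feat_cls_1000_float16/      # region features (feat + cls) from Google Drive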
In addition, download and uncompress the Detectron fc7 weight files under the code root directory (denoted as CODE_ROOT): GVD Detectron fc7.
(Optional, only for VQA) Download the VQA 2.0 annotation (based on Pythia):
cd $CODE_ROOT/pythia
mkdir -p data && cd data
wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
tar xf vocab.tar.gz && rm vocab.tar.gz
wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
unzip v2_Annotations_Val_mscoco.zip && rm v2_Annotations_Val_mscoco.zip
mkdir -p imdb && cd imdb
wget https://dl.fbaipublicfiles.com/pythia/data/imdb/vqa.tar.gz
tar xf vqa.tar.gz && rm vqa.tar.gz
(Optional, only for pre-training) Download the UniLM checkpoints and uncompress them under your checkpoint root (denoted as CHECKPOINT_ROOT).
Experiment Overview
Most of the experiments in this work are performed on 8x V100 GPUs with distributed data parallel (i.e., set --world_size to 8, and set --local_rank and --global_rank from 0 to 7 across 8 separate scripts), unless specified otherwise. See below for detailed configurations (also in the Appendix of the paper).
Dataset | Batch Size | Learning Rate | # of Epochs | GPUs | Time per Epoch |
---|---|---|---|---|---|
CC | 64(x8) | 1e-4(x8) | 30 | 8x V100 | 5hr |
COCO | 64(x8) | 3e-5(x8) | 30 | 8x V100 | 12min |
VQA 2.0 | 64(x2) | 2e-5(x2) | 20 | 2x V100 | 32min |
Flickr30k | 64(x8) | 3e-5(x8) | 30 | 8x V100 | 3min |
COCO (w/o pre-training) | 64(x8) | 3e-4(x8) | 30 | 8x V100 | 12min |
COCO (SCST training) | 16(x4) | 1e-6(x4) | 30 | 4x Titan Xp | 3hr |
The (x2), (x4), (x8) in the batch size and learning rate columns come from distributed data parallel: gradients are accumulated/summed across GPUs, so, e.g., 64(x8) corresponds to an effective batch size of 512.
Note that some modules need to be imported manually:
export PYTHONPATH=$CODE_ROOT/pythia:$CODE_ROOT/pythia/pythia/legacy:$CODE_ROOT:$PYTHONPATH
Pre-training
An example command for single-GPU training:
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
--model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
--do_train --learning_rate ${lr} --new_segment_ids --always_truncate_tail --amp \
--src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
--dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
--local_rank -1 --global_rank -1 --world_size 1 --enable_butd \
--s2s_prob ${w_s} --bi_prob ${w_b} --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
--region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
--region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval
where lr=1e-4, w_s=0.75, w_b=0.25, and checkpoint_cc is the ID of the checkpoint. The pre-trained models are available here.
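To reproduce the 8x V100 configuration from the table above, one process must be launched per GPU, each with its own --local_rank and --global_rank and with --world_size 8. A minimal launch sketch (our own illustration; it mirrors the 2-GPU fine-tuning example shown below and assumes all 8 GPUs sit on a single node):
for rank in $(seq 0 7); do
  python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
    --model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
    --do_train --learning_rate 1e-4 --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
    --dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
    --enable_butd --s2s_prob 0.75 --bi_prob 0.25 \
    --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
    --region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
    --region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval \
    --local_rank $rank --global_rank $rank --world_size 8 &  # one background process per GPU
done
wait  # block until all 8 workers finish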
Fine-tuning
The fine-tuning checkpoints are available at: COCO (CE optim), COCO (CIDEr optim), VQA 2.0 (train on train set only), Flickr30k.
COCO Captions
An example command for single-GPU training:
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
--do_train --new_segment_ids --always_truncate_tail --amp \
--src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
--image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0
(Optional) To enable Self-Critical Sequence Training (SCST), set --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin, --max_pred 0, --mask_prob 0, --scst, --learning_rate 1e-6 (note that SCST requires a much smaller learning rate than the default 3e-5), and adjust --output_dir accordingly. The training takes 30 epochs to converge, with each epoch taking roughly 3hr.
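Putting these flags together, an SCST run could look like the following (a sketch assembled from the flags above; ${checkpoint_coco_scst} is a hypothetical name for the new output directory):
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_scst} \
  --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin \
  --do_train --new_segment_ids --always_truncate_tail --amp \
  --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
  --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
  --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
  --scst --max_pred 0 --mask_prob 0 --learning_rate 1e-6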
An example of 2-GPU training with distributed data parallel (the two commands differ only in their rank arguments):
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
--do_train --new_segment_ids --always_truncate_tail --amp \
--src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
--image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
--local_rank 0 --global_rank 0 --world_size 2 &
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
--do_train --new_segment_ids --always_truncate_tail --amp \
--src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
--image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
--local_rank 1 --global_rank 1 --world_size 2
VQA 2.0
An example command for single-GPU training:
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_vqa2} \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
--do_train --learning_rate 2e-5 --new_segment_ids --always_truncate_tail --amp \
--num_train_epochs 20 --enable_butd --s2s_prob 0 --bi_prob 1 \
--image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd \
--tasks vqa2 --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_train2014.npy \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
--mask_prob 0 --max_pred 1
To get the models for the leaderboard, we perform training on both the train set and the val set (set src_file to imdb_train2014 and imdb_val2014).
Flickr30k Captions
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_flickr30k} \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
--do_train --new_segment_ids --always_truncate_tail --amp \
--image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
--dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
--src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
--file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json
Inference and Testing
Here we list the expected results from our Unified VLP checkpoints. For image captioning, on Karpathy's test split:
Dataset | Method | BLEU@4 | METEOR | CIDEr | SPICE |
---|---|---|---|---|---|
COCO | Unified VLP | 36.5 | 28.4 | 116.9 | 21.2 |
COCO | Unified VLP + SCST | 39.5 | 29.3 | 129.3 | 23.2 |
Flickr30k | Unified VLP | 30.1 | 23.0 | 67.4 | 17.0 |
For VQA:
Dataset | Trained on | Eval Split | Overall | Yes/No | Number | Other |
---|---|---|---|---|---|---|
VQA 2.0 | train only | Dev | 67.4 | 85.4 | 50.1 | 58.3 |
VQA 2.0 | train+val | Test-Dev | 70.5 | 87.2 | 52.1 | 60.3 |
VQA 2.0 | train+val | Test-Standard | 70.7 | 87.4 | 52.1 | 60.5 |
Note that results on Test-Dev and Test-Standard are from the VQA 2.0 evaluation server. train+val indicates that models are trained on both the training set and the validation set, following the practice of earlier works.
Note: all the evaluation scripts support data parallel. However, since we do not use the standard PyTorch DataLoader, data loading speed might be the bottleneck (imagine num_workers always being 0). We recommend performing single-GPU inference (e.g., CUDA_VISIBLE_DEVICES=0).
COCO Captions
python vlp/decode_img2txt.py \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.${epoch}.bin \
--new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
--image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ --split ${split} \
--src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json
where checkpoint_coco_ce indicates the checkpoint name, beam=1 for the val split and 5 for the test split, and epoch indicates the epoch of the checkpoint to evaluate.
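For example, to decode the test split with the released CE checkpoint, the variables could be set as follows (illustrative values; the directory name assumes the downloaded archive extracts to a folder of the same name, and the epoch should match the checkpoint you actually want to evaluate):
checkpoint_coco_ce=coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25
split=test
beam=5
epoch=30   # hypothetical choice; pick the epoch of the checkpoint to evaluate
# then run the decoding command above, optionally prefixed with CUDA_VISIBLE_DEVICES=0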
VQA 2.0
python vlp/eval_vqa2.py \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_vqa2}/model.${epoch}.bin \
--new_segment_ids --enable_butd --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ \
--src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_${split}.npy --batch_size 50 \
--file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json --split ${split}
where split can be val2014 or test2015.
Flickr30k Captions
python vlp/decode_img2txt.py \
--model_recover_path $CHECKPOINT_ROOT/${checkpoint_flickr30k}/model.${epoch}.bin \
--new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
--image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/ --split ${split} \
--dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
--src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
--file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json
where beam=1 for the val split and 5 for the test split, and epoch indicates the epoch of the checkpoint to evaluate.
Testing
For all datasets, the checkpoint (by epoch) with the best validation performance (CIDEr for captioning, overall accuracy for VQA) is evaluated on the test set (Test-Dev and Test-Standard for VQA 2.0).
Misc
The Detectron-based feature extraction code is available under this repo. You need to download the config file and checkpoint file (see the Detectron entries in the download list below).
List of download commands (only for OneDrive):
wget -O caption_cc_val.json "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212017&authkey=AHy5eiJM75RwPxg"
# data
wget -O COCO00 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212019&authkey=ACn4bwZ0nmZ0nik"
wget -O COCO01 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212018&authkey=AHoTGG-7-6kwoAY"
wget -O flickr30k.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212015&authkey=AFZ2iehPM8HREeA"
wget -O CC.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%213781&authkey=ANA--esfJnWIKIE"
# UniLM checkpoint
wget -O bert_save.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212016&authkey=AB5-lxzCkgpfLhg"
# pre-training checkpoints
wget -O cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212026&authkey=AH98pIVaNS4apSI"
# fine-tuning checkpoints
wget -O coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212028&authkey=AEjQxFF1FcBK-Aw"
wget -O coco_g4_lr1e-6_batch64_scst.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212027&authkey=ACM1UXlFxgfWyt0"
wget -O vqa2_g2_lr2e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212029&authkey=APjfGJd1-nzDO7s"
wget -O flickr30k_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212030&authkey=AGmfQ0fXcYCQun0"
# Detectron config/model
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.yaml "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212013&authkey=AHIvnE1FcggwiLU"
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.pkl "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212014&authkey=AAHgqN3Y-LXcBvU"
Reference
Please acknowledge the following paper if you use the code:
@article{zhou2019vlp,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Luowei Zhou and Hamid Palangi and Lei Zhang and Houdong Hu and Jason J. Corso and Jianfeng Gao},
  journal={arXiv preprint arXiv:1909.11059},
  year={2019}
}
Related Projects/Codebase
- Pre-trained UniLM: https://github.com/microsoft/unilm
- GVD (captioning+grounding): https://github.com/facebookresearch/grounded-video-description
- Video DenseCap: https://github.com/salesforce/densecap
- MT-DNN: https://github.com/namisan/mt-dnn
Acknowledgement
Our code is mainly based on Li Dong et al.'s UniLM repo. Also, a part of the code is based on pytorch-transformers v0.4.0 and ImageCaptioning.pytorch. We thank the authors for their wonderful open-source efforts.
License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the UniLM project and pytorch-transformers v0.4.0 project.