Vision-Language Pre-training for Image Captioning and Question Answering

Overview

This repo hosts the source code for our AAAI 2020 work Vision-Language Pre-training (VLP). We have released the model pre-trained on the Conceptual Captions dataset, as well as models fine-tuned on COCO Captions and Flickr30k for image captioning and on VQA 2.0 for VQA.

Installation

Conda Environment (Option I, Recommended)

  1. Clone the repo recursively (over SSH) to include the coco and pythia submodules:
git clone --recursive git@github.com:LuoweiZhou/VLP.git

or clone with https:

git clone --recursive https://github.com/LuoweiZhou/VLP.git
  2. Install CUDA (e.g., 10.0), cuDNN (e.g., v7.5), and Miniconda (either Miniconda2 or Miniconda3, version 4.6+).

  3. Run the following commands to set up the conda environment and install the Python packages:

MINICONDA_ROOT=[your Miniconda root directory] # e.g., /home/[username]/miniconda3
cd VLP
conda env create -f misc/vlp.yml --prefix $MINICONDA_ROOT/envs/vlp
conda activate vlp
  4. Finally, cd to the repo root directory and install the remaining dependencies:
./setup.sh

To support language evaluation (SPICE), run

cd coco-caption
./get_stanford_models.sh
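
(Optional) As a quick sanity check of the environment (a minimal sketch; it only assumes that PyTorch with CUDA support was installed by the steps above):

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"  # expect True on a GPU machine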

Docker Image (Option II)

First, install or upgrade to the latest Docker (e.g., set <VERSION_STRING> to 5:19.03.2~3-0~ubuntu-xenial). Then pull our Docker image:

docker pull luzhou/vlp

Before running the container, set the environment variable $DATA_ROOT to your data root (see Data Preparation); it will be mounted as a volume in the container. Then install nvidia-container-toolkit and run the Docker image in a fresh container:
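
For example (a sketch; substitute your own path):

export DATA_ROOT=/path/to/vlp_data   # hypothetical location of the uncompressed data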

docker run --gpus all --name vlp_container -it \
     -v $DATA_ROOT:/mnt/dat \
     --shm-size 8G -p 8888:8888 vlp /bin/bash

You can learn more about Docker commands and usage here.

(Optional) To build the image yourself:

docker build -t vlp .

Data Preparation

Download links for dataset annotations and features: COCO Captions + VQA 2.0 (Part I (95GB), Part II (79GB); download both and run cat COCO0* > COCO.tar.gz), Flickr30k Captions (27GB). If you prefer to download with wget, we attach the commands here. Then, uncompress the downloaded files and place them under your data root (denoted as DATA_ROOT).
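
A minimal sketch of the post-download steps, assuming the archives were saved directly under $DATA_ROOT (the exact extracted directory names follow the archive contents):

cd $DATA_ROOT
cat COCO0* > COCO.tar.gz      # merge the two COCO parts
tar xzf COCO.tar.gz           # expected to extract to $DATA_ROOT/COCO
tar xzf flickr30k.tar.gz      # expected to extract to $DATA_ROOT/flickr30k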

To prepare for pre-training, first download and uncompress our pre-processed Conceptual Captions (CC) data (6GB) and place it under your data root. Then, download and uncompress the region features from Google Drive (feat (509GB), cls (468GB)) under the CC/region_feat_gvd_wo_bgd/feat_cls_1000_float16 directory. To evaluate CC on caption generation, download the reference file and place it under coco-caption/annotations.
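
For reference, the pre-training commands below assume a CC layout roughly like the following (a sketch; the file names come from the command-line arguments in the Pre-training section):

$DATA_ROOT/CC/
    annotations/dataset_cc.json
    annotations/cc_valid_jpgs.json
    region_feat_gvd_wo_bgd/
        bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5
        feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval*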

In addition, download and uncompress the Detectron fc7 weight files (GVD Detectron fc7) under the code root directory (denoted as CODE_ROOT).

(Optional, only for VQA) Download the VQA 2.0 annotation (based on Pythia):

cd $CODE_ROOT/pythia
mkdir -p data && cd data
wget http://dl.fbaipublicfiles.com/pythia/data/vocab.tar.gz
tar xf vocab.tar.gz && rm vocab.tar.gz

wget https://s3.amazonaws.com/cvmlp/vqa/mscoco/vqa/v2_Annotations_Val_mscoco.zip
unzip v2_Annotations_Val_mscoco.zip && rm v2_Annotations_Val_mscoco.zip

mkdir -p imdb && cd imdb
wget https://dl.fbaipublicfiles.com/pythia/data/imdb/vqa.tar.gz
tar xf vqa.tar.gz && rm vqa.tar.gz

(Optional, only for pre-training) Download the UniLM checkpoints and uncompress them under your checkpoint root (denoted as CHECKPOINT_ROOT).
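
For example (a sketch; bert_save.tar.gz is the UniLM archive from the download list in Misc, and the checkpoint path is hypothetical):

export CHECKPOINT_ROOT=/path/to/checkpoints    # hypothetical location
tar xzf bert_save.tar.gz -C $CHECKPOINT_ROOT   # expected to yield $CHECKPOINT_ROOT/bert_save/...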

Experiment Overview

Most of the experiments in this work are performed on 8x V100 GPUs with distributed data parallel (i.e., set --world_size to 8 and --local_rank/--global_rank from 0 to 7 across 8 separate processes), unless specified otherwise. See below for the detailed configurations (also in the Appendix of the paper).

Dataset                    Batch Size   Learning Rate   # of Epochs   GPUs          Time per Epoch
CC                         64 (x8)      1e-4 (x8)       30            8x V100       5 hr
COCO                       64 (x8)      3e-5 (x8)       30            8x V100       12 min
VQA 2.0                    64 (x2)      2e-5 (x2)       20            2x V100       32 min
Flickr30k                  64 (x8)      3e-5 (x8)       30            8x V100       3 min
COCO (w/o pre-training)    64 (x8)      3e-4 (x8)       30            8x V100       12 min
COCO (SCST training)       16 (x4)      1e-6 (x4)       30            4x Titan Xp   3 hr

The (x2), (x4), and (x8) factors in the batch size and learning rate come from distributed data parallel: gradients are accumulated (summed) across GPUs, so, e.g., 64 (x8) corresponds to an effective batch size of 512.

Note that some modules need to be added to PYTHONPATH manually:

export PYTHONPATH=$CODE_ROOT/pythia:$CODE_ROOT/pythia/pythia/legacy:$CODE_ROOT:$PYTHONPATH

Pre-training

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
    --model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
    --do_train --learning_rate ${lr} --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
    --dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
    --local_rank -1 --global_rank -1 --world_size 1 --enable_butd \
    --s2s_prob ${w_s} --bi_prob ${w_b} --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
    --region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
    --region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval

where lr=1e-4, w_s=0.75, w_b=0.25, and checkpoint_cc is the id of the checkpoint. The pre-trained models are available here.
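
To scale this to 8 GPUs with distributed data parallel, one can launch 8 workers that differ only in their ranks. A sketch (COMMON_ARGS is a hypothetical shell variable holding the arguments shared with the single-GPU command above, with the rank flags removed):

COMMON_ARGS="--output_dir $CHECKPOINT_ROOT/${checkpoint_cc} \
    --model_recover_path $CHECKPOINT_ROOT/bert_save/base_model_pretrained/model_153999_cpu.bin \
    --do_train --learning_rate 1e-4 --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/CC/annotations/dataset_cc.json \
    --dataset cc --split train --file_valid_jpgs $DATA_ROOT/CC/annotations/cc_valid_jpgs.json \
    --enable_butd --s2s_prob 0.75 --bi_prob 0.25 \
    --image_root $DATA_ROOT/CC/region_feat_gvd_wo_bgd \
    --region_bbox_file bbox/cc_detection_vg_thresh0.2_feat_gvd_checkpoint_trainval.h5 \
    --region_det_file_prefix feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval"
for rank in $(seq 0 7); do    # one worker per GPU, ranks 0..7
    python vlp/run_img2txt_dist.py $COMMON_ARGS \
        --local_rank $rank --global_rank $rank --world_size 8 &
done
wait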

Fine-tuning

The fine-tuning checkpoints are available at: COCO (CE optim), COCO (CIDEr optim), VQA 2.0 (train on train set only), Flickr30k.

COCO Captions

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0

(Optional) To enable Self-Critical Sequence Training (SCST), set --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin, --max_pred 0, --mask_prob 0, --scst, --learning_rate 1e-6 (note that SCST requires a much smaller learning rate than the default 3e-5), and --output_dir accordingly; a possible invocation is sketched below. The training takes 30 epochs to converge, with each epoch taking roughly 3 hr.
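
A sketch of the resulting SCST command (${checkpoint_coco_scst} is a hypothetical output id; the data paths follow the CE command above):

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_scst} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.28.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --max_pred 0 --mask_prob 0 --scst --learning_rate 1e-6 \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0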

An example of 2-GPU training with distributed data parallel:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 0 --global_rank 0 --world_size 2 &
python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_coco_ce} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --local_rank 1 --global_rank 1 --world_size 2

VQA 2.0

An example command for single-GPU training:

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_vqa2} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --learning_rate 2e-5 --new_segment_ids --always_truncate_tail --amp \
    --num_train_epochs 20 --enable_butd --s2s_prob 0 --bi_prob 1 \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd \
    --tasks vqa2 --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_train2014.npy \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json \
    --mask_prob 0 --max_pred 1

To get the models for the leaderboard, we train on both the train and val sets (set --src_file to imdb_train2014 and imdb_val2014).

Flickr30k Captions

python vlp/run_img2txt_dist.py --output_dir $CHECKPOINT_ROOT/${checkpoint_flickr30k} \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_cc}/model.30.bin \
    --do_train --new_segment_ids --always_truncate_tail --amp \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd --enable_butd --s2s_prob 1 --bi_prob 0 \
    --dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

Inference and Testing

Here, we list the expected results from our Unified VLP checkpoints. For image captioning, on Karpathy's test split:

Dataset    Method               BLEU@4   METEOR   CIDEr   SPICE
COCO       Unified VLP          36.5     28.4     116.9   21.2
COCO       Unified VLP + SCST   39.5     29.3     129.3   23.2
Flickr30k  Unified VLP          30.1     23.0     67.4    17.0

For VQA:

Dataset   Trained on   Eval Split      Overall   Yes/No   Number   Other
VQA 2.0   train only   Dev             67.4      85.4     50.1     58.3
VQA 2.0   train+val    Test-Dev        70.5      87.2     52.1     60.3
VQA 2.0   train+val    Test-Standard   70.7      87.4     52.1     60.5

Note that the results on Test-Dev and Test-Standard are from the VQA 2.0 evaluation server. train+val indicates models trained on both the training and validation sets, following the practice of earlier works.

Note: all the evaluation scripts support data parallel. However, since we do not use the standard PyTorch DataLoader, data loading speed might be the bottleneck (think of num_workers always being 0). We recommend performing single-GPU inference (e.g., CUDA_VISIBLE_DEVICES=0).

COCO Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ --split ${split} \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json

where checkpoint_coco_ce is the checkpoint name, beam is 1 for split=val and 5 for split=test, and epoch selects the checkpoint saved at that epoch.
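
For instance (a sketch), to decode the test split on a single GPU with beam size 5, using, e.g., the epoch-30 checkpoint (adjust to your best epoch):

CUDA_VISIBLE_DEVICES=0 python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_coco_ce}/model.30.bin \
    --new_segment_ids --batch_size 100 --beam_size 5 --enable_butd \
    --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ --split test \
    --src_file $DATA_ROOT/COCO/annotations/dataset_coco.json \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json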

VQA 2.0

python vlp/eval_vqa2.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_vqa2}/model.${epoch}.bin \
    --new_segment_ids --enable_butd --image_root $DATA_ROOT/COCO/region_feat_gvd_wo_bgd/ \
    --src_file $CODE_ROOT/pythia/data/imdb/vqa/imdb_${split}.npy --batch_size 50 \
    --file_valid_jpgs $DATA_ROOT/COCO/annotations/coco_valid_jpgs.json --split ${split}

where split could be val2014 or test2015.

Flickr30k Captions

python vlp/decode_img2txt.py \
    --model_recover_path $CHECKPOINT_ROOT/${checkpoint_flickr30k}/model.${epoch}.bin \
    --new_segment_ids --batch_size 100 --beam_size ${beam} --enable_butd \
    --image_root $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/ --split ${split} \
    --dataset flickr30k --region_bbox_file $DATA_ROOT/flickr30k/region_feat_gvd_wo_bgd/flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5 \
    --src_file $DATA_ROOT/flickr30k/annotations/dataset_flickr30k.json \
    --file_valid_jpgs $DATA_ROOT/flickr30k/annotations/flickr30k_valid_jpgs.json

where beam is 1 for split=val and 5 for split=test, and epoch selects the checkpoint saved at that epoch.

Testing

For all the datasets, the checkpoint (by epoch) with the best validation accuracy (CIDEr for captioning, overall accuracy for VQA) is evaluated on the test set (Test-Dev and Test-Standard for VQA 2.0).

Misc

The Detectron-based feature extraction code is available under this repo. You need to download this config file and checkpoint file.

List of download commands (only for OneDrive):

wget -O caption_cc_val.json "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212017&authkey=AHy5eiJM75RwPxg"

# data
wget -O COCO00 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212019&authkey=ACn4bwZ0nmZ0nik"
wget -O COCO01 "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212018&authkey=AHoTGG-7-6kwoAY"
wget -O flickr30k.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212015&authkey=AFZ2iehPM8HREeA"
wget -O CC.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%213781&authkey=ANA--esfJnWIKIE"

# UniLM checkpoint
wget -O bert_save.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212016&authkey=AB5-lxzCkgpfLhg"

# pre-training checkpoints
wget -O cc_g8_lr1e-4_batch512_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212026&authkey=AH98pIVaNS4apSI"

# fine-tuning checkpoints
wget -O coco_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212028&authkey=AEjQxFF1FcBK-Aw"
wget -O coco_g4_lr1e-6_batch64_scst.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212027&authkey=ACM1UXlFxgfWyt0"
wget -O vqa2_g2_lr2e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212029&authkey=APjfGJd1-nzDO7s"
wget -O flickr30k_g8_lr3e-5_batch512_ft_from_s0.75_b0.25.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212030&authkey=AGmfQ0fXcYCQun0"

# Detectron config/model
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.yaml "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212013&authkey=AHIvnE1FcggwiLU"
wget -O e2e_faster_rcnn_X-101-64x4d-FPN_2x-vlp.pkl "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212014&authkey=AAHgqN3Y-LXcBvU"

Reference

Please acknowledge the following paper if you use the code:

@article{zhou2019vlp,
  title={Unified Vision-Language Pre-Training for Image Captioning and VQA},
  author={Zhou, Luowei and Palangi, Hamid and Zhang, Lei and Hu, Houdong and Corso, Jason J. and Gao, Jianfeng},
  journal={arXiv preprint arXiv:1909.11059},
  year={2019}
}

Related Projects/Codebase

Acknowledgement

Our code is mainly based on Li Dong et al.'s UniLM repo. Also, a part of the code is based on pytorch-transformers v0.4.0 and ImageCaptioning.pytorch. We thank the authors for their wonderful open-source efforts.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree. Portions of the source code are based on the UniLM project and pytorch-transformers v0.4.0 project.

Comments
  • Data Download Problem

    Hi, thank you for your interesting work! I have a problem when trying to download the provided dataset annotations and features, since the OneDrive links cannot be accessed in China without a VPN, so it's difficult for me to prepare the data on my Ubuntu machine. Do you have any advice on how to solve this problem? Or could you please provide another download link for the MSCOCO data that is easily reachable from China? Thank you!

    opened by tjuwyh 29
  • What is GPU memory size of your V100? (ERROR: Unexpected bus error encountered in worker)

    Hi, I am trying to use one V100 GPU with 16GB memory to run the fine-tuning for the COCO image captioning task and always encounter the error "ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm)". What is the GPU memory of your V100, and what configurations (e.g., batch_size, num_workers) would you recommend for running the COCO captioning fine-tuning on my cluster with 4x V100 (16GB each)? Thanks!

    opened by yuewang-cuhk 10
  • Unable to reproduce image features for COCO and CC

    Hi Luowei --

    I'm unable to reproduce the image features that you've published here for COCO and CC. I've trained and evaluated the model using your provided features as well as my extracted features, on the VQA2 task (VQA2 uses COCO images). There is still an outstanding gap in performance. While you report 67.4, I can only achieve 64.3. This is a significant 3-point gap. I am wondering if others have encountered similar problem and how they have resolved it?

    I've extracted my own features using the script you shared with me privately (slightly modified to resolve dependency issues). Using the housebw/detectron image and your provided Detectron checkpoint .pkl and config .yaml, I generate different features than yours. Comparing image by image, I have different values in the tensors/matrices, and I also get different aggregate statistics (min, max, mean, variance) for the features. The situation is the same for CC. I've also confirmed it is not a precision issue (float16 vs. float32).

    As it stands, I cannot replicate your results despite my best efforts to follow all your provided documentation, using the same environment, code, data dependencies, and source data.

    I am attempting to use your SOTA model on a new dataset/task. Not being able to replicate your results is an impediment...

    Thanks, Shawn

    opened by darkmatter08 8
  • Detectron feature extraction

    Would it be possible to release the script for extracting features with the Detectron model? Probably something similar to https://github.com/ruotianluo/ImageCaptioning.pytorch/blob/master/scripts/prepro_feats.py, which was used to generate the region_feat_gvd_wo_bgd/trainval .npz features and class probabilities?

    opened by leotam 8
  • Require a quick start for simple usage...

    Hi, I just want to test the captioning results on some raw images. I have read vlp/decode_img2txt.py, but the settings are a bit complicated for me, for example, the standard size of an input image.

    So it would be very kind of you if a simple usage could be provided.

    I really appreciate any help you can provide.

    opened by wubowen416 7
  • When will you release the visual feature detectron code?

    Hi Luowei, recently I have been trying to extract Detectron visual features following your guideline here, but I still cannot replicate the features. I use the recommended docker image and refer to the preliminary scripts you sent me by email. However, your script extract_features_luowei.py is incompatible with the detectron model in housebw's repo. For example, the im_detect_bbox method should be imported from core.test instead of core.test_engine.

    I also found several other incompatibility issues when trying to run your code inside the provided docker environment. Even after fixing some of them and running the code successfully, I encountered the following error:

    WARNING cnn.py:  40: [====DEPRECATE WARNING====]: you are creating an object from CNNModelHelper class which will be deprecated soon. Please use ModelHelper object with brew module. For more information, please refer to caffe2.ai and python/brew.py, python/brew_test.py for more information.
    INFO net.py:  59: Loading weights from: /export/home/vlp_data/e2e_faster_rcnn_X-101-64x4d-FPN_2x.pkl
    I1029 08:52:12.747747  2485 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000186569 secs
    I1029 08:52:12.747993  2485 net_dag.cc:61] Number of parallel execution chains 36 Number of operators = 371
    I1029 08:52:12.770298  2485 net_dag_utils.cc:118] Operator graph pruning prior to chain compute took: 0.000165061 secs
    I1029 08:52:12.770571  2485 net_dag.cc:61] Number of parallel execution chains 30 Number of operators = 358
    /export/home/vlp_data/coco_raw/coco_tiny/COCO_test2014_000000000001.jpg
    terminate called after throwing an instance of 'caffe2::EnforceNotMet'
      what():  [enforce fail at blob.h:94] IsType<T>(). wrong type for the Blob instance. Blob contains nullptr (uninitialized) while caller expects caffe2::Tensor<caffe2::CUDAContext> .
    Offending Blob name: gpu_0/rois_fpn2.
    Error from operator: 
    input: "gpu_0/fpn_res2_2_sum" input: "gpu_0/rois_fpn2" output: "gpu_0/roi_feat_fpn2" name: "" type: "RoIAlign" arg { name: "pooled_h" i: 7 } arg { name: "sampling_ratio" i: 2 } arg { name: "spatial_scale" f: 0.25 } arg { name: "pooled_w" i: 7 } device_option { device_type: 1 cuda_gpu_id: 0 }
    *** Aborted at 1572339134 (unix time) try "date -d @1572339134" if you are using GNU date ***
    terminate called recursively
    terminate called recursively
    PC: @     0x7fe85f011428 gsignal
    *** SIGABRT (@0x9b5) received by PID 2485 (TID 0x7fe7c0ffd700) from PID 2485; stack trace: ***
        @     0x7fe85f3b7390 (unknown)
        @     0x7fe85f011428 gsignal
        @     0x7fe85f01302a abort
        @     0x7fe8590bf84d __gnu_cxx::__verbose_terminate_handler()
    terminate called recursively
        @     0x7fe8590bd6b6 (unknown)
        @     0x7fe8590bd701 std::terminate()
        @     0x7fe8590e8d38 (unknown)
        @     0x7fe85f3ad6ba start_thread
        @     0x7fe85f0e341d clone
        @                0x0 (unknown)
    Aborted (core dumped)
    

    Due to these incompatibility issues, I find it pretty difficult to extract the same visual features as yours. But if we use other detection codebases like detectron2 or mmdetection, we cannot use your pre-trained models. Therefore, I would like to ask when you can fully release your Detectron code (code & Python environment), which will be extremely helpful for those, like me, planning to apply your VLP model to their own datasets. Looking forward to your reply :)

    opened by yuewang-cuhk 7
  • Multiple GPUs Support

    The provided fine-tuning scripts fail on a multi-GPU machine:

    Traceback (most recent call last):
      File "run_img2txt_dist.py", line 621, in <module>
        main()
      File "run_img2txt_dist.py", line 546, in main
        iter_bar.set_description('Iter (loss=%5.3f)' % loss.item())
    ValueError: only one element tensors can be converted to Python scalars

    What is the recommended way to run parallel training?
    Thanks :)

    opened by idansc 6
  • Getting `No module named 'apex.optimizers'` error

    Hello, Thanks for your work.

    Currently, I am trying to run inference on Flickr features. I have installed apex as per the instructions in setup.sh, using the same apex commit (1603407bf49c7fc3da74fceb6a6c7b47fece2ef8) as mentioned in setup.sh. My PyTorch version is 1.6.0+cu101, which is different from the one mentioned in the misc/vlp.yml file. While installing apex I get the following error message: error: command 'gcc' failed with exit status 1. When I try running the Flickr inference code I get the error below:

    Traceback (most recent call last):
      File "vlp/decode_img2txt.py", line 19, in <module>
        from pytorch_pretrained_bert.tokenization import BertTokenizer, WhitespaceTokenizer
      File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/__init__.py", line 6, in <module>
        from .optimization_fp16 import FP16_Optimizer_State
      File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/optimization_fp16.py", line 4, in <module>
        from apex.optimizers import FP16_Optimizer
    ModuleNotFoundError: No module named 'apex.optimizers'
    

    I tried installing the latest apex version as per the instructions here. I get the Successfully installed apex-0.1 message, but when I run the inference code I get the error below.

    Traceback (most recent call last):
      File "vlp/decode_img2txt.py", line 19, in <module>
        from pytorch_pretrained_bert.tokenization import BertTokenizer, WhitespaceTokenizer
      File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/__init__.py", line 6, in <module>
        from .optimization_fp16 import FP16_Optimizer_State
      File "/home/default/ephemeral_drive/work/image_captioning/VLP/pytorch_pretrained_bert/optimization_fp16.py", line 4, in <module>
        from apex.optimizers import FP16_Optimizer
    ImportError: cannot import name 'FP16_Optimizer'
    

    It seems the optimizers are different in the latest apex commit. Would you recommend replacing the optimizer in your code with one of the current ones, or do you have any other suggestion for resolving the issue? I am not able to use the conda environment you mentioned as a requirement, since I am working on a controlled-access machine and don't have the liberty to do all the installations.

    opened by gsrivas4 5
  • Chinese image caption, In the result, multiple words of the same type appear

    Hello, I am using the COCO dataset with a two-layer LSTM model: one layer for top-down attention and one layer for the language model.

    Extracting words with jieba, I used all the words in the image descriptions that occurred more than 3 times as the dictionary, 14,226 words in total: words = [w for w in word_freq.keys() if word_freq[w] > 3]

    After training the model, when using it, multiple words of the same type appear in the result, such as:

    Note notebook laptop computer on bed A little girl little girl girl standing together

    How can I solve this problem?

    opened by cylvzj 5
  • UniLM checkpoint is no longer available

    UniLM checkpoint

    wget -O bert_save.tar.gz "https://onedrive.live.com/download?cid=E5364FD183A1F5BB&resid=E5364FD183A1F5BB%212016&authkey=AB5-lxzCkgpfLhg"

    The above link no longer works. Thanks.

    opened by GabrielLin 5
  • File Not Found:feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval

    Thanks a lot for sharing this useful repo. We are trying to reproduce the fine-tuning result on Flickr30k, but an error occurs which says "feat_cls_1000_float16/cc_detection_vg_100dets_gvd_checkpoint_trainval not found". I found this error comes from this line of code: parser.add_argument('--region_det_file_prefix', default='feat_cls_1000/coco_detection_vg_100dets_gvd_checkpoint_trainval', type=str). So where can I get access to "coco_detection_vg_100dets_gvd_checkpoint_trainval"? In my feat_cls_1000 folder, there are only "flickr30k_detection_vg_thresh0.2_feat_gvd_checkpoint_trainvaltest.h5" and "trainval".

    opened by zhao1iang 4
  • Change N object of feature extractor and number of attention head

    I want to change the number of objects from the feature extractor (e.g., from 100 to 150) and the number of attention heads (e.g., from 12 to 16, instead of changing BERT base to BERT large). Could you please tell me where to change the code?

    opened by khiemledev 0
  • the JSON file of dataset is in need

    We intend to use our own dataset to pre-train your model, but we don't know how to structure our data to fit into your model. In addition, the COCO dataset is too large and we don't have enough space, so we cannot inspect the JSON file in which you organize your training data. Could you please share your JSON file so that we can see how to use our own data? Thanks a lot!

    opened by Hepta-Col 0
  • Other Pretrained Models

    Hi @LuoweiZhou , is it convenient to provide other pre-trained checkpoints, such as cc_g8_lr1e-4_batch512_s0.25_b0.75.tar.gz or cc_g8_lr1e-4_batch512_s0_b1.tar.gz ? Many thanks.

    opened by jingjingdd 2
  • Bus error (core dump) during training

    Hi, thanks for sharing this project.

    I prepared all the required h5py files and caption annotation files for COCO Caption fine-tuning as instructed in the README. The training ran normally at the beginning, but got killed (bus error (core dumped)) after around 70k~100k iterations.

    I wonder if it was an out-of-memory issue caused by data loading. It seemed that a huge amount of memory was progressively consumed by the program, perhaps due to reading more and more image features from the h5py files. Using del or gc.collect() didn't help free the memory of unreferenced objects.

    Is there any good way to save memory for the multimodal training, or any idea what was going on in my case? Thanks a lot!

    opened by ChenYutongTHU 1
  • Train model when I have image only without any bbox info

    Hi, thanks for your great work! I want to know whether the model can be trained without regions. In other words, I have only captions and images, without any bbox info; how can I make the model work? Thank you so much!

    opened by Decalogue 1