PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Salesforce

Last update: Dec 31, 2022

Related tags

Deep Learning image-captioning visual-reasoning visual-question-answering vision-language vision-language-transformer image-text-retrieval vision-and-language-pre-training

Overview

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch code of the BLIP paper. The code has been tested on PyTorch 1.10. To install the dependencies, run:

pip install -r requirements.txt

Catalog:

Inference demo
Pre-trained and finetuned checkpoints
Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
Pre-training code
Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo using Colab notebook (no GPU needed). The demo includes code for: (1) image captioning, (2) open-ended visual question answering, (3) multimodal / unimodal feature extraction.

Integrated into Huggingface Spaces 🤗 using Gradio. Try out the Web Demo

Pre-trained checkpoints:

Num. pre-train images	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
14M	Download	-	-
129M	Download	Download	Download

Finetuned checkpoints:

Task	BLIP w/ ViT-B	BLIP w/ ViT-B and CapFilt-L	BLIP w/ ViT-L
Image-Text Retrieval (COCO)	Download	-	Download
Image-Text Retrieval (Flickr30k)	Download	-	Download
Image Captioning (COCO)	-	Download	Download
VQA	Download	Download	-
NLVR2	Download	-	-

Image-Text Retrieval:

Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco

Image-Text Captioning:

Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
To evaluate the finetuned BLIP model on COCO, run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate

To evaluate the finetuned BLIP model on NoCaps, generate results with the command below (the evaluation itself needs to be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py

To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=8 train_caption.py

VQA:

Download the VQA v2 and Visual Genome datasets from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
To evaluate the finetuned BLIP model, generate results with the command below (the evaluation itself needs to be performed on the official server):

python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_vqa.py

NLVR2:

Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
To evaluate the finetuned BLIP model, run:

python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate

To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:

python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py

Pre-train:

Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image}.
In configs/pretrain.yaml, set 'train_file' as the paths for the json files.
Pre-train the model using 8 A100 GPUs:

python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain
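
The annotation format described above can be produced with a few lines of Python. The sketch below is only an illustration: the helper name write_pretrain_annotations, the image paths, and the captions are made up for the example, and the file is written to a temporary directory.

```python
import json
import os
import tempfile


def write_pretrain_annotations(pairs, json_path):
    """Write (image_path, caption) pairs as a list of {'image', 'caption'} dicts."""
    records = [
        {"image": image_path, "caption": caption}
        for image_path, caption in pairs
    ]
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False)
    return records


# Hypothetical example data; replace with real image paths and captions.
example_pairs = [
    ("/data/images/0001.jpg", "a dog running on the beach"),
    ("/data/images/0002.jpg", "two people riding bicycles"),
]
example_path = os.path.join(tempfile.mkdtemp(), "pretrain_example.json")
written = write_pretrain_annotations(example_pairs, example_path)

# Read the file back to confirm it contains a list of two-key dictionaries.
with open(example_path, encoding="utf-8") as f:
    reloaded = json.load(f)
```

The path of a file written this way is what would be listed under 'train_file' in configs/pretrain.yaml.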

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.

Image source	Filtered web caption	Filtered synthetic caption	Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU	Download	Download	Download
LAION115M	Download	Download	Download
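
The downloadable files store image URLs, while the pre-training json format uses local image paths, so some conversion step is needed once the images have been fetched. The following is a minimal sketch of one way to do that, assuming each image is saved under a file name derived from an MD5 hash of its URL. That naming scheme, the helper names, and the sample record are assumptions for illustration and are not part of the BLIP code.

```python
import hashlib
import os
from urllib.parse import urlparse


def local_image_path(url, image_dir):
    """Map an image URL to a deterministic local path (hypothetical naming scheme)."""
    # Keep the original file extension if the URL has one, else assume .jpg.
    ext = os.path.splitext(urlparse(url).path)[1] or ".jpg"
    name = hashlib.md5(url.encode("utf-8")).hexdigest() + ext
    return os.path.join(image_dir, name)


def to_pretrain_records(url_records, image_dir):
    """Convert {'url', 'caption'} items into {'image', 'caption'} items."""
    return [
        {"image": local_image_path(r["url"], image_dir), "caption": r["caption"]}
        for r in url_records
    ]


# Placeholder record in the downloaded format; the URL is not a real dataset entry.
sample = [{"url": "https://example.com/images/cat.png", "caption": "a cat on a sofa"}]
converted = to_pretrain_records(sample, "images")
```

Downloading the images themselves is left out of the sketch; only the bookkeeping between the two json formats is shown.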

Citation

If you find this code to be useful for your research, please consider citing.

@misc{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      eprint={2201.12086},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.

Comments

Problem with .pth file download on Windows

First I want to thank you. Based on my playing over at huggingface this seems to be the best piece of software I have hit on for image captioning. I am trying to get it to run locally on Windows 10.

I keep winding up with the following error:

OSError: [Errno 22] Invalid argument: 'C:\Users\Matthew/.cache\torch\hub\checkpoints\model*_base_caption.pth'

It seems to mangle the path to the .pth file. I tried putting some print statements in to figure out what was going on, and it seems to be something around os.path.dirname, but I got lost. I tried a second tack and just downloaded all the .pth files and put them in the proper directory, but that does the same thing. I tried a third tack and changed the model URL to a file:// URL that points at the file, and that results in a "RuntimeError: checkpoint url or path is invalid" error.

Any help is appreciated!

opened by matthewkleinmann 10
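
A note on the error above: the file name in the message contains a '*' character, and Windows does not allow '*' in file names, so this is one possible explanation for the "Invalid argument" error. The thread itself does not confirm the cause. The sketch below shows how a cache file name could be derived from a URL with such characters replaced. The function name and the example.com URL are placeholders, and how a renamed file would be passed to the BLIP loader is not covered here.

```python
import re
from urllib.parse import urlparse

# Characters that Windows rejects in file names.
_WINDOWS_INVALID = r'[<>:"/\\|?*]'


def safe_cache_filename(url):
    """Take the last path component of a URL and replace characters Windows rejects."""
    name = urlparse(url).path.rsplit("/", 1)[-1]
    return re.sub(_WINDOWS_INVALID, "_", name)


# Placeholder URL that ends in the file name from the error message.
example_url = "https://example.com/models/model*_base_caption.pth"
safe_name = safe_cache_filename(example_url)
```
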
question about Bootstrap captions from noisy image-text pairs?

Are there any example scripts showing how to use the pre-trained checkpoint to bootstrap captions from noisy image-text pairs and obtain much cleaner training data?

opened by trouble-maker007 9

feature extraction on images only

I want to process a folder of images that I will use for comparing to an input text (which will be given at a different time). How do I use your colab to extract features from images and then at a later time, compare them to an input text? All the examples involve passing in an image and a text at the same time.

opened by nikky4D 8
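
A note on the workflow asked about above: one common pattern is to compute image features once, store them, and compare them with a text feature later. The sketch below illustrates only that pattern. It assumes the image and text features live in a shared embedding space where cosine similarity is meaningful, and the short vectors are placeholders standing in for real model outputs. The BLIP feature-extraction calls are not shown, and the function and file names are made up for the example.

```python
import json
import math
import os
import tempfile


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(text_feature, image_features):
    """Return image names sorted from most to least similar to the text feature."""
    scores = {
        name: cosine_similarity(text_feature, feat)
        for name, feat in image_features.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Step 1 (done once): store image features. The vectors here are placeholders.
image_features = {"beach.jpg": [0.9, 0.1, 0.0], "city.jpg": [0.0, 0.2, 0.9]}
cache_path = os.path.join(tempfile.mkdtemp(), "image_features.json")
with open(cache_path, "w", encoding="utf-8") as f:
    json.dump(image_features, f)

# Step 2 (later): load the cached features and compare them with a text feature.
with open(cache_path, encoding="utf-8") as f:
    cached = json.load(f)
text_feature = [1.0, 0.0, 0.1]  # placeholder for a real text embedding
ranking = rank_images(text_feature, cached)
```
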
Differences in output captions on Colab vs Spaces

Hello,

When I try the demo on Huggingface Spaces (https://huggingface.co/spaces/Salesforce/BLIP) with a sample image, the output captions are different every time I press the submit button with the same decoding strategy.

But if I try the code on Colab with the same image and same decoding strategy, the output captions are always the same.

May I know why this is happening?

Thanks.

opened by rsanjaykamath 8

The pre-training pipeline of BLIP

Thanks a lot for your awesome work! I have some questions about the pre-training pipeline of BLIP. Considering that the Captioner and Filter are used to generate pre-training data, I wonder if the complete training process is:

1. pre-training on noisy data
2. fine-tuning on COCO to obtain the captioner and filter
3. using the captioner and filter from step 2 to generate the final pre-training dataset
4. pre-training on the dataset obtained in step 3

opened by wangjk666 8

The result of caption task is not expected

I took the pre-trained checkpoint from the 20th epoch and fine-tuned it on the caption task for 20 epochs on a custom dataset. The best model's results are as follows:

"caption": "post post post post post post post post post post post post post post post post"

The loss value dropped from 6.761 to 6.419.

Does the model converge? Do you have any good suggestions?

opened by liutongyang 7

Pretraining Time

Hi, it is mentioned in the paper that you pre-train the model on two 16-GPU nodes with batch-size=2880. I wonder how long it takes to pre-train on the filtered CC + COCO + SBU + VG dataset, and could you share your log with me?

Thanks a lot.

opened by wangjk666 7

vocab_size of configs/med_config.json

Hi, thank you for your nice work! I have a question: when I test captioning I find that vocab_size should be 30524, but for pre-training it should be set to 30522. Am I right? Thanks.

opened by BlueCat7 7

Can I ask more than 1 question simultaneously through the blip_vqa model?

I know how to ask the same question for multiple images at the same time, and it returns different results for different images. How do I do the reverse? I mean: can I ask multiple questions about the same image and get multiple different answers simultaneously (running the model only once)? If so, how can I do it? I tried feeding multiple questions as a list into the blip_vqa model, but it raises an error that looks like a tensor dimension mismatch. Thank you for your excellent work; I look forward to your reply.

opened by SKBL5694 5

How can I choose the optimal checkpoint in the pre-training model

In the pre-training mode, each epoch is saved.

It seems that the code does not filter the optimal model by the accuracy or loss value.

What is your strategy?

opened by liutongyang 5

Runtime error regarding sum of probabilities

Hello,

We are running the pretraining script and got this error:

Traceback (most recent call last):
  File "pretrain.py", line 184, in <module>
    main(args, config)
  File "pretrain.py", line 141, in main
    train_stats = train(model, data_loader, optimizer, epoch, device, config) 
  File "pretrain.py", line 69, in train
    loss_ita, loss_itm, loss_lm = model(image, caption, alpha = alpha)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jupyter/codebase/BLIP/models/blip_pretrain.py", line 163, in forward
    neg_idx = torch.multinomial(weights_t2i[b], 1).item()
RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)

Any idea how we can solve it please? Thanks

opened by abhisheksgumadi 4
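
A note on the traceback above: the message says the weights passed to the sampling call did not sum to a positive value. The sketch below is a plain-Python analogue of checking weights before sampling. It does not use torch and is not taken from pretrain.py or blip_pretrain.py. The helper name and the uniform fallback are assumptions made purely for illustration, and the root cause in the run above (for example NaN or all-zero weights) is not known from the post.

```python
import math
import random


def sample_negative_index(weights, rng=random):
    """Draw one index in proportion to `weights`.

    Non-finite or non-positive entries are treated as zero. If nothing usable
    remains, fall back to a uniform draw (an assumption of this sketch, not
    behavior of the BLIP code).
    """
    cleaned = [w if math.isfinite(w) and w > 0 else 0.0 for w in weights]
    if sum(cleaned) <= 0:
        return rng.randrange(len(weights))
    return rng.choices(range(len(cleaned)), weights=cleaned, k=1)[0]


# Only index 2 has a positive weight, so it is the only possible draw.
picked = sample_negative_index([0.0, 0.0, 1.0])

# All weights are unusable here, so the uniform fallback is taken.
fallback = sample_negative_index([float("nan"), 0.0])
```

In a real training run it is usually more informative to find out why the weights became invalid than to silently fall back.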

Does a BLIP w/ ViT-L and CapFilt-L model for image captioning exist?

Hi,

At first I would like to say thank you for your great work which inspires me a lot.

I would like to know: does a BLIP w/ ViT-L + CapFilt-L model (using ViT-Large as the encoder and CapFilt for data augmentation) exist? I believe it should be stronger than both BLIP w/ ViT-B + CapFilt-L and BLIP w/ ViT-L.

Thanks

opened by 4thfever 2

The following `model_kwargs` are not used by the model: ['encoder_hidden_states', 'encoder_attention_mask']

I keep getting this error. I migrated from the original BLIP to the new repo, but the error keeps following me. I removed all existing models so that fresh copies would be downloaded, in case I had the wrong models, but that did not help. My code is fairly close to your example, but it still fails. The funny thing is that it worked weeks ago and now keeps failing, so it might still be an issue with my machine, but I cannot find it. Could you tell me which direction to look in?

I try this

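import torch
from PIL import Image

# load_processor, load_model_cache and generate_caption are helpers whose
# imports are not shown in the original post.

raw_img = Image.open("00000-0-1.png").convert("RGB")

# setup device to use
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")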

vis_processor = load_processor("blip_image_eval").build(image_size=384)

model_type= "BLIP_large"

if model_type.startswith("BLIP"):
    blip_type = model_type.split("_")[1].lower()
    model = load_model_cache(
        "blip_caption",
        model_type=f"{blip_type}_coco",
        is_eval=True,
        device=device,
    )

use_beam = False  # also tried True, with the same result
img = vis_processor(raw_img).unsqueeze(0).to(device)
captions = generate_caption(
    model=model, image=img, use_nucleus_sampling=not use_beam
)

opened by osi1880vr 0

Hugging Face integration of `BLIP`

Dear authors,

We have a working implementation of BLIP and 3 of its variants in huggingface transformers (image captioning, visual question answering, image text retrieval): https://github.com/huggingface/transformers/pull/20716 that is not merged yet

The license of the repository and model states that:

3. Neither the name of [Salesforce.com](http://salesforce.com/) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

We would like to promote the addition of this architecture to the transformers library. Therefore, I would like to ask your permission to promote this contribution.

Thank you very much in advance

opened by younesbelkada 0

Why the resize does not preserve the original aspect ratio

Hi, thank you for the work!

I've played with the code and noticed that the examples do not preserve the original aspect ratio during the resize. E.g. https://github.com/salesforce/BLIP/blob/d10be550b2974e17ea72e74edc7948c9e5eab884/predict.py#L93 or the colab example.

I wonder if it is done on purpose?

opened by yurymalkov 2

performance gap in Flickr retrieval

Thanks for your great work and well-written code. We are evaluating zero-shot retrieval performance based on the checkpoint and evaluation code you provided. Our test results are as follows, and there is a gap of about 10 points compared with the results in your paper. We guess there may be a bug. Could you please share your evaluation results based on this repo?

vit-large zero-shot retrieval {"val_txt_r1": 91.42011834319527, "val_txt_r5": 97.534516765286, "val_txt_r10": 98.91518737672584, "val_txt_r_mean": 95.95660749506904, "val_img_r1": 79.30966469428007, "val_img_r5": 94.04339250493096, "val_img_r10": 96.44970414201184, "val_img_r_mean": 89.93425378040763, "val_r_mean": 92.94543063773833, "test_txt_r1": 89.9, "test_txt_r5": 98.8, "test_txt_r10": 99.7, "test_txt_r_mean": 96.13333333333333, "test_img_r1": 80.38, "test_img_r5": 94.88, "test_img_r10": 97.12, "test_img_r_mean": 90.79333333333334, "test_r_mean": 93.46333333333334}

vit-large finetune retrieval {"val_txt_r1": 85.99605522682445, "val_txt_r5": 97.33727810650888, "val_txt_r10": 98.22485207100591, "val_txt_r_mean": 93.85272846811307, "val_img_r1": 77.85009861932939, "val_img_r5": 93.68836291913215, "val_img_r10": 96.60749506903353, "val_img_r_mean": 89.38198553583169, "val_r_mean": 91.61735700197238, "test_txt_r1": 85.4, "test_txt_r5": 97.9, "test_txt_r10": 99.0, "test_txt_r_mean": 94.10000000000001, "test_img_r1": 77.72, "test_img_r5": 94.2, "test_img_r10": 96.88, "test_img_r_mean": 89.60000000000001, "test_r_mean": 91.85000000000001}

opened by amandaluof 2
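
A note on reading the numbers above: the *_r_mean fields appear to be plain averages. The reported values are consistent with txt_r_mean and img_r_mean being the mean of R@1, R@5 and R@10, and r_mean being the mean of those two. The short sketch below recomputes them from the test-split numbers in the post. The function name is made up for the example, and the formula is inferred from the reported values rather than taken from the evaluation code.

```python
def recall_means(txt_r1, txt_r5, txt_r10, img_r1, img_r5, img_r10):
    """Average R@1/R@5/R@10 per direction, then average the two directions."""
    txt_r_mean = (txt_r1 + txt_r5 + txt_r10) / 3
    img_r_mean = (img_r1 + img_r5 + img_r10) / 3
    r_mean = (txt_r_mean + img_r_mean) / 2
    return txt_r_mean, img_r_mean, r_mean


# Test-split values copied from the two result dictionaries in the post.
zero_shot_test = recall_means(89.9, 98.8, 99.7, 80.38, 94.88, 97.12)
finetune_test = recall_means(85.4, 97.9, 99.0, 77.72, 94.2, 96.88)
```
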

Owner

Salesforce

A variety of vendor agnostic projects which power Salesforce

GitHub

CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Generation

CPT This repository contains code and checkpoints for CPT. CPT: A Pre-Trained Unbalanced Transformer for Both Chinese Language Understanding and Gener

341 Dec 29, 2022

CVPR 2021 Official Pytorch Code for UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training

UC2 UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu,

28 Dec 30, 2022

The implementation of "Bootstrapping Semantic Segmentation with Regional Contrast".

ReCo - Regional Contrast This repository contains the source code of ReCo and baselines from the paper, Bootstrapping Semantic Segmentation with Regio

128 Dec 30, 2022

Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR

UniSpeech The family of UniSpeech: UniSpeech (ICML 2021): Unified Pre-training for Self-Supervised Learning and Supervised Learning for ASR UniSpeech-

282 Jan 9, 2023

Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning".

ERICA Source code and dataset for ACL2021 paper: "ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive L

75 Nov 2, 2022

The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding.

SuperGen The source code for Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Requirements Before running, you

38 Dec 12, 2022

Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal, multi-exposure and multi-focus image fusion.

U2Fusion Code of U2Fusion: a unified unsupervised image fusion network for multiple image fusion tasks, including multi-modal (VIS-IR, medical), multi

129 Dec 11, 2022

Official repository for the paper, MidiBERT-Piano: Large-scale Pre-training for Symbolic Music Understanding.

MidiBERT-Piano Authors: Yi-Hui (Sophia) Chou, I-Chun (Bronwin) Chen Introduction This is the official repository for the paper, MidiBERT-Piano: Large-

137 Dec 15, 2022

Official PyTorch code for Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021)

Hierarchical Conditional Flow: A Unified Framework for Image Super-Resolution and Image Rescaling (HCFlow, ICCV2021) This repository is the official P

159 Dec 30, 2022

The source code for the Cutoff data augmentation approach proposed in this paper: "A Simple but Tough-to-Beat Data Augmentation Approach for Natural Language Understanding and Generation".

Cutoff: A Simple Data Augmentation Approach for Natural Language This repository contains source code necessary to reproduce the results presented in

49 Dec 22, 2022

[CVPR 2021] "The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models" Tianlong Chen, Jonathan Frankle, Shiyu Chang, Sijia Liu, Yang Zhang, Michael Carbin, Zhangyang Wang

The Lottery Tickets Hypothesis for Supervised and Self-supervised Pre-training in Computer Vision Models Codes for this paper The Lottery Tickets Hypo

59 Dec 28, 2022

[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

(CVPR2021) Kaleido-BERT: Vision-Language Pre-training on Fashion Domain

X-VLM: Multi-Grained Vision Language Pre-Training

Code for our paper "Graph Pre-training for AMR Parsing and Generation" in ACL2022

Code release for SLIP Self-supervision meets Language-Image Pre-training

Monk is a low code Deep Learning tool and a unified wrapper for Computer Vision.
