PyTorch code for BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Overview

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

This is the PyTorch code of the BLIP paper. The code has been tested on PyTorch 1.10. To install the dependencies, run

pip install -r requirements.txt

Catalog:

  • Inference demo
  • Pre-trained and finetuned checkpoints
  • Finetuning code for Image-Text Retrieval, Image Captioning, VQA, and NLVR2
  • Pre-training code
  • Download of bootstrapped pre-training datasets

Inference demo:

Run our interactive demo using the Colab notebook (no GPU needed). The demo includes code for: (1) image captioning, (2) open-ended visual question answering, and (3) multimodal / unimodal feature extraction.

Integrated into Hugging Face Spaces 🤗 using Gradio. Try out the Web Demo on Hugging Face Spaces.
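
For a quick local test, a minimal captioning sketch in the spirit of the Colab demo is shown below. The preprocessing and generate() arguments mirror the demo notebook; the image path and the checkpoint placeholder are illustrative and should be replaced with your own file and one of the checkpoints listed below.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip import blip_decoder

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
image_size = 384

# demo-style preprocessing: bicubic resize, tensorize, normalize with CLIP statistics
transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('demo.jpg').convert('RGB')).unsqueeze(0).to(device)

checkpoint = 'path/or/url/of/captioning_checkpoint.pth'  # placeholder: see the checkpoint tables below
model = blip_decoder(pretrained=checkpoint, image_size=image_size, vit='base').to(device)
model.eval()

with torch.no_grad():
    # beam search is deterministic; use sample=True, top_p=0.9 for nucleus sampling instead
    caption = model.generate(image, sample=False, num_beams=3, max_length=20, min_length=5)
print('caption:', caption[0])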

Pre-trained checkpoints:

Num. pre-train images | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L
14M                   | Download      | -                           | -
129M                  | Download      | Download                    | Download

Finetuned checkpoints:

Task                             | BLIP w/ ViT-B | BLIP w/ ViT-B and CapFilt-L | BLIP w/ ViT-L
Image-Text Retrieval (COCO)      | Download      | -                           | Download
Image-Text Retrieval (Flickr30k) | Download      | -                           | Download
Image Captioning (COCO)          | -             | Download                    | Download
VQA                              | Download      | Download                    | -
NLVR2                            | Download      | -                           | -

Image-Text Retrieval:

  1. Download COCO and Flickr30k datasets from the original websites, and set 'image_root' in configs/retrieval_{dataset}.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco \
--evaluate
  3. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/retrieval_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_retrieval.py \
--config ./configs/retrieval_coco.yaml \
--output_dir output/retrieval_coco 
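
If you prefer to script the config edits rather than modify the yaml by hand, a small helper like the sketch below should work; it only touches the 'pretrained' and 'image_root' keys mentioned above and assumes PyYAML is installed (the image_root value is a placeholder for your local COCO path).

import yaml  # PyYAML

cfg_path = 'configs/retrieval_coco.yaml'
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# point the run at the pre-trained checkpoint and your local COCO images
cfg['pretrained'] = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth'
cfg['image_root'] = '/path/to/coco/images/'

with open(cfg_path, 'w') as f:
    yaml.safe_dump(cfg, f)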

Image-Text Captioning:

  1. Download COCO and NoCaps datasets from the original websites, and set 'image_root' in configs/caption_coco.yaml and configs/nocaps.yaml accordingly.
  2. To evaluate the finetuned BLIP model on COCO, run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py --evaluate
  3. To evaluate the finetuned BLIP model on NoCaps, generate results with the following command (evaluation needs to be performed on the official server):
python -m torch.distributed.run --nproc_per_node=8 eval_nocaps.py 
  4. To finetune the pre-trained checkpoint using 8 A100 GPUs, first set 'pretrained' in configs/caption_coco.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=8 train_caption.py 

VQA:

  1. Download VQA v2 dataset and Visual Genome dataset from the original websites, and set 'vqa_root' and 'vg_root' in configs/vqa.yaml.
  2. To evaluate the finetuned BLIP model, generate results with the following command (evaluation needs to be performed on the official server):
python -m torch.distributed.run --nproc_per_node=8 train_vqa.py --evaluate
  3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/vqa.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model*_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_vqa.py 
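
For quick single-image VQA inference outside the training script, a hedged sketch modeled on the demo notebook is shown below; the checkpoint placeholder and image path are illustrative, and the 480x480 input size follows the VQA config.

import torch
from PIL import Image
from torchvision import transforms
from torchvision.transforms.functional import InterpolationMode
from models.blip_vqa import blip_vqa

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
image_size = 480  # VQA models use 480x480 inputs

transform = transforms.Compose([
    transforms.Resize((image_size, image_size), interpolation=InterpolationMode.BICUBIC),
    transforms.ToTensor(),
    transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                         (0.26862954, 0.26130258, 0.27577711)),
])
image = transform(Image.open('demo.jpg').convert('RGB')).unsqueeze(0).to(device)

checkpoint = 'path/or/url/of/vqa_checkpoint.pth'  # placeholder: see the finetuned checkpoints table above
model = blip_vqa(pretrained=checkpoint, image_size=image_size, vit='base').to(device)
model.eval()

with torch.no_grad():
    # 'generate' decodes a free-form answer to the question
    answer = model(image, 'where is the dog?', train=False, inference='generate')
print('answer:', answer)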

NLVR2:

  1. Download the NLVR2 dataset from the original website, and set 'image_root' in configs/nlvr.yaml.
  2. To evaluate the finetuned BLIP model, run
python -m torch.distributed.run --nproc_per_node=8 train_nlvr.py --evaluate
  3. To finetune the pre-trained checkpoint using 16 A100 GPUs, first set 'pretrained' in configs/nlvr.yaml as "https://storage.googleapis.com/sfr-vision-language-research/BLIP/models/model_base.pth". Then run:
python -m torch.distributed.run --nproc_per_node=16 train_nlvr.py 

Pre-train:

  1. Prepare training json files where each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'image': path_of_image, 'caption': text_of_image} (see the sketch after this list).
  2. In configs/pretrain.yaml, set 'train_file' to the paths of the json files.
  3. Pre-train the model using 8 A100 GPUs:
python -m torch.distributed.run --nproc_per_node=8 pretrain.py --config ./configs/pretrain.yaml --output_dir output/Pretrain
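
A minimal sketch for writing a training json in this format (the image paths and captions below are placeholders):

import json

# each entry pairs a local image path with its caption text
samples = [
    {'image': '/data/images/0000001.jpg', 'caption': 'a dog running on the beach'},
    {'image': '/data/images/0000002.jpg', 'caption': 'two people riding bicycles at sunset'},
]

with open('pretrain_data_0.json', 'w') as f:
    json.dump(samples, f)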

Pre-training datasets download:

We provide bootstrapped pre-training datasets as json files. Each json file contains a list. Each item in the list is a dictionary with two key-value pairs: {'url': url_of_image, 'caption': text_of_image}.

Image source   | Filtered web caption | Filtered synthetic caption | Filtered synthetic caption by ViT-L
CC3M+CC12M+SBU | Download             | Download                   | Download
LAION115M      | Download             | Download                   | Download
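
Because these json files store image URLs rather than local files, the images need to be downloaded before pre-training. A rough sketch of such a downloader is shown below (sequential and without retries, with placeholder file names); it also rewrites the entries into the {'image', 'caption'} format expected by the pre-training code.

import json
import os
import requests

src_json = 'ccs_filtered.json'   # placeholder: one of the downloaded json files
out_dir = 'downloaded_images'
os.makedirs(out_dir, exist_ok=True)

with open(src_json) as f:
    items = json.load(f)

train_entries = []
for i, item in enumerate(items):
    path = os.path.join(out_dir, f'{i:08d}.jpg')
    try:
        resp = requests.get(item['url'], timeout=10)
        resp.raise_for_status()
        with open(path, 'wb') as img_file:
            img_file.write(resp.content)
        train_entries.append({'image': path, 'caption': item['caption']})
    except requests.RequestException:
        continue  # skip dead links; some web URLs may no longer resolve

with open('pretrain_downloaded.json', 'w') as f:
    json.dump(train_entries, f)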

Citation

If you find this code useful for your research, please consider citing:

@misc{li2022blip,
      title={BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation}, 
      author={Junnan Li and Dongxu Li and Caiming Xiong and Steven Hoi},
      year={2022},
      eprint={2201.12086},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The implementation of BLIP relies on resources from ALBEF, Huggingface Transformers, and timm. We thank the original authors for open-sourcing their work.

Comments
  • Problem with .pth file download on Windows

    First I want to thank you. Based on my playing over at huggingface this seems to be the best piece of software I have hit on for image captioning. I am trying to get it to run locally on Windows 10.

    I keep winding up with the following error:

    OSError: [Errno 22] Invalid argument: 'C:\Users\Matthew/.cache\torch\hub\checkpoints\model*_base_caption.pth'

    It seems to mangle the path to the .pth file. I tried putting some print statements in to try to figure out what was going on, and it seems to be something around os.path.dirname, but I got lost. I tried a second tack and just downloaded all the .pth files and put them in the proper directory, but that does the same thing. I tried a third tack and changed the model URL to a file:// that points at the file, and that results in a 'RuntimeError: checkpoint url or path is invalid' error.

    Any help is appreciated!

    opened by matthewkleinmann 10
  • question about Bootstrap captions from noisy image-text pairs?

    Are there any example scripts showing how to use the pre-trained checkpoint to bootstrap captions from noisy image-text pairs and obtain cleaner training data?

    opened by trouble-maker007 9
  • feature extraction on images only

    I want to process a folder of images that I will use for comparing to an input text (which will be given at a different time). How do I use your colab to extract features from images and then at a later time, compare them to an input text? All the examples involve passing in an image and a text at the same time.
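
    A possible approach (a hedged sketch, assuming blip_feature_extractor in models/blip.py behaves as in the demo notebook, where mode='image' uses only the visual encoder and ignores the caption argument; scoring these cached features against text later would additionally need the projection heads of the retrieval model):

    import torch
    from PIL import Image
    from torchvision import transforms
    from torchvision.transforms.functional import InterpolationMode
    from models.blip import blip_feature_extractor

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    transform = transforms.Compose([
        transforms.Resize((224, 224), interpolation=InterpolationMode.BICUBIC),
        transforms.ToTensor(),
        transforms.Normalize((0.48145466, 0.4578275, 0.40821073),
                             (0.26862954, 0.26130258, 0.27577711)),
    ])

    checkpoint = 'path/or/url/of/blip_checkpoint.pth'  # placeholder checkpoint
    model = blip_feature_extractor(pretrained=checkpoint, image_size=224, vit='base').to(device)
    model.eval()

    image = transform(Image.open('some_image.jpg').convert('RGB')).unsqueeze(0).to(device)
    with torch.no_grad():
        # mode='image' returns the visual-encoder embeddings; cache these per image
        image_feature = model(image, '', mode='image')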

    opened by nikky4D 8
  • Differences in output captions on Colab vs Spaces

    Hello,

    When I try the demo on Huggingface Spaces (https://huggingface.co/spaces/Salesforce/BLIP) with a sample image, the output captions are different every time I press the submit button, even with the same decoding strategy.

    But if I try the code on Colab with the same image and same decoding strategy, the output captions are always the same.

    May I know why this is happening?

    Thanks.

    opened by rsanjaykamath 8
  • The pre-training pipeline of BLIP

    Thanks a lot for your awesome work! I have some questions about the pre-training pipeline of BLIP. Considering the use of the Captioner and Filter to generate pre-training data, I wonder if the complete training process is:

    1. pre-training on noisy data
    2. fine-tuning on COCO to obtain captioner and filter
    3. using the captioner and filter from 2 to generate the final pretraining dataset
    4. pre-training on the dataset obtained in 3
    opened by wangjk666 8
  • The result of caption task is not expected

    I used the pre-trained model from the 20th epoch and fine-tuned the caption task for 20 epochs on a custom dataset. The best model's results are as follows:

    "caption": "post post post post post post post post post post post post post post post post"

    The loss value dropped from 6.761 to 6.419.

    Does the model converge? Do you have any good suggestions?

    opened by liutongyang 7
  • Pretraining Time

    Hi, it is mentioned in the paper that you pretrain the model on two 16-GPU nodes with batch size 2880. I wonder how long it takes to pretrain on the filtered CC + COCO + SBU + VG dataset, and could you share your log with me?

    Thanks a lot.

    opened by wangjk666 7
  • vocab_size of configs/med_config.json

    Hi, thanks for your nice work! I have a question: when I test captioning I find that vocab_size should be 30524, but for pre-training it should be set to 30522. Am I right? Thanks.

    opened by BlueCat7 7
  • Can I ask more than 1 question simultaneously through the blip_vqa model?

    I know how to ask the same question for multiple images at the same time, and it will return different results for different images; how do I do the reverse? I mean: can I ask multiple questions about the same image and get multiple different answers simultaneously (running the model only once)? If so, how can I do it? I tried feeding multiple questions as a list into the blip_vqa model, but it raised an error that looks like a tensor dimension mismatch. Thank you for your excellent work; I look forward to your reply.

    opened by SKBL5694 5
  • How can I choose the optimal checkpoint in the pre-training model

    In pre-training mode, a checkpoint is saved after each epoch.

    It seems that the code does not select the optimal model by accuracy or loss value.

    What is your strategy?

    opened by liutongyang 5
  • Runtime error regarding sum of probabilities

    Hello,

    We are running the pretraining script and got this error:

    Traceback (most recent call last):
      File "pretrain.py", line 184, in <module>
        main(args, config)
      File "pretrain.py", line 141, in main
        train_stats = train(model, data_loader, optimizer, epoch, device, config) 
      File "pretrain.py", line 69, in train
        loss_ita, loss_itm, loss_lm = model(image, caption, alpha = alpha)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
        output = self.module(*inputs[0], **kwargs[0])
      File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/jupyter/codebase/BLIP/models/blip_pretrain.py", line 163, in forward
        neg_idx = torch.multinomial(weights_t2i[b], 1).item()
    RuntimeError: invalid multinomial distribution (sum of probabilities <= 0)
    

    Any idea how we can solve it please? Thanks

    opened by abhisheksgumadi 4
  • is BLIP w/ ViT-L and CapFilt-L model for image captioning exist?

    Hi,

    At first I would like to say thank you for your great work which inspires me a lot.

    I would like to know: does a BLIP w/ ViT-L + CapFilt-L model (using ViT-L as the encoder and CapFilt for data augmentation) exist? I believe it should be stronger than BLIP w/ ViT-B + CapFilt-L and BLIP w/ ViT-L.

    Thanks

    opened by 4thfever 2
  • The following `model_kwargs` are not used by the model: ['encoder_hidden_states', 'encoder_attention_mask']

    I keep getting this error. I migrated from the original BLIP to the new repo here, but the error keeps following me. I removed all existing models so fresh copies would be downloaded (to rule out having the wrong models), but that did not help. I tried code close to your example, but it still fails. The funny thing is that it worked weeks ago and now it keeps failing, so it might still be an issue with my machine, but I cannot find the cause. Could you point me in the right direction?

    I tried this:

    import torch
    from PIL import Image

    # load_processor, load_model_cache and generate_caption are helpers from the LAVIS-based setup used here
    raw_img = Image.open("00000-0-1.png").convert("RGB")
    
    # setup device to use
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    
    vis_processor = load_processor("blip_image_eval").build(image_size=384)
    
    model_type= "BLIP_large"
    
    if model_type.startswith("BLIP"):
        blip_type = model_type.split("_")[1].lower()
        model = load_model_cache(
            "blip_caption",
            model_type=f"{blip_type}_coco",
            is_eval=True,
            device=device,
        )
    
    use_beam = False #did try True either but same result
    img = vis_processor(raw_img).unsqueeze(0).to(device)
    captions = generate_caption(
        model=model, image=img, use_nucleus_sampling=not use_beam
    )
    
    opened by osi1880vr 0
  • Hugging Face integration of `BLIP`

    Dear authors,

    We have a working implementation of BLIP and 3 of its variants in huggingface transformers (image captioning, visual question answering, image-text retrieval): https://github.com/huggingface/transformers/pull/20716 (not merged yet).

    The license of the repository and model states that:

    3. Neither the name of [Salesforce.com](http://salesforce.com/) nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
    

    We would like to promote the addition of this architecture to the transformers library. Therefore I would like to ask your permission to promote this contribution.

    Thank you very much in advance

    opened by younesbelkada 0
  • Why the resize does not preserve the original aspect ratio

    Hi, thank you for the work!

    I've played with the code and noticed that the examples do not preserve the original aspect ratio during the resize, e.g. https://github.com/salesforce/BLIP/blob/d10be550b2974e17ea72e74edc7948c9e5eab884/predict.py#L93 or the colab example.

    I wonder if this is done on purpose?

    opened by yurymalkov 2
  • performance gap in Flickr retrieval

    Thanks for your great work and well-written code. We are evaluating the zero-shot retrieval performance based on the checkpoint and evaluation code you provided. Our testing results are as follows, and there is about a 10-point gap from the numbers in your paper. We suspect there may be a bug. Could you please supply your evaluation results based on this repo?

    vit-large zero-shot retrieval {"val_txt_r1": 91.42011834319527, "val_txt_r5": 97.534516765286, "val_txt_r10": 98.91518737672584, "val_txt_r_mean": 95.95660749506904, "val_img_r1": 79.30966469428007, "val_img_r5": 94.04339250493096, "val_img_r10": 96.44970414201184, "val_img_r_mean": 89.93425378040763, "val_r_mean": 92.94543063773833, "test_txt_r1": 89.9, "test_txt_r5": 98.8, "test_txt_r10": 99.7, "test_txt_r_mean": 96.13333333333333, "test_img_r1": 80.38, "test_img_r5": 94.88, "test_img_r10": 97.12, "test_img_r_mean": 90.79333333333334, "test_r_mean": 93.46333333333334}

    vit-large finetune retrieval {"val_txt_r1": 85.99605522682445, "val_txt_r5": 97.33727810650888, "val_txt_r10": 98.22485207100591, "val_txt_r_mean": 93.85272846811307, "val_img_r1": 77.85009861932939, "val_img_r5": 93.68836291913215, "val_img_r10": 96.60749506903353, "val_img_r_mean": 89.38198553583169, "val_r_mean": 91.61735700197238, "test_txt_r1": 85.4, "test_txt_r5": 97.9, "test_txt_r10": 99.0, "test_txt_r_mean": 94.10000000000001, "test_img_r1": 77.72, "test_img_r5": 94.2, "test_img_r10": 96.88, "test_img_r_mean": 89.60000000000001, "test_r_mean": 91.85000000000001}

    opened by amandaluof 2