X-VLM: Multi-Grained Vision Language Pre-Training

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

  • Jan 2022: released the official PyTorch implementation and X-VLM-base checkpoints
  • Dec 2021: X-VLM-base (4M) achieves new SoTA
  • Nov 2021: released the preprint on arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

  • Support several backbones
    • vision encoder: deit / clip-vit / swin-transformer
    • text encoder: bert / roberta
  • Support apex O1 / O2 for pre-training
  • Read from and write to HDFS
  • Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

  • Install the Python 3 environment:
pip3 install -r requirements.txt
  • Download the raw images from the corresponding websites
  • Download the json files we provide, which contain image read paths, captions, and/or bbox annotations
  • If running the pre-training scripts, also prepare the items marked with % below
  • Organize these files as follows (% marks files needed for pre-training only); a small layout-check sketch follows the directory tree:
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
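
As a quick sanity check before launching training, the short sketch below walks the layout above and reports anything missing. It is only an illustrative helper (not part of the repository); adjust the paths to whatever subset of the data you actually use.

# check_layout.py -- illustrative helper, not part of X-VLM itself.
# Paths are taken from the directory tree above; the second list covers
# the %-marked entries that are needed for pre-training only.
from pathlib import Path

ROOT = Path("X-VLM")

required = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2",
]
pretrain_only = [
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased/pytorch_model.bin",
    "images/sbu",
    "images/cc-3m",
]

for rel in required + pretrain_only:
    status = "ok" if (ROOT / rel).exists() else "MISSING"
    print(f"{status:8s} {rel}")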

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts. All of these will be released in Vision-Language-Data in March. In the meantime, please feel free to prepare your own datasets by referring to the code in dataset/pretrain_dataset.py.
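
If you want to build your own annotation files before the data scripts are released, the sketch below writes a minimal json file in the general shape described above: an image read path plus a caption, optionally with a bounding-box annotation. The field names here are assumptions for illustration only; the authoritative schema is whatever dataset/pretrain_dataset.py expects, so check that file and rename fields accordingly.

# build_toy_annotations.py -- illustrative only; the field names below are
# assumptions, the real schema is defined by dataset/pretrain_dataset.py.
import json

samples = [
    {   # image-text pair (caption-level alignment)
        "image": "images/coco/train2014/COCO_train2014_000000000009.jpg",
        "caption": "a plate of food with broccoli and meat",
    },
    {   # image-text pair with a region annotation (concept-level alignment)
        "image": "images/visualgenome/image/1.jpg",
        "caption": "a red double-decker bus on the street",
        "bbox": [120, 45, 300, 210],  # hypothetical [x, y, w, h] in pixels
    },
]

with open("data/my_pretrain_subset.json", "w") as f:
    json.dump(samples, f, indent=2)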

Checkpoints

X-VLM-base (4M)
X-VLM-base (14M): work in progress
X-VLM-large (14M): work in progress

Finetune

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th" 

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We set some Python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF; we thank the authors for open-sourcing their code.
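
If a fine-tuning run fails to load a checkpoint, it can help to first inspect what the .th file actually contains. The sketch below is a generic PyTorch helper, assuming the released checkpoints are ordinary torch.save archives; the file name is the one used in the commands above, so adjust it to your own path.

# inspect_checkpoint.py -- generic PyTorch helper; assumes the released .th
# files are ordinary torch.save archives (adjust if the format differs).
import torch

ckpt = torch.load("4m_base_finetune/vqa/model_state_epoch_9.th", map_location="cpu")

# Checkpoints are often either a raw state_dict or a dict wrapping one
# under a key such as "model"; handle both cases.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)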

Data

Download the provided json files.

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that the fine-tuning configs are given in "X-VLM/configs/*.yaml".
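
The configs are plain YAML, so they can be inspected or tweaked programmatically before a run. A minimal sketch follows, assuming PyYAML is installed; the file name used here is hypothetical, so substitute one of the actual files under X-VLM/configs/.

# show_config.py -- minimal sketch; assumes PyYAML is installed and that a
# config named "configs/VQA.yaml" exists (the actual file names may differ).
import yaml

with open("configs/VQA.yaml") as f:
    config = yaml.safe_load(f)

# Print the fine-tuning hyperparameters as flat key/value pairs.
for key, value in config.items():
    print(f"{key}: {value}")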

Citation

If you use this code, please consider citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.

Comments
  • Training log for the pretrain stage

    Hi,

    Thank you for releasing the code! It's an interesting project!

    Would it be possible to also release the pre-training logs, or some milestones to verify the training process?

    Thanks! Blakey

    opened by tgxs002 2
  • About license

    Thanks for the great work! I just want to confirm that the released pre-trained models in this project are also covered by https://github.com/zengyan-97/X-VLM/blob/master/LICENSE. Thanks a lot!

    opened by WangWenhao0716 2
  • Distributed mode for single GPU

    Is it possible to run itr_flickr not distributed, but on a single GPU?

    When running: python run.py --task "itr_flickr" --dist "gpu0" --output_dir "output/itr_flickr" --checkpoint "4m_base_finetune/itr_flickr/checkpoint_best.pth"

    I get:

    Training Retrieval Flickr

    | distributed init (rank 0): env://
    Traceback (most recent call last):
      File "Retrieval.py", line 381, in <module>
        main(args, config)
      File "Retrieval.py", line 215, in main
        utils.init_distributed_mode(args)
      File "C:\Users..\X-VLM-master\utils\__init__.py", line 357, in init_distributed_mode
        world_size=args.world_size, rank=args.rank)
      File "C:\Users..\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 434, in init_process_group
        init_method, rank, world_size, timeout=timeout
      File "C:\Users..\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
        raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
    RuntimeError: No rendezvous handler for env://

    opened by TheodorPatrickZ 2
  • Finetuning On NLVR2

    Hi! @zengyan-97

    I cannot find the checkpoint that can be fine-tuned.

    Thank you for providing these checkpoints:

    • nlvr_domain_pretrain.th
    • 16m_base_model_state_step_199999.th
    • 4m_base_model_state_step_199999.th

    There is also a checkpoint that you have already fine-tuned:

    • nlvr_ft/checkpoint_best.pth

    However, could you guide me on how to fine-tune on NLVR2?

    opened by lonestar234028 1
  • Fine-tuning

    Hello,

    I wonder how you fine-tuned your model on the Flickr30K dataset. Did you freeze the text and vision encoders and only fine-tune the itm_head, or did you fine-tune the whole model?

    opened by TheodorPatrickZ 1
  • pretrain-base-4m for the X_VLM

    Dear authors: thanks for your great work. While pre-training the model, I don't know how to organize the open datasets. I have already tried the following yaml.

    [screenshots of the attempted yaml config omitted]

    opened by wfx0330 1
  • Hi, could you provide the specific commands of finetuning on coco captioning? Thanks!

    I am confused about the "lm_domain_pretrain.th" file in "4m_base_finetune/coco_caption/". If I want to reproduce the fine-tuning results on coco captioning, which pre-trained model should I load: "lm_domain_pretrain.th" or "4m_base_model_state_step_199999.th"? Could you provide the specific commands for the two-stage fine-tuning on coco captioning? Thanks!

    opened by yaolinli 1
  • Code for Grad-CAM visualization

    Hi,

    Thanks for the great work.

    In Figure 3 of your paper, you showed the Grad-CAM visualizations of your model on RefCOCO+ from text descriptions. Could you share the code for using Grad-CAM on your model?

    Thanks!

    opened by qiaomu-miao 0
  • About batch sampling `iter_perc`

    Thanks for your code.

    I note that in your paper, you said "We sample the data by making half of the images in a batch containing bounding box annotations".

    But the code is: https://github.com/zengyan-97/X-VLM/blob/e7b960256d194952321b5adad39770c03e6ce9c2/Pretrain.py#L82-L121

    The iter_perc you used is 0.5, which means that only 50% of the time the model takes a batch of image-text-box data together with a batch of image-text data as input; otherwise, it takes only a batch of image-text data as input.

    Therefore, it seems that iter_perc = 1.0 is what matches the statement in the paper.

    According to the ablation results in Table 4, you have evidently tested the impact of iter_perc = 0.0 (corresponding to the model X-VLM w/o all).

    So, have you tested other values of iter_perc (e.g., 1.0)?

    opened by yangbang18 0
  • Performance of different vision encoders

    Thanks for your great sharing.

    https://github.com/zengyan-97/X-VLM/blob/e7b960256d194952321b5adad39770c03e6ce9c2/models/xvlm.py#L121

    As shown above, you mention in the code that initializing the vision encoder with deit works worse than with clip-vit or swin.

    Do you have any supporting results, for example the image-text retrieval performance with deit versus swin?

    opened by AI-in-Hospitals 0
  • Fine-tune on VQA

    Traceback (most recent call last):
      File "VQA.py", line 283, in <module>
        main(args, config)
      File "VQA.py", line 134, in main
        train_dataset, vqa_test_dataset = create_dataset('vqa', config, args.evaluate)
      File "/home/deer/X-VLM/dataset/__init__.py", line 84, in create_dataset
        vqa_test_dataset = vqa_dataset(config['test_file'], test_transform, config['vqa_root'], config['vg_root'],
      File "/home/deer/X-VLM/dataset/vqa_dataset.py", line 31, in __init__
        tokenizer = RobertaTokenizer.from_pretrained(text_encoder) if use_roberta else BertTokenizer.from_pretrained(text_encoder)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1651, in from_pretrained
        fast_tokenizer_file = get_fast_tokenizer_file(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3467, in get_fast_tokenizer_file
        all_files = get_list_of_files(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/file_utils.py", line 1818, in get_list_of_files
        model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
        return fn(*args, **kwargs)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
        return f(*args, **kwargs)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 1289, in model_info
        hf_raise_for_status(r)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 242, in hf_raise_for_status
        raise RepositoryNotFoundError(message, response) from e
    huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Tpjmanp6E_vsjeLwXkyXg)

    Repository Not Found for url: https://huggingface.co/api/models/data/bert-base-uncased. Please make sure you specified the correct repo_id and repo_type. If the repo is private, make sure you are authenticated.

    Hi, have you ever encountered this problem? How should I deal with it?

    opened by darwann 2
  • About swin_B_480

    Currently I want to train a model at 480 resolution, but I cannot find the link to download the swin-480 image encoder. Could you share the download link? Thank you very much.

    opened by Sxx1995 1