X-VLM: Multi-Grained Vision Language Pre-Training

Overview

X-VLM: learning multi-grained vision language alignments

Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. Yan Zeng, Xinsong Zhang, Hang Li. arXiv 2021.

  • Jan 2022: released the official PyTorch implementation and X-VLM-base checkpoints
  • Dec 2021: X-VLM-base (4M) achieves new SoTA
  • Nov 2021: released the preprint on arXiv

Hiring

We are looking for interns at ByteDance AI-LAB (in Beijing / Shanghai)! If you are interested in working with us on vision language models, please send your resume to [email protected].

Features

  • Support several backbones
    • vision encoder: deit / clip-vit / swin-transformer
    • text encoder: bert / roberta
  • Support apex O1 / O2 for pre-training
  • Read from and write to HDFS
  • Distributed training across nodes for both pre-training and fine-tuning

Please read the code for more details.

Requirements

  • Install the Python 3 environment:
pip3 install -r requirements.txt
  • Download the raw images from the corresponding websites
  • Download the json files we provide, which contain image read paths, captions, and/or bbox annotations
  • If running the pre-training scripts, also prepare the items marked with % below
  • Organize these files as follows (% marks files needed for pre-training only); a small layout-check sketch follows the directory tree:
X-VLM/
    data/
        finetune/
            refcoco+/*.json
            *.json
        
        %pretrain_4m/*.json
        %swin_base_patch4_window7_224_22k.pth
        %bert-base-uncased/
            config.json
            pytorch_model.bin
            tokenizer_config.json
            tokenizer.json
            vocab.txt

    images/
        coco/
            train2014/*.jpg
            val2014/*.jpg
            test2015/*.jpg
        
        visualgenome/
            image/*.jpg
        
        nlvr2/
            images/
                train/0-99/*.png
            dev/*.png
            test1/*.png
        
        %sbu/*.jpg
        %cc-3m/*.jpg
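
As a quick sanity check before launching training, the short sketch below walks the layout above and reports anything missing. It is only an illustrative helper (not part of the repository); adjust the paths to whatever subset of the data you actually use.

# check_layout.py -- illustrative helper, not part of X-VLM itself.
# Paths are taken from the directory tree above; the second list covers
# the %-marked entries that are needed for pre-training only.
from pathlib import Path

ROOT = Path("X-VLM")

required = [
    "data/finetune",
    "images/coco/train2014",
    "images/coco/val2014",
    "images/coco/test2015",
    "images/visualgenome/image",
    "images/nlvr2",
]
pretrain_only = [
    "data/pretrain_4m",
    "data/swin_base_patch4_window7_224_22k.pth",
    "data/bert-base-uncased/pytorch_model.bin",
    "images/sbu",
    "images/cc-3m",
]

for rel in required + pretrain_only:
    status = "ok" if (ROOT / rel).exists() else "MISSING"
    print(f"{status:8s} {rel}")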

Pretrain

python3 run.py --task "pretrain_4m_base" --dist "1" --output_dir "output/pretrain_4m_base"

For distributed training across nodes, see run.py for more details.

Data

We are organizing the data and the scripts. All of these will be released in Vision-Language-Data in March. In the meantime, please feel free to prepare your own datasets by referring to the code in dataset/pretrain_dataset.py.
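
If you want to build your own annotation files before the data scripts are released, the sketch below writes a minimal json file in the general shape described above: an image read path plus a caption, optionally with a bounding-box annotation. The field names here are assumptions for illustration only; the authoritative schema is whatever dataset/pretrain_dataset.py expects, so check that file and rename fields accordingly.

# build_toy_annotations.py -- illustrative only; the field names below are
# assumptions, the real schema is defined by dataset/pretrain_dataset.py.
import json

samples = [
    {   # image-text pair (caption-level alignment)
        "image": "images/coco/train2014/COCO_train2014_000000000009.jpg",
        "caption": "a plate of food with broccoli and meat",
    },
    {   # image-text pair with a region annotation (concept-level alignment)
        "image": "images/visualgenome/image/1.jpg",
        "caption": "a red double-decker bus on the street",
        "bbox": [120, 45, 300, 210],  # hypothetical [x, y, w, h] in pixels
    },
]

with open("data/my_pretrain_subset.json", "w") as f:
    json.dump(samples, f, indent=2)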

Checkpoints

X-VLM-base (4M)
X-VLM-base (14M): work in progress
X-VLM-large (14M): work in progress

Finetune

# train
python3 run.py --task "vqa" --dist "1" --output_dir "output/vqa" --checkpoint "4m_base_model_state_step_199999.th"
python3 run.py --task "vqa" --dist "all" --output_dir "output/vqa" --output_hdfs "hdfs://xxx/vqa_tmp" --checkpoint "4m_base_model_state_step_199999.th"  # if using >2 nodes for fine-tuning, specify --output_hdfs to save some tmp results.

# evaluate
python3 run.py --task "vqa" --dist "1" --evaluate --output_dir "output/vqa_eval" --checkpoint "4m_base_finetune/vqa/model_state_epoch_9.th" 

See run.py for fine-tuning on other tasks (Retrieval, NLVR2, RefCOCO). We set some Python assertions to help you run the code correctly. The fine-tuning scripts are based on ALBEF; we thank the authors for open-sourcing their code.
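
If a fine-tuning run fails to load a checkpoint, it can help to first inspect what the .th file actually contains. The sketch below is a generic PyTorch helper, assuming the released checkpoints are ordinary torch.save archives; the file name is the one used in the commands above, so adjust it to your own path.

# inspect_checkpoint.py -- generic PyTorch helper; assumes the released .th
# files are ordinary torch.save archives (adjust if the format differs).
import torch

ckpt = torch.load("4m_base_finetune/vqa/model_state_epoch_9.th", map_location="cpu")

# Checkpoints are often either a raw state_dict or a dict wrapping one
# under a key such as "model"; handle both cases.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

print(f"{len(state_dict)} entries")
for name, value in list(state_dict.items())[:10]:
    shape = tuple(value.shape) if hasattr(value, "shape") else type(value).__name__
    print(name, shape)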

Data

Download the provided json files.

Checkpoints and Logs

retrieval-mscoco
retrieval-flickr
vqa
nlvr2
refcoco
refcoco-bbox
Note that the fine-tuning configs are given in "X-VLM/configs/*.yaml".
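
The configs are plain YAML, so they can be inspected or tweaked programmatically before a run. A minimal sketch follows, assuming PyYAML is installed; the file name used here is hypothetical, so substitute one of the actual files under X-VLM/configs/.

# show_config.py -- minimal sketch; assumes PyYAML is installed and that a
# config named "configs/VQA.yaml" exists (the actual file names may differ).
import yaml

with open("configs/VQA.yaml") as f:
    config = yaml.safe_load(f)

# Print the fine-tuning hyperparameters as flat key/value pairs.
for key, value in config.items():
    print(f"{key}: {value}")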

Citation

If you use this code, please consider citing:

@article{xvlm,
  title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
  author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
  journal={arXiv preprint arXiv:2111.08276},
  year={2021}
}

Contact

For issues or help using this code, please submit a GitHub issue.

Comments
  • Training log for the pretrain stage

    Hi,

    Thank you for releasing the code! It's an interesting project!

    Would it be possible to also release the pre-training logs, or some milestones to verify the training process?

    Thanks! Blakey

    opened by tgxs002 2
  • About license

    Thanks for the great work! I just want to confirm that the released pre-trained models in this project are also covered by https://github.com/zengyan-97/X-VLM/blob/master/LICENSE. Thanks a lot!

    opened by WangWenhao0716 2
  • Distributed mode for single GPU

    Is it possible to run itr_flickr not distributed, but on a single GPU?

    When running: python run.py --task "itr_flickr" --dist "gpu0" --output_dir "output/itr_flickr" --checkpoint "4m_base_finetune/itr_flickr/checkpoint_best.pth"

    I get:

    Training Retrieval Flickr

    | distributed init (rank 0): env://
    Traceback (most recent call last):
      File "Retrieval.py", line 381, in <module>
        main(args, config)
      File "Retrieval.py", line 215, in main
        utils.init_distributed_mode(args)
      File "C:\Users..\X-VLM-master\utils\__init__.py", line 357, in init_distributed_mode
        world_size=args.world_size, rank=args.rank)
      File "C:\Users..\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\distributed_c10d.py", line 434, in init_process_group
        init_method, rank, world_size, timeout=timeout
      File "C:\Users..\AppData\Local\Programs\Python\Python37\lib\site-packages\torch\distributed\rendezvous.py", line 82, in rendezvous
        raise RuntimeError("No rendezvous handler for {}://".format(result.scheme))
    RuntimeError: No rendezvous handler for env://

    opened by TheodorPatrickZ 2
  • Finetuning On NLVR2

    Hi! @zengyan-97

    I cannot find the checkpoint that can be fine-tuned.

    Thank you for providing these checkpoints:

    • nlvr_domain_pretrain.th
    • 16m_base_model_state_step_199999.th
    • 4m_base_model_state_step_199999.th

    There is also a checkpoint that you have already fine-tuned:

    • nlvr_ft/checkpoint_best.pth

    However, could you guide me on how to fine-tune on NLVR2?

    opened by lonestar234028 1
  • Fine-tuning

    Hello,

    I wonder how you fine-tuned your model on the Flickr30K dataset. Did you freeze the text and vision encoders and only fine-tune the itm_head, or did you fine-tune the whole model?

    opened by TheodorPatrickZ 1
  • pretrain-base-4m for the X_VLM

    Dear authors: thanks for your great work. While pre-training the model, I don't know how to organize the open datasets. I have already tried the following yaml.

    [screenshots of the attempted yaml config omitted]

    opened by wfx0330 1
  • Hi, could you provide the specific commands of finetuning on coco captioning? Thanks!

    I am confused about the "lm_domain_pretrain.th" file in "4m_base_finetune/coco_caption/". If I want to reproduce the fine-tuning results on coco captioning, which pre-trained model should I load: "lm_domain_pretrain.th" or "4m_base_model_state_step_199999.th"? Could you provide the specific commands for the two-stage fine-tuning on coco captioning? Thanks!

    opened by yaolinli 1
  • Code for Grad-CAM visualization

    Hi,

    Thanks for the great work.

    In Figure 3 of your paper, you showed the Grad-CAM visualizations of your model on RefCOCO+ from text descriptions. Could you share the code for using Grad-CAM on your model?

    Thanks!

    opened by qiaomu-miao 0
  • About batch sampling `iter_perc`

    Thanks for your code.

    I note that in your paper, you said "We sample the data by making half of the images in a batch containing bounding box annotations".

    But the code is: https://github.com/zengyan-97/X-VLM/blob/e7b960256d194952321b5adad39770c03e6ce9c2/Pretrain.py#L82-L121

    The iter_perc you used is 0.5, which means that only 50% of the time the model takes a batch of image-text-box data together with a batch of image-text data as input; otherwise, it takes only a batch of image-text data as input.

    Therefore, it seems that iter_perc = 1.0 is what matches the statement in the paper.

    According to the ablation results in Table 4, you have evidently tested the impact of iter_perc = 0.0 (corresponding to the model X-VLM w/o all).

    So, have you tested other values of iter_perc (e.g., 1.0)?

    opened by yangbang18 0
  • Performance of different vision encoders

    Thanks for your great sharing.

    https://github.com/zengyan-97/X-VLM/blob/e7b960256d194952321b5adad39770c03e6ce9c2/models/xvlm.py#L121

    As shown above, you mention in the code that initializing the vision encoder with deit works worse than with clip-vit or swin.

    Do you have any supporting results, for example the image-text retrieval performance with deit versus swin?

    opened by AI-in-Hospitals 0
  • Fine-tune on VQA

    Traceback (most recent call last):
      File "VQA.py", line 283, in <module>
        main(args, config)
      File "VQA.py", line 134, in main
        train_dataset, vqa_test_dataset = create_dataset('vqa', config, args.evaluate)
      File "/home/deer/X-VLM/dataset/__init__.py", line 84, in create_dataset
        vqa_test_dataset = vqa_dataset(config['test_file'], test_transform, config['vqa_root'], config['vg_root'],
      File "/home/deer/X-VLM/dataset/vqa_dataset.py", line 31, in __init__
        tokenizer = RobertaTokenizer.from_pretrained(text_encoder) if use_roberta else BertTokenizer.from_pretrained(text_encoder)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1651, in from_pretrained
        fast_tokenizer_file = get_fast_tokenizer_file(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 3467, in get_fast_tokenizer_file
        all_files = get_list_of_files(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/transformers/file_utils.py", line 1818, in get_list_of_files
        model_info = HfApi(endpoint=HUGGINGFACE_CO_RESOLVE_ENDPOINT).model_info(
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 94, in _inner_fn
        return fn(*args, **kwargs)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_deprecation.py", line 98, in inner_f
        return f(*args, **kwargs)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 1289, in model_info
        hf_raise_for_status(r)
      File "/home/deer/anaconda3/envs/xvlm/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 242, in hf_raise_for_status
        raise RepositoryNotFoundError(message, response) from e
    huggingface_hub.utils._errors.RepositoryNotFoundError: 401 Client Error. (Request ID: Tpjmanp6E_vsjeLwXkyXg)

    Repository Not Found for url: https://huggingface.co/api/models/data/bert-base-uncased. Please make sure you specified the correct repo_id and repo_type. If the repo is private, make sure you are authenticated.

    Hi, have you ever encountered this problem? How should I deal with it?

    opened by darwann 2
  • About swin_B_480

    Currently I want to train a model at 480 resolution, but I cannot find the link to download the swin-480 image encoder. Could you share the download link? Thank you very much.

    opened by Sxx1995 1