UC2
UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training
Mingyang Zhou, Luowei Zhou, Shuohang Wang, Yu Cheng, Linjie Li, Zhou Yu, Jingjing Liu
This is the official repository of UC2, a multilingual multi-modal pre-training framework. In this repository we support end-to-end pretraining as well as finetuning for image-text retrieval on COCO.
Requirements
We provide a Docker image to run our code. Please install the following: the NVIDIA driver, Docker, and the NVIDIA Container Toolkit.
To run the docker command without sudo, the user needs to be a member of the docker group. Our code only supports Linux with NVIDIA GPUs. We tested our code on Ubuntu 18.04 with V100 cards.
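As a minimal sketch (assuming your distribution uses the standard docker group), you can add the current user to the group and then log out and back in for the change to take effect:

sudo usermod -aG docker $USER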
Data and Pretrained Checkpoints
Download the pre-processed text features and pretrained checkpoints with the following command:
wget https://mmaisharables.blob.core.windows.net/uc2/UC2_DATA.tar.gz
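The downloaded archive can then be unpacked into your storage directory, for example (assuming /PATH_TO_STORAGE is the same data root used in the commands below and already exists):

tar -xzvf UC2_DATA.tar.gz -C /PATH_TO_STORAGE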
The image features for MSCOCO can be obtained from UNITER via this code script. As the Conceptual Captions (CC) image features are large and inconvenient to download directly, please contact the UNITER authors to obtain them if you are interested in pretraining.
Launch the Docker Container for Experiments
Once the data and checkpoints are set up properly, please run the following command to launch a Docker container and start the pretraining process.
source launch_container_pretrain.sh /PATH_TO_STORAGE/txt_db /PATH_TO_STORAGE/img_db /PATH_TO_STORAGE/finetune /PATH_TO_STORAGE/pretrain
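For example, if all data were downloaded under a single (hypothetical) root /data/uc2, the invocation would look like:

source launch_container_pretrain.sh /data/uc2/txt_db /data/uc2/img_db /data/uc2/finetune /data/uc2/pretrain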
Pretraining
(Inside the Docker container) To run pretraining, please use the following command:
horovodrun -np $N_GPU python pretrain.py --config config/uc2_pretrain.json
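For example, on a single node with 8 GPUs (the GPU count here is an assumption; set $N_GPU to match your machine):

export N_GPU=8
horovodrun -np $N_GPU python pretrain.py --config config/uc2_pretrain.json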
Downstream Task Finetuning
Text-to-Image Retrieval
To run the finetuning experiment for the text-to-image retrieval task, please use the following command:
horovodrun -np $N_GPU python itm.py --config config/uc2_mscoco_itm.json
Citation
If you find this code useful for your research, please consider citing:
@InProceedings{zhou2021uc,
author = {Zhou, Mingyang and Zhou, Luowei and Wang, Shuohang and Cheng, Yu and Li, Linjie and Yu, Zhou and Liu, Jingjing},
title = {UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training},
booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2021)},
year = {2021},
month = {June},
abstract = {Vision-and-language pre-training has achieved impressive success in learning multimodal representations between vision and language. To generalize this success to non-English languages, we introduce UC2 , the first machine translation-augmented framework for cross-lingual cross-modal representation learning. To tackle the scarcity problem of multilingual captions for image datasets, we first augment existing English-only datasets with other languages via machine translation (MT). Then we extend the standard Masked Language Modeling and Image-Text Matching training objectives to multilingual setting, where alignment between different languages is captured through shared visual context (i.e., using image as pivot). To facilitate the learning of a joint embedding space of images and all languages of interest, we further propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM), leveraging MT-enhanced translated data. Evaluation on multilingual image-text retrieval and multilingual visual question answering benchmarks demonstrates that our proposed framework achieves new state of the art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.},
url = {https://www.microsoft.com/en-us/research/publication/uc2-universal-cross-lingual-cross-modal-vision-and-language-pre-training/},
}
Acknowledgement
Our code is mainly based on Linjie Li and Yen-Chun Chen's project UNITER. We thank the authors for open-sourcing their code and for helpful discussions on the implementation. Portions of the code also use resources from transformers.
License
MIT