ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration

ROSITA

News & Updates

(24/08/2021)

Release the demo to perform fine-grained semantic alignments using the pretrained ROSITA model.

(15/08/2021)

Release the basic framework for ROSITA, including the pretrained base ROSITA model, as well as the scripts to run the fine-tuning and evaluation on three downstream tasks (i.e., VQA, REC, ITR) over six datasets.

Introduction

This repository contains the source code necessary to reproduce the results presented in our ACM MM paper ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration, which encodes the cROSs- and InTrA-modal prior knowledge in a unified scene graph to perform knowledge-guided vision-and-language pretraining. Compared with existing counterparts, ROSITA learns better fine-grained semantic alignments across different modalities, thus improving the capability of the pretrained model.

Performance

We compare ROSITA against existing state-of-the-art VLP methods on three downstream tasks. All methods use the base model of Transformer for a fair comparison. The trained checkpoints to reproduce these results are provided in finetune.md.

Task	Dataset	Metrics	ROSITA	SoTA-base
VQA	VQAv2	dev | std	73.91 | 73.97	73.59 | 73.67
REC	RefCOCO	val | testA | testB	84.79 | 87.99 | 78.28	81.56 | 87.40 | 74.48
REC	RefCOCO+	val | testA | testB	76.06 | 82.01 | 67.40	76.05 | 81.65 | 65.70
REC	RefCOCOg	val | test	78.23 | 78.25	75.90 | 75.93
ITR	IR-COCO	R@1 | R@5 | R@10	54.40 | 80.92 | 88.60	54.00 | 80.80 | 88.50
ITR	TR-COCO	R@1 | R@5 | R@10	71.26 | 91.62 | 95.58	70.00 | 91.10 | 95.50
ITR	IR-Flickr	R@1 | R@5 | R@10	74.08 | 92.44 | 96.08	74.74 | 92.86 | 95.82
ITR	TR-Flickr	R@1 | R@5 | R@10	88.90 | 98.10 | 99.30	86.60 | 97.90 | 99.20
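
To read the table as margins rather than raw scores, the following is a small sketch that subtracts the SoTA-base score from the ROSITA score for every metric. The scores are copied from the table above; the names `RESULTS`, `gains`, and `regressions` are illustrative and not part of the repository.

```python
# Scores copied from the performance table, stored as (ROSITA, SoTA-base).
RESULTS = {
    "VQAv2": {"dev": (73.91, 73.59), "std": (73.97, 73.67)},
    "RefCOCO": {"val": (84.79, 81.56), "testA": (87.99, 87.40), "testB": (78.28, 74.48)},
    "RefCOCO+": {"val": (76.06, 76.05), "testA": (82.01, 81.65), "testB": (67.40, 65.70)},
    "RefCOCOg": {"val": (78.23, 75.90), "test": (78.25, 75.93)},
    "IR-COCO": {"R@1": (54.40, 54.00), "R@5": (80.92, 80.80), "R@10": (88.60, 88.50)},
    "TR-COCO": {"R@1": (71.26, 70.00), "R@5": (91.62, 91.10), "R@10": (95.58, 95.50)},
    "IR-Flickr": {"R@1": (74.08, 74.74), "R@5": (92.44, 92.86), "R@10": (96.08, 95.82)},
    "TR-Flickr": {"R@1": (88.90, 86.60), "R@5": (98.10, 97.90), "R@10": (99.30, 99.20)},
}


def gains(results):
    """Return {dataset: {metric: ROSITA minus SoTA-base}}, rounded to 2 decimals."""
    return {
        dataset: {metric: round(ours - base, 2) for metric, (ours, base) in metrics.items()}
        for dataset, metrics in results.items()
    }


def regressions(results):
    """List the (dataset, metric) pairs where ROSITA scores below SoTA-base."""
    return [
        (dataset, metric)
        for dataset, metrics in results.items()
        for metric, (ours, base) in metrics.items()
        if ours < base
    ]


if __name__ == "__main__":
    for dataset, metrics in gains(RESULTS).items():
        print(dataset, metrics)
    print("below SoTA-base:", regressions(RESULTS))
```

With the numbers as listed, the only entries where SoTA-base is higher are IR-Flickr R@1 and R@5; every other metric favors ROSITA.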

Installation

Software and Hardware Requirements

We recommend a workstation with 4 GPUs (>= 24GB each, e.g., RTX 3090 or V100), 120GB of memory, and 50GB of free disk space. We strongly recommend using an SSD to guarantee high-speed I/O. You should also install the following packages first:

Python >= 3.6
PyTorch >= 1.4 with CUDA >= 10.2
torchvision >= 0.5.0
Cython
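
The requirements above can be checked before building anything. The following is a minimal sketch, not a script shipped with the repository: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helper names, and the comparison only looks at the leading numeric parts of each version string.

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MINIMUMS = {"torch": "1.4", "torchvision": "0.5.0"}


def parse_version(text):
    """Turn a string such as '1.7.1+cu110' into (1, 7, 1), ignoring suffixes."""
    core = text.split("+")[0]
    parts = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, required):
    """Compare two version strings numerically, padding the shorter with zeros."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment():
    """Return {requirement: bool} for the packages listed above."""
    report = {"python": sys.version_info >= (3, 6)}
    for name, minimum in MINIMUMS.items():
        try:
            module = __import__(name)
        except ImportError:
            report[name] = False
            continue
        report[name] = meets_minimum(module.__version__, minimum)
    try:
        import torch
        cuda = torch.version.cuda  # None on CPU-only builds
        report["cuda"] = cuda is not None and meets_minimum(cuda, "10.2")
    except ImportError:
        report["cuda"] = False
    try:
        import Cython  # noqa: F401
        report["cython"] = True
    except ImportError:
        report["cython"] = False
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(requirement, "OK" if ok else "MISSING or too old")
```

The check reports missing packages instead of raising, so it can be run on a fresh machine before anything is installed.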

# git clone
$ git clone https://github.com/MILVLG/rosita.git 

# build essential utils
$ cd rosita/rosita/utils/rec
$ python setup.py build
$ cp build/lib*/bbox.cpython*.so .
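
After the copy step, it can be worth confirming that the compiled extension actually landed in the current directory. The sketch below only looks for a file matching the `bbox.cpython*.so` pattern used in the command above; `find_built_extension` is a hypothetical helper, not something the repository provides.

```python
import glob
import os


def find_built_extension(directory=".", stem="bbox"):
    """Return sorted paths of compiled '<stem>.cpython*.so' files in `directory`."""
    pattern = os.path.join(directory, stem + ".cpython*.so")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    # Intended to be run from the directory where the build commands were executed.
    found = find_built_extension()
    if found:
        print("compiled extension found:", found[0])
    else:
        print("no bbox.cpython*.so here; re-run the build and copy steps")
```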

Dataset Setup

To download the required datasets to run this project, please check datasets.md for details.

Pretraining

Please check pretrain.md for details on ROSITA pretraining. We currently provide only the pretrained model, which can be used for finetuning on downstream tasks. The pretraining code will be released later.

Finetuning

Please check finetune.md for details on finetuning on downstream tasks. Finetuning scripts are provided, along with trained models that can be evaluated directly to reproduce the results.

Demo

We provide the Jupyter notebook scripts for reproducing the visualization results shown in our paper.

Acknowledgment

We thank the well-known open-source projects LXMERT, UNITER, OSCAR, and Huggingface, which helped us a lot when writing our code.

Yuhao Cui (@cuiyuhao1996) and Tong-An Luo (@Zoroaster97) are the main contributors to this repository. Please contact them if you find any issues.

Citations

Please consider citing this paper if you use the code:

@inProceedings{cui2021rosita,
  title={ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration},
  author={Cui, Yuhao and Yu, Zhou and Wang, Chunqi and Zhao, Zhongzhou and Zhang, Ji and Wang, Meng and Yu, Jun},
  booktitle={Proceedings of the 29th ACM International Conference on Multimedia},
  year={2021}
}

Comments

Naming error in the provided vqa2.0 files

There is a naming error in the provided vqa2.0 files in datasets.md. The file names of the minival2014 files v2_mscoco_minival2014_annotations.json and v2_mscoco_minival2014_questions.json should be interchanged.

opened by ChCh1999 1

accuracy on test split

Hi, Your article has a great idea, thank you for sharing it, and thank you for your open source code. It's really great and helpful.

I ran the finetuning code with "scripts/train-vqa-vqav2.sh", for "train+trainvalsplit of VQAv2" and I achieved the accuracy successfully.

But after that when I ran the code with "scripts/test-vqa-vqav2.sh" for "evaluation on the test split" it will run ok but doesn't show any result of accuracy as it was in "train-vqa".

Could you please help me with how can I see the result of accuracy for the test split too? I'm a beginner in python and don't know how to do it.

Thanks.

opened by saeideh02 0

Owner

Vision and Language Group @ MIL
