This repository contains code for the following two papers:
VisualBERT: A Simple and Performant Baseline for Vision and Language (arxiv), with a short version titled What Does BERT with Vision Look At? published at ACL 2020.
Under the folder
is code (the original VisualBERT), where we pre-train a Transformer for vision-and-language (V&L) tasks on image-caption data.
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions, published at NAACL 2021.
Under the folder
is code (Unsupervised VisualBERT), where we pre-train a V&L Transformer without aligned image-caption pairs. Instead, we pre-train using only unaligned images and text, and achieve performance competitive with many models supervised with aligned data.
VisualBERT has also been integrated into several libraries, such as Hugging Face Transformers (many thanks to Gunjan Chhablani, who made it work) and Facebook MMF.
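As a rough illustration of the Hugging Face integration, the sketch below runs a forward pass through `VisualBertModel` with text tokens and image-region features as a single input sequence. It assumes `torch` and `transformers` are installed. The configuration sizes are arbitrary small values chosen for a quick check, the weights are randomly initialised rather than loaded from a released checkpoint, and the random `visual_embeds` tensor is only a stand-in for region features that would normally come from an object detector.

```python
import torch
from transformers import VisualBertConfig, VisualBertModel

# Tiny, randomly initialised model. These sizes are illustrative assumptions,
# not the settings of any released VisualBERT checkpoint.
config = VisualBertConfig(
    vocab_size=100,
    hidden_size=32,
    visual_embedding_dim=16,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=64,
)
model = VisualBertModel(config).eval()

batch, text_len, num_regions = 1, 6, 4

# Text side: random token ids stand in for a tokenised caption.
input_ids = torch.randint(0, config.vocab_size, (batch, text_len))
token_type_ids = torch.zeros(batch, text_len, dtype=torch.long)
attention_mask = torch.ones(batch, text_len, dtype=torch.long)

# Visual side: random vectors stand in for detector region features.
visual_embeds = torch.randn(batch, num_regions, config.visual_embedding_dim)
visual_token_type_ids = torch.ones(batch, num_regions, dtype=torch.long)
visual_attention_mask = torch.ones(batch, num_regions, dtype=torch.long)

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        token_type_ids=token_type_ids,
        attention_mask=attention_mask,
        visual_embeds=visual_embeds,
        visual_token_type_ids=visual_token_type_ids,
        visual_attention_mask=visual_attention_mask,
    )

# One hidden state per text token followed by one per image region.
print(outputs.last_hidden_state.shape)
```

To use pre-trained weights, the model would instead be loaded with `VisualBertModel.from_pretrained(...)` and a checkpoint name from the Hugging Face Hub; the input layout stays the same.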