This repository contains code for the following two papers:
- **VisualBERT: A Simple and Performant Baseline for Vision and Language** (arXiv), with a short version titled *What Does BERT with Vision Look At?* published at ACL 2020. Under the folder `visualbert` is the code for the original VisualBERT, where we pre-train a Transformer for vision-and-language (V&L) tasks on image-caption data.
- **Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions**, published at NAACL 2021. Under the folder `unsupervised_visualbert` is the code for Unsupervised VisualBERT, where we pre-train a V&L Transformer without aligned image-caption pairs. Instead, we pre-train using only unaligned images and text, and achieve performance competitive with many models supervised with aligned data.
VisualBERT has also been integrated into several libraries, such as Hugging Face Transformers (many thanks to Gunjan Chhablani who made it work) and Facebook MMF.
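
For reference, below is a minimal sketch of running VisualBERT through the Hugging Face Transformers integration. The checkpoint name `uclanlp/visualbert-vqa-coco-pre`, the number of regions, and the 2048-dimensional random visual features are illustrative assumptions; in practice the visual features would come from an object detector.

```python
# Minimal sketch: VisualBERT via Hugging Face Transformers.
# Assumptions: checkpoint "uclanlp/visualbert-vqa-coco-pre", 36 regions,
# 2048-dim features. Random tensors stand in for real detector output.
import torch
from transformers import BertTokenizer, VisualBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = VisualBertModel.from_pretrained("uclanlp/visualbert-vqa-coco-pre")

# Text inputs
inputs = tokenizer("A person is riding a horse.", return_tensors="pt")

# Dummy visual inputs: (batch, num_regions, feature_dim)
visual_embeds = torch.randn(1, 36, 2048)
visual_token_type_ids = torch.ones(visual_embeds.shape[:-1], dtype=torch.long)
visual_attention_mask = torch.ones(visual_embeds.shape[:-1], dtype=torch.float)

outputs = model(
    **inputs,
    visual_embeds=visual_embeds,
    visual_token_type_ids=visual_token_type_ids,
    visual_attention_mask=visual_attention_mask,
)
print(outputs.last_hidden_state.shape)  # (1, num_text_tokens + 36, 768)
```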
Thanks~