project page for VinVL

Last update: Jan 9, 2023

Overview

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL and contains the instructions needed to reproduce the results presented in the paper. We present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model that provides object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments, we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (code), and use an improved approach to pre-train the VL model and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, creating new state-of-the-art results on seven public benchmarks.

Performance

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
Metric	R@1	R@5	R@1	R@5	B@4	M	C	S	C	S	test-std	test-P	test-std
SoTA_S	39.2	68.0	56.6	84.5	38.9	29.2	129.8	22.4	61.5	9.2	70.92	58.80	63.17
SoTA_B	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.67	79.30	61.62
SoTA_L	57.5	82.8	73.5	92.2	41.7	30.6	140.0	24.5	-	-	74.93	81.47	-
-----	---	---	---	---	---	---	---	---	---	---	---	---	---
VinVL_B	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	76.12	83.08	64.65
VinVL_L	58.8	83.5	75.4	92.9	41.0	31.1	140.9	25.2	-	-	76.62	83.98	-
gain	1.3	0.7	1.9	0.6	-0.7	0.5	0.9	0.7	5.9	0.7	1.69	2.51	1.48

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO.

Leaderboard results

VinVL has achieved the top position on several VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from bottom-up and top-down model (code).

We observe uniform improvements on seven VL tasks by replacing the visual features from the bottom-up and top-down model with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for the remaining tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are BERT-Base size. As analyzed in Section 5.2 of the VinVL paper, the new visual features contribute 95% of the improvement.

Task	t2i	t2i	i2t	i2t	IC	IC	IC	IC	NoCaps	NoCaps	VQA	NLVR2	GQA
metric	R@1	R@5	R@1	R@5	B@4	M	C	S	C	S	test-std	test-P	test-std
bottom-up and top-down model	54.0	80.8	70.0	91.1	40.5	29.7	137.6	22.8	86.58	12.38	73.16	78.07	61.62
VinVL (ours)	58.1	83.2	74.6	92.6	40.9	30.9	140.6	25.1	92.46	13.07	75.95	83.08	64.65
gain	4.1	2.4	4.6	1.5	0.4	1.2	3.0	2.3	5.9	0.7	2.79	4.71	3.03
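
The gain row is the element-wise difference between the two rows above it. As a minimal sketch, the snippet below recomputes it for the four retrieval columns only, using the values from the table; the dictionary keys and the helper name `compute_gain` are illustrative and not part of the VinVL code.

```python
# Retrieval scores copied from the comparison table above.
baseline = {"t2i R@1": 54.0, "t2i R@5": 80.8, "i2t R@1": 70.0, "i2t R@5": 91.1}
vinvl = {"t2i R@1": 58.1, "t2i R@5": 83.2, "i2t R@1": 74.6, "i2t R@5": 92.6}


def compute_gain(ours, base, digits=1):
    """Return ours - base per metric, rounded to the table's precision."""
    return {name: round(ours[name] - base[name], digits) for name in base}


gains = compute_gain(vinvl, baseline)
for name, value in gains.items():
    print(f"{name}: {value:+.1f}")
```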

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection model can be downloaded here. With code from our Scene Graph Benchmark Repo (to be released soon), one can extract features with the following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output feature will be encoded as base64.
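
A base64-encoded feature string can be turned back into per-region vectors roughly as follows. This is a minimal sketch under stated assumptions, not the repository's own loader: it assumes a tab-separated row of image id, number of boxes, and base64 payload (the layout used by the decoding snippet in the comments on this page), and it assumes the payload holds little-endian float32 values. The function names `decode_features` and `parse_feature_line` are hypothetical. For real files, NumPy's `np.frombuffer` is the more usual route; the standard library is used here only to keep the example dependency-free.

```python
import base64
import struct


def decode_features(num_boxes, encoded):
    """Decode a base64 string into num_boxes rows of float32 values."""
    raw = base64.b64decode(encoded)
    if num_boxes <= 0 or len(raw) % 4 != 0:
        raise ValueError("payload is not a whole number of float32 values")
    count = len(raw) // 4
    if count % num_boxes != 0:
        raise ValueError("value count is not divisible by the number of boxes")
    dim = count // num_boxes
    # Assumption: values are stored as little-endian float32.
    values = struct.unpack(f"<{count}f", raw)
    return [list(values[i * dim:(i + 1) * dim]) for i in range(num_boxes)]


def parse_feature_line(line):
    """Split one TSV row (assumed: image id, box count, base64 payload)."""
    image_id, num_boxes, encoded = line.rstrip("\n").split("\t")[:3]
    return image_id, decode_features(int(num_boxes), encoded)


# Round trip on a synthetic row: 2 boxes with 3 values each.
payload = base64.b64encode(struct.pack("<6f", 0.5, 1.0, 1.5, 2.0, 2.5, 3.0))
image_id, features = parse_feature_line(f"42\t2\t{payload.decode('ascii')}\n")
print(image_id, len(features), len(features[0]))
```

The feature dimension is inferred from the payload length and the box count, so the same function works whatever width the detector emits.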

Find more pretrained models in DOWNLOAD.

Pre-extracted Image Features

For ease-of-use, we make pretrained features and predictions available for all pretraining datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.

Pretrained Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing this paper if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}

Comments

Where can I find "train_vgoi6_clipped.yaml" and "test_vgoi6_clipped.yaml"?

Hello! Thank you so much for providing the well-structured code!

I'm trying to extract image regions features by python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True, but I can't find the download links for two config files specified in vinvl_x152c4.yaml, namely TRAIN: ("visualgenome/train_vgoi6_clipped.yaml",) and TEST: ("visualgenome/test_vgoi6_clipped.yaml",)

What I could find were these two TRAIN: ("visualgenome/train_danfeiX_relation_nm.yaml",), TEST: ("visualgenome/test_danfeiX_relation.yaml",) after downloading visualgenome by path/to/azcopy copy 'https://penzhanwu2.blob.core.windows.net/sgg/sgg_benchmark/datasets/visualgenome' <target folder> --recursive. However, with these two config files, my program would complain KeyError: 'box_features', which seemed to be caused by the returned prediction not including the 'box_features' field.

Any suggestions? Thanks a lot!

opened by zdxdsw 7
Link to "Scene Graph Benchmark Repo" is not available

Hi, the link to the code repo "Scene Graph Benchmark Repo" is broken. I would like to use it for extracting features on my own dataset. Could you please update the broken link? Thanks in advance!

opened by Tclz 3
Features Dimensionality
Hi,

I downloaded pre-trained COCO features using the command

<path/to/azcopy> copy https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/ <local_path> --recursive

then I decoded the features.tsv file using the following code (showing only the first 10 results):

import base64
import numpy as np

# Each row: image id, number of detections, base64-encoded float32 features.
with open('features.tsv', 'r') as data_file:
    for i, line in enumerate(data_file):
        data = line.split('\t')
        id = int(data[0])
        detections = int(data[1])
        features = np.frombuffer(base64.b64decode(data[2]), np.float32).reshape((detections, -1))
        print(features.shape)
        if i == 10:
            break

What seems unusual to me is that the feature dimensionality obtained this way is 2054. It could be totally OK; it's just that all other object detectors I've worked with have a feature dimensionality that is a power of 2, usually 2048 or 1024. I've also checked the paper and didn't find any reference to the feature dimensionality, so I was wondering whether this is correct or whether I made a mistake decoding the features.

Thanks!
opened by eugeniotonanzi 2
Broken link to code

The link to "Scene Graph Benchmark Repo" on this page is broken. Crucially, this is the one that contains the code to get the image-region features.

Could you update it so that it's possible to run the code?

Thanks, Luke

opened by lukerm 2
Pre-extracted Image Features: what OD model is used?

Hi, In here, we can easily use pre-extracted image features.

I thought these features were from the VinVL OD model trained on the four merged datasets: COCO with stuff, Visual Genome, Objects365 and Open Images.

However, I found that the features and corresponding labels (object tags) come only from the Visual Genome dataset, which gives inferior performance compared with the four merged datasets (according to the VinVL paper).

So I want to clarify whether the given image features are from the pretrained X152-C4 object-attribute detection (based on only the Visual Genome dataset) or from the pretrained model on the merged four datasets.

Thanks

opened by ahnjaewoo 1
Getting error when downloading VinVL pretrained model and
Thank you for providing the code for VinVL. I am getting similar error when I try to download the models given here - https://github.com/pzzhang/VinVL/blob/main/DOWNLOAD.md#pre-trained-models

azcopy copy https://penzhanwu2.blob.core.windows.net/results/vinvl/od_models/vinvl_vg_x152c4.pth .
INFO: Scanning...
failed to perform copy command due to error: Login Credentials missing. No SAS token or OAuth token is present and the resource is not public

I am getting a similar error when I try to download the associated labelmap at this link https://penzhanwu2.blob.core.windows.net/results/vinvl/od_models/VG-SGG-dicts-vgoi6-clipped.json. However, I do not get the error when I try to download the features from this link - https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/. Let me know how I can resolve this issue.
opened by gsrivas4 1
About image labels for pretraining

Thanks for the great work! Yet I find there are missing resources for pretraining, making the reproduction of the results impossible. This issue was posted in the Oscar repo but no one responded.

I followed https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md#pre-exacted-image-features to prepare image features, and followed https://github.com/microsoft/Oscar/blob/master/VinVL_MODEL_ZOO.md#oscarplus-pretraining for pre-training. But I cannot find the image labels for the pre-training datasets, e.g. COCO, flickr30k, GQA.

As shown in https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml, we need to prepare image_label_path.

corpus: coco_flickr30k_gqa_googlecc_sbu_oi
corpus_file: coco_flickr30k_googlecc_gqa_sbu_oi.tsv
image_label_path:
  coco: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/coco
  flickr30k: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/flickr30k
  gqa: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/gqa
  googlecc: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/googlecc
  sbu: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/sbu
  oi: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/oi
image_feature_path:
  coco: vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
  flickr30k: vinvl/image_features/flickr30k_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
  gqa: vinvl/image_features/gqa_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
  googlecc: vinvl/image_features/googlecc_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
  sbu: vinvl/image_features/sbu_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
  oi: vinvl/image_features/oi_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000

To be specific, there is no guidance to download or generate predictions_gt.tsv and QA_fileB.tsv files, which are needed for pre-training in https://github.com/microsoft/Oscar/blob/master/oscar/datasets/oscar_tsv.py#L383-L385.

opened by yikaiw 0
Exact list of images that were used in training and development

Is there an exact list of image IDs that were used in training and tuning of hyperparameters (ie images that were part of any dev sets that were potentially used)?

While this list might be useful in general, my specific reason for asking is that I want to verify whether the GQA features provided for download were produced by a model that was trained with images from the GQA balanced validation set (or the GQA testdev set).

Thanks!

opened by dreichCSL 0
Question about the demo of visualization

Hi, wonderful project! I have a question about the visualization. The command for visualizing the detections from the pretrained models in your README.md is:

python tools/demo/demo_image.py --config_file sgg_configs/vgattr/vinvl_x152c4.yaml --img_file ../maskrcnn-benchmark-1/datasets1/imgs/woman_fish.jpg --save_file output/woman_fish_x152c4.obj.jpg MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION False

But, in the README.md of Scene Graph Benchmark Repo, the corresponding command is:

python tools/demo/demo_image.py --config_file sgg_configs/vgattr/vinvl_x152c4.yaml --img_file demo/woman_fish.jpg --save_file output/woman_fish_x152c4.obj.jpg MODEL.WEIGHT pretrained_model/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 TEST.IGNORE_BOX_REGRESSION False

There is no DATA_DIR argument in the command from the Scene Graph Benchmark Repo. What is the difference, and why is DATA_DIR introduced?

opened by ForawardStar 0
Provide merged dataset

In https://github.com/pzzhang/VinVL/blob/main/DOWNLOAD.md#pre-trained-models you explain that you merged COCO with stuff, Visual Genome, Objects365 and Open Images into one dataset. Could you please provide this merged dataset or scripts on how to create this dataset?

opened by dreamflasher 0
MS-COCO 1K Testing Set of Image-Text Retrieval

Hi! I have a question about the 1K testing set of image-text retrieval. In your dataset, there is a file "test_img_keys_1k.tsv". Do you test your model on these 1K testing images instead of on 5 folds of the whole 5K testing images?

opened by LibertFan 0
