
VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

02/28/2021: Project page built.

Introduction

This repository is the project page for VinVL, containing the instructions necessary to reproduce the results presented in the paper. We present a detailed study of improving visual representations for vision-language (VL) tasks and develop an improved object detection model to provide object-centric representations of images. Compared to the most widely used bottom-up and top-down model (code), the new model is bigger, better designed for VL tasks, and pre-trained on much larger training corpora that combine multiple public annotated object detection datasets. It can therefore generate representations of a richer collection of visual objects and concepts. While previous VL research focuses mainly on improving the vision-language fusion model and leaves the object detection model untouched, we show that visual features matter significantly in VL models. In our experiments we feed the visual features generated by the new object detection model into a Transformer-based VL fusion model, OSCAR (code), use an improved approach to pre-train the VL model, and fine-tune it on a wide range of downstream VL tasks. Our results show that the new visual features significantly improve performance across all VL tasks, setting new state-of-the-art results on seven public benchmarks.

Performance

| Model | t2i R@1 | t2i R@5 | i2t R@1 | i2t R@5 | IC B@4 | IC M | IC C | IC S | NoCaps C | NoCaps S | VQA test-std | NLVR2 test-P | GQA test-std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SoTA_S | 39.2 | 68.0 | 56.6 | 84.5 | 38.9 | 29.2 | 129.8 | 22.4 | 61.5 | 9.2 | 70.92 | 58.80 | 63.17 |
| SoTA_B | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.67 | 79.30 | 61.62 |
| SoTA_L | 57.5 | 82.8 | 73.5 | 92.2 | 41.7 | 30.6 | 140.0 | 24.5 | - | - | 74.93 | 81.47 | - |
| VinVL_B | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 76.12 | 83.08 | 64.65 |
| VinVL_L | 58.8 | 83.5 | 75.4 | 92.9 | 41.0 | 31.1 | 140.9 | 25.2 | - | - | 76.62 | 83.98 | - |
| gain | 1.3 | 0.7 | 1.9 | 0.6 | -0.7 | 0.5 | 0.9 | 0.7 | 5.9 | 0.7 | 1.69 | 2.51 | 1.48 |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO; B@4, M, C, S: BLEU-4, METEOR, CIDEr, SPICE. SoTA_S, SoTA_B, SoTA_L: previous state of the art with models smaller than BERT-base, of BERT-base size, and of BERT-large size, respectively.

Leaderboard results

VinVL has achieved the top position on several VL leaderboards, including Visual Question Answering (VQA), Microsoft COCO Image Captioning, Novel Object Captioning (nocaps), and Visual Commonsense Reasoning (VCR).

Comparison with image features from bottom-up and top-down model (code).

We observe uniform improvements on seven VL tasks by replacing the visual features from the bottom-up and top-down model with ours. The NoCaps baseline is from VIVO, and our results are obtained by directly replacing the visual features. The baselines for the remaining tasks are from OSCAR, and our results are obtained by replacing the visual features and performing OSCAR+ pre-training. All models are of BERT-Base size. As analyzed in Section 5.2 of the VinVL paper, the new visual features contribute 95% of the improvement.

| Model | t2i R@1 | t2i R@5 | i2t R@1 | i2t R@5 | IC B@4 | IC M | IC C | IC S | NoCaps C | NoCaps S | VQA test-std | NLVR2 test-P | GQA test-std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bottom-up and top-down | 54.0 | 80.8 | 70.0 | 91.1 | 40.5 | 29.7 | 137.6 | 22.8 | 86.58 | 12.38 | 73.16 | 78.07 | 61.62 |
| VinVL (ours) | 58.1 | 83.2 | 74.6 | 92.6 | 40.9 | 30.9 | 140.6 | 25.1 | 92.46 | 13.07 | 75.95 | 83.08 | 64.65 |
| gain | 4.1 | 2.4 | 4.6 | 1.5 | 0.4 | 1.2 | 3.0 | 2.3 | 5.9 | 0.7 | 2.79 | 4.71 | 3.03 |

Please see the following two figures for visual comparison.

Source code

Pretrained Faster-RCNN model and feature extraction

The pretrained X152-C4 object-attribute detection model can be downloaded here. With code from our Scene Graph Benchmark Repo (to be released soon), one can extract features with the following command:

python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True TEST.OUTPUT_FEATURE True

The output features will be base64-encoded.
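
As a quick sanity check, the extracted TSV can be decoded along the lines below. This is only a sketch: it assumes the per-line layout used by the released feature files (image id, number of regions, base64-encoded float32 features), and the file name features.tsv is illustrative.

import base64

import numpy as np

# Illustrative path; point this at the features TSV produced by test_sg_net.py.
with open("features.tsv") as tsv:
    for line in tsv:
        image_id, num_boxes, encoded = line.rstrip("\n").split("\t")
        # Decode the base64 payload into a (num_boxes, feature_dim) float32 array.
        features = np.frombuffer(base64.b64decode(encoded), dtype=np.float32)
        features = features.reshape(int(num_boxes), -1)
        print(image_id, features.shape)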

Find more pretrained models in DOWNLOAD.

Pre-extracted Image Features

For ease of use, we make pre-extracted image features and predictions available for all pre-training datasets and downstream tasks. Please find the instructions to download them in DOWNLOAD.
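
For example, the pre-extracted COCO features can be fetched with azcopy; the URL below is the one quoted in a user comment further down this page, and DOWNLOAD remains the authoritative reference:

<path/to/azcopy> copy https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/ <local_path> --recursive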

Pretrained Oscar+ models and VL downstream tasks

The code to produce all vision-language results (both pretraining and downstream task finetuning) can be found in our OSCAR repo. One can find the model zoo for vision-language tasks here.

Citations

Please consider citing the following papers if you use the code:

@article{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  journal={ECCV 2020},
  year={2020}
}

@article{zhang2021vinvl,
  title={VinVL: Making Visual Representations Matter in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  journal={CVPR 2021},
  year={2021}
}
Comments
  • Where can I find "train_vgoi6_clipped.yaml" and "test_vgoi6_clipped.yaml"?

    Hello! Thank you so much for providing the well-structured code!

    I'm trying to extract image region features with python tools/test_sg_net.py --config-file sgg_configs/vgattr/vinvl_x152c4.yaml TEST.IMS_PER_BATCH 2 MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION True MODEL.ATTRIBUTE_ON True, but I can't find download links for the two config files specified in vinvl_x152c4.yaml, namely TRAIN: ("visualgenome/train_vgoi6_clipped.yaml",) and TEST: ("visualgenome/test_vgoi6_clipped.yaml",).

    What I could find, after downloading visualgenome with path/to/azcopy copy 'https://penzhanwu2.blob.core.windows.net/sgg/sgg_benchmark/datasets/visualgenome' <target folder> --recursive, were TRAIN: ("visualgenome/train_danfeiX_relation_nm.yaml",) and TEST: ("visualgenome/test_danfeiX_relation.yaml",). However, with these two config files, my program complains KeyError: 'box_features', which seems to be caused by the returned prediction not including the 'box_features' field.

    Any suggestions? Thanks a lot!

    opened by zdxdsw 7
  • Link to "Scene Graph Benchmark Repo" is not available

    Hi, the link to the code repo "Scene Graph Benchmark Repo" is broken. I would like to use it for extracting features on my own dataset. Could you please update the broken link? Thanks in advance!

    opened by Tclz 3
  • Features Dimensionality

    Hi,

    I downloaded pre-trained COCO features using the command

    <path/to/azcopy> copy https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/ <local_path> --recursive
    

    then I decoded the features.tsv file using the following code (showing only the first 10 results):

    import base64
    import numpy as np

    with open('features.tsv', 'r') as data_file:
      for i, line in enumerate(data_file):
        data = line.split('\t')
        image_id = int(data[0])
        detections = int(data[1])
        features = np.frombuffer(base64.b64decode(data[2]), np.float32).reshape((detections, -1))
        print(features.shape)
        if i == 10:
          break
    

    What seems unusual to me is that the feature dimensionality obtained this way is 2054. It could be totally fine; it's just that all other object detectors I've worked with have a feature dimensionality that is a power of 2, usually 2048 or 1024. I also checked the paper and didn't find any reference to the feature dimensionality, so I was wondering whether this is correct or whether I made a mistake decoding the features.

    Thanks!

    opened by eugeniotonanzi 2
  • Broken link to code

    The link to "Scene Graph Benchmark Repo" on this page is broken. Crucially, this is the one that contains the code to get the image-region features.

    Could you update it so that it's possible to run the code?

    Thanks, Luke

    opened by lukerm 2
  • Pre-extracted Image Features: what OD model is used?

    Hi, here we can easily use pre-extracted image features.

    I thought these features were from the VinVL OD model trained on the merged four datasets: COCO with stuff, Visual Genome, Objects365 and Open Images.

    However, I found that the features and corresponding labels (object tags) are only from the Visual Genome dataset, which gives inferior performance compared to the merged four datasets (according to the VinVL paper).

    So I want to clarify whether the given image features come from the pretrained X152-C4 object-attribute detection model (trained only on the Visual Genome dataset) or from the model pretrained on the merged four datasets.

    Thanks

    opened by ahnjaewoo 1
  • Getting error when downloading VinVL pretrained model and

    Thank you for providing the code for VinVL. I am getting an error when I try to download the models given here - https://github.com/pzzhang/VinVL/blob/main/DOWNLOAD.md#pre-trained-models

    azcopy copy https://penzhanwu2.blob.core.windows.net/results/vinvl/od_models/vinvl_vg_x152c4.pth .
    INFO: Scanning...
    
    failed to perform copy command due to error: Login Credentials missing. No SAS token or OAuth token is present and the resource is not public
    

    I am getting a similar error when I try to download the associated labelmap at this link: https://penzhanwu2.blob.core.windows.net/results/vinvl/od_models/VG-SGG-dicts-vgoi6-clipped.json. However, I do not get the error when I try to download the features from this link - https://biglmdiag.blob.core.windows.net/vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000/. Let me know how I can resolve this issue.

    opened by gsrivas4 1
  • About image labels for pretraining

    Thanks for the great work! Yet I find there are missing resources for pretraining, making reproduction of the results impossible. This issue was posted in the Oscar repo but no one responded.

    I followed https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md#pre-exacted-image-features to prepare the image features, and followed https://github.com/microsoft/Oscar/blob/master/VinVL_MODEL_ZOO.md#oscarplus-pretraining for pre-training, but I cannot find where the image labels for the pre-training datasets are, e.g. COCO, flickr30k, GCA.

    As shown in https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml, we need to prepare image_label_path:

      corpus: coco_flickr30k_gqa_googlecc_sbu_oi
      corpus_file: coco_flickr30k_googlecc_gqa_sbu_oi.tsv
      image_label_path:
        coco: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/coco
        flickr30k: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/flickr30k
        gqa: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/gqa
        googlecc: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/googlecc
        sbu: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/sbu
        oi: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/oi
      image_feature_path:
        coco: vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
        flickr30k: vinvl/image_features/flickr30k_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
        gqa: vinvl/image_features/gqa_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
        googlecc: vinvl/image_features/googlecc_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
        sbu: vinvl/image_features/sbu_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
        oi: vinvl/image_features/oi_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000

    To be specific, there is no guidance on how to download or generate the predictions_gt.tsv and QA_fileB.tsv files, which are needed for pre-training in https://github.com/microsoft/Oscar/blob/master/oscar/datasets/oscar_tsv.py#L383-L385.

    opened by yikaiw 0
  • Exact list of images that were used in training and development

    Is there an exact list of image IDs that were used for training and for tuning hyperparameters (i.e. images that were part of any dev sets that were potentially used)?

    While this list might be useful in general, my specific reason is that I want to verify whether or not the GQA features provided for download were produced by a model that was trained with images from the GQA balanced validation set (or the GQA testdev set).

    Thanks!

    opened by dreichCSL 0
  • Question about the demo of visualization

    Hi, wonderful project! I have a question about the visualization. The command for visualizing the detections from the pretrained models in your README.md is:

    python tools/demo/demo_image.py --config_file sgg_configs/vgattr/vinvl_x152c4.yaml --img_file ../maskrcnn-benchmark-1/datasets1/imgs/woman_fish.jpg --save_file output/woman_fish_x152c4.obj.jpg MODEL.WEIGHT models/vinvl/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 DATA_DIR "../maskrcnn-benchmark-1/datasets1" TEST.IGNORE_BOX_REGRESSION False

    But in the README.md of the Scene Graph Benchmark Repo, the corresponding command is:

    python tools/demo/demo_image.py --config_file sgg_configs/vgattr/vinvl_x152c4.yaml --img_file demo/woman_fish.jpg --save_file output/woman_fish_x152c4.obj.jpg MODEL.WEIGHT pretrained_model/vinvl_vg_x152c4.pth MODEL.ROI_HEADS.NMS_FILTER 1 MODEL.ROI_HEADS.SCORE_THRESH 0.2 TEST.IGNORE_BOX_REGRESSION False

    There is no DATA_DIR argument in the command from the Scene Graph Benchmark Repo. I wonder what the difference is, and why DATA_DIR is introduced?

    opened by ForawardStar 0
  • Provide merged dataset

    In https://github.com/pzzhang/VinVL/blob/main/DOWNLOAD.md#pre-trained-models you explain that you merged COCO with stuff, Visual Genome, Objects365 and Open Images into one dataset. Could you please provide this merged dataset or scripts on how to create this dataset?

    opened by dreamflasher 0
  • MS-COCO 1K Testing Set of Image-Text Retrieval

    Hi! I have a question about the 1K testing set for image-text retrieval. In your dataset there is a file "test_img_keys_1k.tsv". Do you test your model on this 1K testing image set instead of the 5-fold evaluation over the whole 5K testing images?

    opened by LibertFan 0