Oscar and VinVL

Overview

Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks

VinVL: Revisiting Visual Representations in Vision-Language Models

Updates

04/13/2021: Our Scene Graph Benchmark repo has been released. You are welcome to use the code there to extract image features with VinVL pretrained models.
03/08/2021: The Oscar+ pretraining code has been released; please check the last section in VinVL_MODEL_ZOO.md. All image features and model checkpoints in VinVL have also been released; please check VinVL for details.
01/13/2021: Our new work VinVL proposed Oscar+, an improved version of Oscar, and provides a better object-attribute detection model for extracting features for V+L tasks. VinVL achieves SoTA performance on all seven V+L tasks listed here. Please stay tuned for the model and code release.
05/28/2020: Released finetuned models on downstream tasks; please check MODEL_ZOO.md.
05/15/2020: Released pretrained models, datasets, and code for finetuning on downstream tasks.

Introduction

This repository contains the source code necessary to reproduce the results presented in the paper Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks. We propose a new cross-modal pre-training method, Oscar (Object-Semantics Aligned pre-training), which leverages object tags detected in images as anchor points to significantly ease the learning of image-text alignments. We pre-train Oscar on a public corpus of 6.5 million text-image pairs and fine-tune it on downstream tasks, setting new state-of-the-art results on six well-established vision-language understanding and generation tasks. For more on this project, see the Microsoft Research Blog post.
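
For intuition, here is a minimal sketch of Oscar's Word-Tag-Image input triple. This is illustrative only: the actual preprocessing lives in the oscar/run_*.py scripts, and the standard Hugging Face BertTokenizer below stands in for the repo's vendored tokenizer.

    # Minimal sketch of Oscar's Word-Tag-Image input construction.
    import torch
    from transformers import BertTokenizer  # stand-in for the vendored tokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    caption = "a dog sits on a couch"
    object_tags = ["dog", "couch"]          # tags emitted by the object detector
    num_regions = 10
    # 2054-d per region: 2048-d RoI feature + 6-d box encoding (an assumption
    # based on the img_feature_dim=2054 default in the run scripts).
    region_feats = torch.randn(num_regions, 2054)

    # Text side: [CLS] caption [SEP] object tags [SEP]. The tags are words that
    # the detector and the caption share, acting as anchor points for alignment.
    tokens = (["[CLS]"] + tokenizer.tokenize(caption) + ["[SEP]"]
              + tokenizer.tokenize(" ".join(object_tags)) + ["[SEP]"])
    input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

    # Image side: region_feats is linearly projected to the hidden size and
    # concatenated after the text embeddings before the BERT encoder.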

Performance

| Model   | t2i R@1 | t2i R@5 | i2t R@1 | i2t R@5 | IC B@4 | IC M | IC C  | IC S | NoCaps C | NoCaps S | VQA test-std | NLVR2 test-P | GQA test-std |
|---------|---------|---------|---------|---------|--------|------|-------|------|----------|----------|--------------|--------------|--------------|
| SoTA_S  | 39.2    | 68.0    | 56.6    | 84.5    | 38.9   | 29.2 | 129.8 | 22.4 | 61.5     | 9.2      | 70.92        | 58.80        | 63.17        |
| SoTA_B  | 54.0    | 80.8    | 70.0    | 91.1    | 40.5   | 29.7 | 137.6 | 22.8 | 86.58    | 12.38    | 73.67        | 79.30        | -            |
| SoTA_L  | 57.5    | 82.8    | 73.5    | 92.2    | 41.7   | 30.6 | 140.0 | 24.5 | -        | -        | 74.93        | 81.47        | -            |
| Oscar_B | 54.0    | 80.8    | 70.0    | 91.1    | 40.5   | 29.7 | 137.6 | 22.8 | 78.8     | 11.7     | 73.44        | 78.36        | 61.62        |
| Oscar_L | 57.5    | 82.8    | 73.5    | 92.2    | 41.7   | 30.6 | 140.0 | 24.5 | 80.9     | 11.3     | 73.82        | 80.05        | -            |
| VinVL_B | 58.1    | 83.2    | 74.6    | 92.6    | 40.9   | 30.9 | 140.6 | 25.1 | 92.46    | 13.07    | 76.12        | 83.08        | 64.65        |
| VinVL_L | 58.8    | 83.5    | 75.4    | 92.9    | 41.0   | 31.1 | 140.9 | 25.2 | -        | -        | 76.62        | 83.98        | -            |
| gain    | 1.3     | 0.7     | 1.9     | 0.6     | -0.7   | 0.5  | 0.9   | 0.7  | 5.9      | 0.7      | 1.69         | 2.51         | 1.48         |

t2i: text-to-image retrieval; i2t: image-to-text retrieval; IC: image captioning on COCO (B@4: BLEU-4, M: METEOR, C: CIDEr, S: SPICE). SoTA_S, SoTA_B, and SoTA_L denote the previous state of the art achieved with small, base-size, and large-size models, respectively; the gain row is the improvement of the best VinVL model over the best previous SoTA on each metric.

Download

We released pre-trained models, datasets, VinVL image features, and Oscar+ pretraining corpus for downstream tasks. Please check VinVL_DOWNLOAD.md for details.

To download checkpoints for the vanilla Oscar model, please check DOWNLOAD.md for details.

Installation

Check INSTALL.md for installation instructions.

Model Zoo

Check MODEL_ZOO.md for scripts to run Oscar downstream finetuning.

Check VinVL_MODEL_ZOO.md for scripts to run Oscar+ pretraining and downstream finetuning.

Citations

Please consider citing the following papers if you use the code:

@inproceedings{li2020oscar,
  title={Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks},
  author={Li, Xiujun and Yin, Xi and Li, Chunyuan and Hu, Xiaowei and Zhang, Pengchuan and Zhang, Lei and Wang, Lijuan and Hu, Houdong and Dong, Li and Wei, Furu and Choi, Yejin and Gao, Jianfeng},
  booktitle={ECCV},
  year={2020}
}

@inproceedings{zhang2021vinvl,
  title={VinVL: Revisiting Visual Representations in Vision-Language Models},
  author={Zhang, Pengchuan and Li, Xiujun and Hu, Xiaowei and Yang, Jianwei and Zhang, Lei and Wang, Lijuan and Choi, Yejin and Gao, Jianfeng},
  booktitle={CVPR},
  year={2021}
}

License

Oscar is released under the MIT license. See LICENSE for details.

Comments
  • ModuleNotFoundError: No module named 'transformers.pytorch_transformers'

    Hi, thanks for your work.

    I'm trying to finetune for the image captioning task. When I run

    python oscar/run_captioning.py \
        --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997 \
        --do_train \
        --do_lower_case \
        --evaluate_during_training \
        --add_od_labels \
        --learning_rate 0.00003 \
        --per_gpu_train_batch_size 64 \
        --num_train_epochs 30 \
        --save_steps 5000 \
        --output_dir output/
    

    I encounter this error:

    2021-02-04 06:41:10.151589: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib:/usr/local/cuda/lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
    2021-02-04 06:41:10.151621: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        from transformers.pytorch_transformers.modeling_utils import PreTrainedModel
    ModuleNotFoundError: No module named 'transformers.pytorch_transformers'
    

    I clone this repo with cmd

    git clone https://github.com/microsoft/Oscar.git
    git submodule init
    git submodule update
    

    How can I fix this issue?
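
    A likely cause (an assumption based on the import path in the traceback, not an official diagnosis): a pip-installed transformers package is shadowing the repo's vendored transformers submodule, which is the one that provides transformers.pytorch_transformers. A quick way to check which package Python actually resolves, assuming you run from the Oscar repo root with the submodule checked out:

        # Hypothetical diagnostic: see which `transformers` Python resolves.
        import os, sys
        sys.path.insert(0, os.getcwd())  # assume we're at the Oscar repo root

        import transformers
        print(transformers.__file__)  # should point inside the Oscar checkout,
                                      # not into site-packages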

    opened by NguyenVanThanhHust 12
  • Faster RCNN model version and Object Tag Sequences

    opened by xiaoleihuang 10
  • Installation Failure

    Failing to clone the repo and its submodules; please help.

    $ git clone --recursive git@github.com:microsoft/Oscar.git
    Cloning into 'Oscar'...
    git@github.com: Permission denied (publickey).
    fatal: Could not read from remote repository.
    
    Please make sure you have the correct access rights
    and the repository exists.
    

    Additionally, when cloning over HTTPS ("https://github.com/microsoft/Oscar.git"), the submodules fail to install, giving the same error:

    git@github.com: Permission denied (publickey).
    fatal: Could not read from remote repository.
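
    For what it's worth, a standard git workaround (a general git technique, not something from this repo's docs) is to force HTTPS in place of SSH before updating the submodules: git config --global url."https://github.com/".insteadOf "git@github.com:". After that, running git submodule update --init --recursive should fetch the submodules over HTTPS.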
    
    opened by tjdevWorks 7
  • Coco caption pertained model output results are not good

    Hello,

    Thank you for your great work!

    I used your pretrained model for COCO image captioning. Here is the command I used:
    
    python oscar/run_captioning.py \
        --do_test \
        --do_eval \
        --test_yaml test.yaml \
        --per_gpu_eval_batch_size 64 \
        --num_beams 5 \
        --max_gen_length 20 \
        --eval_model_dir image_caption/Oscarrepo/Oscar/checkpoint-29-132780/
    

    where checkpoint-29-132780 is the uncompressed pretrained COCO model folder. But the outputs are not good. Here are some examples:

    caption claire libraries libraries libraries libraries libraries robbery libraries libraries libraries libraries libraries libraries libraries librariesletsletslets
    caption demanded adoptedrredrred libraries libraries libraries libraries librariessteadsteadsteadsteadsteadstead libraries libraries libraries
    caption typing curvature curvature libraries curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature curvature

    Am I missing some important steps? Thank you for your help! Also, where is test.yaml? Thanks.

    opened by joey-wang123 7
  • Why are the number of labels and the number of image feature regions unequal in the CaptionTensorizer Coco-caption

    https://github.com/microsoft/Oscar/blob/a9013bb7dda35a63856d1cebd16eeeeb73615e5c/oscar/run_captioning.py#L195 Hello, could you please explain why the number of labels (text_b) is not equal to the number of image feature regions? It seems a little odd from my point of view.

    opened by ZuoJiaxing 7
  • How to create train_caption.json on Flickr8k dataset? [Image Captioning task]

    Hello everyone! I want to run Oscar on Flickr8k. I've already created all the other files (feature.lineidx, label.lineidx, feature.tsv, label.tsv, ...), but I don't know how to create train_caption.json from the Flickr8k captioning annotations: the COCO train_caption.json uses the attributes image_id, id, and caption, while the Flickr8k annotations use image_name and caption. Does anyone know how to do it? Please help me! Thanks a lot!
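
    One plausible way to build it (a hypothetical sketch: the input file name and its record layout are assumptions inferred from the question above) is to assign integer ids and emit COCO-style records; note that the image_id values must line up with the keys used in your feature.tsv/label.tsv:

        # Hypothetical Flickr8k -> COCO-style train_caption.json converter.
        # Assumes a JSON list of {"image_name": ..., "caption": ...} records.
        import json

        with open("flickr8k_captions.json") as fp:   # assumed input file name
            flickr = json.load(fp)

        image_ids = {}   # image_name -> integer image_id
        records = []
        for ann_id, item in enumerate(flickr):
            img_id = image_ids.setdefault(item["image_name"], len(image_ids))
            records.append({"image_id": img_id, "id": ann_id,
                            "caption": item["caption"]})

        with open("train_caption.json", "w") as fp:
            json.dump(records, fp)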

    opened by hasontung1999 4
  • Extracted feature for VQA test-dev set

    Thank you for making this excellent work public! I hope to reproduce your result on the VQA task, but a problem occurred with the dataset. I downloaded the VQA dataset following this instruction: https://github.com/microsoft/Oscar/blob/master/DOWNLOAD.md#datasets, but I didn't find the Faster R-CNN image features for test-dev. I'm not sure whether something went wrong during my download or whether this part just wasn't provided. If it isn't possible to share the Faster R-CNN features for test-dev, could you please provide some code and basic information on how to extract the features myself, so I can reproduce the work correctly? For example:

    • Which version of Faster R-CNN were the features extracted with?
    • What is the correct structure for saving these features (i.e., for each image, how are all the RoI features and locations organized and bound to the image id or question id)? See the sketch after this list.

    Thank you very much for your kind help!
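
    On the second point, here is a rough sketch of the per-region layout the Oscar code appears to consume. This is an inference from the img_feature_dim=2054 default in the run scripts (a 2048-d RoI feature plus a 6-d box encoding); treat the exact geometry encoding as an assumption:

        # Sketch of packing one region's feature into the 2054-d layout that
        # img_feature_dim=2054 suggests: 2048-d RoI feature + 6-d box geometry.
        import numpy as np

        def encode_region(roi_feat, box, img_w, img_h):
            """roi_feat: (2048,) detector output; box: (x1, y1, x2, y2) in pixels."""
            x1, y1, x2, y2 = box
            geom = np.array([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                             (x2 - x1) / img_w, (y2 - y1) / img_h])
            return np.concatenate([roi_feat, geom])   # shape (2054,)

        # All regions of one image are stacked into a (num_boxes, 2054) array
        # and stored per image id (base64-encoded in the released .tsv files).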

    opened by weiyx16 4
  • Pre-training for image captioning

    Hello, and congrats on your brilliant work! I'd like to ask: for image captioning, you mention in the appendix:

    we directly fine-tune Oscar for image captioning on COCO without additional pre-training on Conceptual Captions

    Does that mean you only use the COCO dataset for pretraining, and not the rest (SBU, Flickr, GQA)? And is the CIDEr score of 1.4 achieved after fine-tuning the COCO-only pretrained model?

    opened by fawazsammani 4
  • TypeError: cannot serialize '_io.TextIOWrapper' object

    Hi, when executing the run_captioning command I get this error:

    ForkingPickler(file, protocol).dump(obj)
    TypeError: cannot serialize '_io.TextIOWrapper' object

    I also report the complete log here:

    python oscar/run_captioning.py --do_test --do_eval --test_yaml vinvl_demo_images_features/inference_test/test.yaml --per_gpu_eval_batch_size 64 --num_beams 5 --max_gen_length 20 --eval_model_dir vinvl_demo_images_features/coco_captioning_large_scst/checkpoint-4-50000

    2021-10-03 18:34:06,058 vlpretrain WARNING: Device: cuda, n_gpu: 1
    2021-10-03 18:34:06,063 vlpretrain WARNING: Override max_seq_length to 50 = max_gen_length:20 + od_labels_len:30
    2021-10-03 18:34:06,064 vlpretrain WARNING: Override do_lower_case with train args: False -> True
    2021-10-03 18:34:06,070 vlpretrain WARNING: Override add_od_labels with train args: False -> True
    2021-10-03 18:34:06,101 vlpretrain INFO: Evaluate the following checkpoint: vinvl_demo_images_features/coco_captioning_large_scst/checkpoint-4-50000
    2021-10-03 18:34:17,930 vlpretrain INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, add_od_labels=True, cider_cached_tokens='coco-train-words.p', config_name='', data_dir='datasets/coco_caption', device='cpu', distributed=False, do_eval=True, do_lower_case=True, do_test=True, do_train=False, drop_out=0.1, drop_worst_after=0, drop_worst_ratio=0, eval_model_dir='vinvl_demo_images_features/coco_captioning_large_scst/checkpoint-4-50000', evaluate_during_training=False, freeze_embedding=False, gradient_accumulation_steps=1, img_feature_dim=2054, img_feature_type='frcnn', label_smoothing=0, learning_rate=3e-05, length_penalty=1, local_rank=0, logging_steps=20, loss_type='sfmx', mask_prob=0.15, max_gen_length=20, max_grad_norm=1.0, max_img_seq_length=50, max_masked_tokens=3, max_seq_a_length=40, max_seq_length=50, max_steps=-1, min_constraints_to_satisfy=2, model_name_or_path=None, no_cuda=True, num_beams=5, num_gpus=1, num_keep_best=1, num_labels=2, num_return_sequences=1, num_train_epochs=40, num_workers=4, output_dir='output/', output_hidden_states=False, output_mode='classification', per_gpu_eval_batch_size=64, per_gpu_train_batch_size=64, repetition_penalty=1, save_steps=-1, sc_baseline_type='greedy', sc_beam_size=1, sc_train_sample_n=5, scheduler='linear', scst=False, seed=88, temperature=1, test_yaml='vinvl_demo_images_features/inference_test/test.yaml', tie_weights=False, tokenizer_name='', top_k=0, top_p=1, train_yaml='train.yaml', use_cbs=False, val_yaml='val.yaml', warmup_steps=0, weight_decay=0.05)
    2021-10-03 18:34:17,933 vlpretrain INFO: Evaluate on dataset: vinvl_demo_images_features/inference_test/test.yaml
    c:\users\gabriele.ferrario\onedrive\desktop\tesi\vinvl\oscar\oscar\oscar\utils\misc.py:34: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
      return yaml.load(fp)
    predict_file: vinvl_demo_images_features/coco_captioning_large_scst/checkpoint-4-50000\pred.coco_caption.test.beam5.max20.odlabels.tsv
    values: <generator object test.<locals>.gen_rows at 0x000001F47BC3DD48>
    test_dataloader: <torch.utils.data.dataloader.DataLoader object at 0x000001F47B421448>
    C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\site-packages\torch\cuda\__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at ..\c10\cuda\CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    Traceback (most recent call last):
      File "oscar/run_captioning.py", line 1018, in <module>
        main()
      File "oscar/run_captioning.py", line 1014, in main
        checkpoint)
      File "oscar/run_captioning.py", line 621, in evaluate
        test(args, val_dataloader, model, tokenizer, predict_file)
      File "oscar/run_captioning.py", line 715, in test
        tsv_writer(gen_rows(), cache_file)
      File "c:\users\gabriele.ferrario\onedrive\desktop\tesi\vinvl\oscar\oscar\oscar\utils\tsv_file_ops.py", line 18, in tsv_writer
        for value in values:
      File "oscar/run_captioning.py", line 681, in gen_rows
        for step, (img_keys, batch) in tqdm(enumerate(test_dataloader)):
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\site-packages\torch\utils\data\dataloader.py", line 352, in __iter__
        return self._get_iterator()
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\site-packages\torch\utils\data\dataloader.py", line 294, in _get_iterator
        return _MultiProcessingDataLoaderIter(self)
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\site-packages\torch\utils\data\dataloader.py", line 801, in __init__
        w.start()
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\multiprocessing\process.py", line 112, in start
        self._popen = self._Popen(self)
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\multiprocessing\context.py", line 223, in _Popen
        return _default_context.get_context().Process._Popen(process_obj)
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\multiprocessing\context.py", line 322, in _Popen
        return Popen(process_obj)
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\multiprocessing\popen_spawn_win32.py", line 89, in __init__
        reduction.dump(process_obj, to_child)
      File "C:\Users\gabriele.ferrario\.conda\envs\sg_benchmark\lib\multiprocessing\reduction.py", line 60, in dump
        ForkingPickler(file, protocol).dump(obj)
    TypeError: cannot serialize '_io.TextIOWrapper' object

    Does anyone have any suggestions? Thank you!
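
    A plausible workaround (my reading of the traceback, not a confirmed fix): on Windows, multiprocessing uses spawn, which pickles the dataset object for each dataloader worker, and the captioning dataset holds an open file handle (the _io.TextIOWrapper) that cannot be pickled. Rerunning with --num_workers 0 (the flag is visible in the Namespace dump above) keeps data loading in the main process and sidesteps the pickling entirely; alternatively, the dataset could open its TSV files lazily inside __getitem__ instead of __init__.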

    opened by GabrieleFerrario 3
  • Cannot find eval caption index file when testing image/text retrieval ..

    First of all, many thanks for sharing the VinVL model. However, when I used the pre-extracted Flickr30k features to finetune Oscar+ base for image/text retrieval evaluation, I found I was missing --eval_caption_index_file minival_caption_indexs_top20.pt. Would you mind sharing the download link? Where can I get minival_caption_indexs_top20.pt? Thanks!

    opened by byougert 3
  • Cannot find image_label for pre-training

    I followed https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md#pre-exacted-image-features to prepare image features, and followed https://github.com/microsoft/Oscar/blob/master/VinVL_MODEL_ZOO.md#oscarplus-pretraining for pre-training. But I cannot find the image labels for the pre-training datasets, e.g. COCO, Flickr30k, GQA.

    As shown in https://biglmdiag.blob.core.windows.net/vinvl/pretrain_corpus/coco_flickr30k_googlecc_gqa_sbu_oi_x152c4big2exp168.yaml, we need to prepare image_label_path:

        corpus: coco_flickr30k_gqa_googlecc_sbu_oi
        corpus_file: coco_flickr30k_googlecc_gqa_sbu_oi.tsv
        image_label_path:
          coco: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/coco
          flickr30k: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/flickr30k
          gqa: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/gqa
          googlecc: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/googlecc
          sbu: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/sbu
          oi: X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/oi
        image_feature_path:
          coco: vinvl/image_features/coco_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
          flickr30k: vinvl/image_features/flickr30k_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
          gqa: vinvl/image_features/gqa_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
          googlecc: vinvl/image_features/googlecc_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
          sbu: vinvl/image_features/sbu_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000
          oi: vinvl/image_features/oi_X152C4_frcnnbig2_exp168model_0060000model.roi_heads.nm_filter_2_model.roi_heads.score_thresh_0.2/model_0060000

    To be specific, there is no guidance on how to download or generate the predictions_gt.tsv and QA_fileB.tsv files, which are needed for pre-training in https://github.com/microsoft/Oscar/blob/master/oscar/datasets/oscar_tsv.py#L383-L385.

    opened by yikaiw 3
  • Vocabulary of the test split

    Hi! Thanks for the paper and the available code.

    I have what may be a stupid question, but I didn't find a straight answer to it anywhere:

    When evaluating the model with the Karpathy test split, some words might not be present in the vocabulary from the train split. What do you do? Simply remove these words from the captions of the test split?
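
    For context (my reading of the released code, not an authoritative answer): the captioning model decodes over the BERT WordPiece vocabulary, so a test-split word that never appeared in the training captions can still be represented as a sequence of subword pieces; the metrics are then computed on the detokenized strings against the raw reference captions, with no words removed.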

    opened by gondimjoaom 0
  • The specified resource does not exist.

    When I run wget https://biglmdiag.blob.core.windows.net/oscar/pretrained_models/large-vg-labels.zip, it returns "The specified resource does not exist."

    opened by victorup 4
  • Can you share the full NoCaps results on the test data?

    The Oscar paper reports CIDEr scores of 78.8 and 80.9 for Oscar base and large, respectively. Since it isn't clarified, I assume these are scores on the NoCaps test split, for NoCaps-entire. Can you share the scores for the in-, near-, and out-of-domain subsplits? And can you confirm whether the 78.8/80.9 scores are on the test data? Thanks!

    opened by YovaKem 1
  • VinVL features for datasets not available

    Hi there,

    Thanks a lot for your code release. I noticed that the VinVL features are no longer available: https://github.com/microsoft/Oscar/blob/master/VinVL_DOWNLOAD.md#pre-exacted-image-features

    Could you please advise?

    opened by aleSuglia 1