Overview

Omnivore: A Single Model for Many Visual Modalities


[paper][website]

OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE's shared visual representation, we successfully identify the nearest neighbors of a query image (ImageNet-1K validation set) in vision datasets that contain depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.

Usage

Setup and Installation

conda create --name omnivore python=3.8
conda activate omnivore
conda install pytorch=1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch -c nvidia
conda install -c conda-forge -c pytorch -c defaults apex
conda install pytorchvideo
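
A quick sanity check along these lines (not part of the official setup, just a convenience) confirms that the key packages import and that CUDA is visible:

import torch
import torchvision
import pytorchvideo  # imported only to confirm the install succeeded

print("torch:", torch.__version__)              # expected: 1.9.0
print("torchvision:", torchvision.__version__)  # expected: 0.10.0
print("CUDA available:", torch.cuda.is_available())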

To run the notebook, you may also need to install the following:

conda install jupyter nb_conda ipykernel
python -m ipykernel install --user --name omnivore

Run Inference

Follow the inference_tutorial.ipynb tutorial locally, or open it in Colab, for step-by-step instructions on how to run inference on an image, a video, and an RGBD image.

Model Zoo

Name | IN1K Top-1 (%) | Kinetics-400 Top-1 (%) | SUN RGB-D Top-1 (%) | Model
Omnivore Swin T | 81.2 | 78.9 | 62.3 | weights
Omnivore Swin S | 83.4 | 82.2 | 64.6 | weights
Omnivore Swin B | 84.0 | 83.3 | 65.4 | weights
Omnivore Swin B (IN21k) | 85.3 | 84.0 | 67.2 | weights
Omnivore Swin L (IN21k) | 86.0 | 84.1 | 67.1 | weights

Numbers are based on Tables 2 and 4 of the Omnivore paper.

Torch Hub

Models can be loaded via Torch Hub, e.g.

model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
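
For example, here is a minimal end-to-end sketch of image inference, mirroring the flow in inference_tutorial.ipynb (example.jpg is a placeholder path, and the preprocessing shown is the standard ImageNet recipe rather than copied from the notebook, so check the tutorial for the exact transform):

from PIL import Image
import torch
import torchvision.transforms as T

# Load a pretrained Omnivore model from Torch Hub and switch to eval mode.
model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
model = model.eval()

# Standard ImageNet-style preprocessing (assumed here; see the notebook for the exact transform).
transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "example.jpg" is a placeholder; Omnivore treats images as single-frame videos,
# so add batch and time dimensions -> shape [1, 3, 1, H, W].
image = transform(Image.open("example.jpg").convert("RGB"))
image = image[None, :, None, ...]

with torch.no_grad():
    prediction = model(image, input_type="image")  # input_type is "image", "video", or "rgbd"
    top5 = prediction.topk(k=5).indices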

The class mappings for the datasets can be downloaded as follows:

wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json 
wget https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json 
wget https://dl.fbaipublicfiles.com/omnivore/sunrgbd_classnames.json
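
As an illustration, the ImageNet mapping can turn the top-5 indices from the sketch above into human-readable labels (a minimal sketch; the Kinetics and SUN RGB-D files use their own formats, which the tutorial notebook handles):

import json

# imagenet_class_index.json maps "<index>" -> [wordnet_id, class_name].
with open("imagenet_class_index.json") as f:
    imagenet_classes = {int(idx): name for idx, (_, name) in json.load(f).items()}

# `top5` comes from the image-inference sketch above.
print([imagenet_classes[int(i)] for i in top5[0]])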

Citation

If this work is helpful in your research, please consider starring us and citing:

@article{girdhar2022omnivore,
  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},
  year={2022}
}

Contributing

We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.

License

Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However, the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).

Issues
  • About use_seg

    I noticed a forward_seg function in the BasicLayer class of swin_transformer_3d.

    What is this function used for? Should I switch from forward to forward_seg when fine-tuning for the semantic segmentation task?

    Thanks!

    opened by ZhangYuanhan-AI 8
  • urllib.error.HTTPError: HTTP Error 403: Forbidden

    I have been getting urllib.error.HTTPError: HTTP Error 403: Forbidden

    using

    model = torch.hub.load("facebookresearch/omnivore:main", model=model_name)

    opened by AK391 8
  • Unable to load the model

    I tried running the inference notebook. It breaks at model = torch.hub.load("facebookresearch/omnivore:main", model=model_name) complaining:

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/hub/facebookresearch_omnivore_main/hubconf.py'

    I am using Google Colab and the torch version is: 1.11.0+cu113. I have tried with the following too but it doesn't help:

    model_name = "omnivore_swinB"
    model = torch.hub.load("facebookresearch/omnivore", model=model_name)
    

    @imisra @mannatsingh

    opened by sayakpaul 4
  • Getting poor accuracy on EPIC-KITCHENS-100

    Dear authors, I used the given pretrained model to run inference on the EPIC-KITCHENS validation set, but I get a very poor result: {"action_top1_acc": "35.19", "action_top5_acc": "57.09"}. Could you please tell me where I made a mistake?

    opened by realgump 4
  • Add Docker environment & web demo

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.com/facebookresearch/omnivore. We enable selecting different models for inference, and you can find the docker file under the tab ‘run model with docker’.

    We have added some examples to the page, but please do claim the page so you can own it, customise the example gallery as you like, and push any future updates to the web demo; we'll feature it on our website and tweet about it too.

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. We got frustrated that we couldn't run all the really interesting ML work being done. So, we're going round implementing models we like. 😊

    CLA Signed 
    opened by chenxwh 4
  • code to convert depth to disparity for SUN RGB-D

    Hi

    I'm curious whether the SUN RGB-D data with disparity can be released, or whether the code to convert depth to disparity for SUN RGB-D can be released.

    Thanks

    opened by liyunsheng13 3
  • Tab 3 and Tab 7 NYUv2 mIoU

    Hello,

    Thanks for the amazing work. I am curious about the performance difference between Table 3 and Table 7 on NYUv2 segmentation.

    Can you confirm that the difference is due to the different datasets used in pre-training? In Table 3, is Omnivore Swin-B pre-trained on IN1K? In Table 7, is it pre-trained on IN21K, IN1K, K400, and SUN RGB-D?

    Thanks

    opened by Zongwei97 3
  • SUN RGB-D 19 scene classification labels

    Hi,

    Thanks for the really cool work, and for sharing the repository! I was wondering if you could provide more details on obtaining the 19 scene classification labels for the SUN RGB-D dataset. When I downloaded the data directly from https://rgbd.cs.princeton.edu/ (SUN RGBD V1) and looked in the scene.txt file for each image, it seemed like there were more than 19 scene labels (I saw 44 different ones).

    Thanks!

    opened by mjkleinman 2
  • loss function

    Hello, I am very interested in your code! Could you publish the code for the loss function used during training? I would also like to know whether the labels of the 3 datasets are constrained separately or jointly, i.e., whether training is treated as 3 classification tasks or a single one.

    opened by 184446223 2
  • Is it "pip install pytorchvideo" or "pip install torchvideo"?

    I got an error saying: ERROR: Could not find a version that satisfies the requirement pytorchvideo (from versions: none) ERROR: No matching distribution found for pytorchvideo

    opened by dspcad 2
  • About the top-1 accuracy on the SUN RGB-D dataset

    Hi, thanks for the great work. I'd like to confirm whether you compute the average accuracy over all samples or the average accuracy over all categories on the SUN RGB-D dataset.

    opened by yangjiangeyjg 1
Owner
Meta Research