OMNIVORE is a single vision model for many different visual modalities

Overview

Omnivore: A Single Model for Many Visual Modalities

[paper][website]

OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE’s shared visual representation, we successfully identify nearest neighbors of a query image (ImageNet-1K validation set) in vision datasets that contain depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.

Usage

Setup and Installation

conda create --name omnivore python=3.8
conda activate omnivore
conda install pytorch=1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch -c nvidia
conda install -c conda-forge -c pytorch -c defaults apex
conda install pytorchvideo

To run the notebook you may also need to install the following:

conda install jupyter nb_conda ipykernel
python -m ipykernel install --user --name omnivore
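
To quickly verify the environment, a short sanity check (not part of the repo; it only confirms the key packages import and CUDA is visible) can be run in Python:

# Environment sanity check: confirms the installed packages import and reports versions.
import torch, torchvision, pytorchvideo  # noqa: F401  (pytorchvideo is an import-only check)

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())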

Run Inference

Follow the inference_tutorial.ipynb tutorial locally, or open it in Colab, for step-by-step instructions on running inference on an image, video, or RGBD image.
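
For quick reference, below is a minimal image-inference sketch along the lines of the tutorial. The file name example.jpg is a placeholder, and the input_type keyword plus the B x C x T x H x W input layout are taken from the tutorial; treat this as a sketch rather than a drop-in script.

import torch
from PIL import Image
from torchvision import transforms as T

# Load a pretrained Omnivore model from Torch Hub (see the Torch Hub section below).
model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
model = model.eval()

transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = transform(Image.open("example.jpg").convert("RGB"))  # C x H x W
image = image[None, :, None, ...]                            # B x C x T x H x W with T=1

with torch.no_grad():
    prediction = model(image, input_type="image")            # logits over the ImageNet-1K classes
    top5 = prediction.topk(k=5).indices[0].tolist()
print(top5)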

Model Zoo

Name                     | IN1k Top-1 (%) | Kinetics400 Top-1 (%) | SUN RGBD Top-1 (%) | Model
Omnivore Swin T          | 81.2           | 78.9                  | 62.3               | weights
Omnivore Swin S          | 83.4           | 82.2                  | 64.6               | weights
Omnivore Swin B          | 84.0           | 83.3                  | 65.4               | weights
Omnivore Swin B (IN21k)  | 85.3           | 84.0                  | 67.2               | weights
Omnivore Swin L (IN21k)  | 86.0           | 84.1                  | 67.1               | weights

Numbers are based on Table 2 and Table 4 in the Omnivore paper.

Torch Hub

Models can be loaded via Torch Hub, e.g.:

import torch

model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
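
The available entrypoint names (matching the Model Zoo above) can be queried from the repository's hubconf, e.g.:

import torch

# List the model entrypoints exposed by the repo's hubconf.py
print(torch.hub.list("facebookresearch/omnivore"))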

The class mappings for the datasets can be downloaded as follows:

wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json 
wget https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json 
wget https://dl.fbaipublicfiles.com/omnivore/sunrgbd_classnames.json
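
As a small sketch of how the mappings can be used (assuming imagenet_class_index.json keeps its standard layout of "index" -> [wordnet_id, class_name]), predicted indices can be turned into labels like this:

import json

# imagenet_class_index.json maps "0" -> ["n01440764", "tench"], "1" -> [...], etc.
with open("imagenet_class_index.json") as f:
    imagenet_id_to_name = {int(idx): name for idx, (_, name) in json.load(f).items()}

top5 = [281, 282, 285, 287, 728]  # example indices, e.g. from the inference sketch above
print([imagenet_id_to_name[i] for i in top5])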

Citation

If this work is helpful in your research, please consider starring the repository and citing:

@article{girdhar2022omnivore,
  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},
  year={2022}
}

Contributing

We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.

License

Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However, the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).

Comments
  • Extract features about EPIC100

    Hi, thanks for your work. I have a question about extracting features: could you please tell me how to extract features using Omnivore on EPIC-100?

    opened by EdenGabriel 8
  • About use_seg

    I note a forward_seg function in the class BasicLayer() of swin_transformer_3d.

    What is this function used for? Should I switch from the forward function to the forward_seg function when fine-tuning for the semantic segmentation task?

    Thanks!

    opened by ZhangYuanhan-AI 8
  • urllib.error.HTTPError: HTTP Error 403: Forbidden

    I have been getting urllib.error.HTTPError: HTTP Error 403: Forbidden

    using

    model = torch.hub.load("facebookresearch/omnivore:main", model=model_name)

    opened by AK391 8
  • Unable to load the model

    I tried running the inference notebook. It breaks at model = torch.hub.load("facebookresearch/omnivore:main", model=model_name) complaining:

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/hub/facebookresearch_omnivore_main/hubconf.py'

    I am using Google Colab and the torch version is: 1.11.0+cu113. I have tried with the following too but it doesn't help:

    model_name = "omnivore_swinB"
    model = torch.hub.load("facebookresearch/omnivore", model=model_name)
    

    @imisra @mannatsingh

    opened by sayakpaul 4
  • Get poor accuracy on epic-kitchen 100.

    Dear authors, I used the given pretrained model to run inference on the EPIC-Kitchens validation set; however, I get a very poor result: {"action_top1_acc": "35.19", "action_top5_acc": "57.09"}. Could you please tell me where I might have made a mistake?

    opened by realgump 4
  • Add Docker environment & web demo

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.com/facebookresearch/omnivore. We enable selecting different models for inference, and you can find the docker file under the tab ‘run model with docker’.

    We have added some examples to the page, but please do claim the page so you can own it, customise the example gallery as you like, and push any future updates to the web demo; we'll also feature it on our website and tweet about it.

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. We got frustrated that we couldn't run all the really interesting ML work being done. So, we're going round implementing models we like. 😊

    CLA Signed 
    opened by chenxwh 4
  • code to convert depth to disparity for SUN RGB-D

    Hi

    I'm curious whether the SUN RGB-D data with disparity can be released, or whether the code to convert depth to disparity for SUN RGB-D can be released.

    Thanks

    opened by liyunsheng13 3
  • Tab 3 and Tab 7 NYUv2 mIoU

    Hello,

    Thanks for the amazing work. I am curious about the performance differences between Tab 3 and Tab 7 on NYUv2 segmentation.

    Can you confirm that the performance difference is due to the different datasets used in pre-training? In Tab 3, is Omnivore Swin-B pre-trained on IN1K? In Tab 7, is Omnivore Swin-B pre-trained on IN21K, IN1K, K400, and SUN?

    Thanks

    opened by Zongwei97 3
  • SUN RGB-D 19 scene classification labels

    Hi,

    Thanks for the really cool work, and for sharing the repository! I was wondering if you could provide more details on obtaining the 19 scene classification labels for the SUN RGB-D dataset. When I downloaded the data directly from https://rgbd.cs.princeton.edu/ (SUN RGBD V1) and looked in the scene.txt file for each image, it seemed like there were more than 19 scene labels (I saw 44 different ones).

    Thanks!

    opened by mjkleinman 2
  • loss function

    Hello authors, I am very interested in your code! Could you publish the code of the loss function used during training? I would also like to know whether the labels of the three datasets are constrained separately or together, i.e. whether training computes three classification tasks or a single one.

    opened by 184446223 2
  • Is "pip install pytorchvideo" or "pip install torchvideo"?

    I got an error saying:

    ERROR: Could not find a version that satisfies the requirement pytorchvideo (from versions: none)
    ERROR: No matching distribution found for pytorchvideo

    opened by dspcad 2
  • replicating action recognition accuracy for EPIC-KITCHENS-100

    Hello Omnivore!

    Thank you very much for everything you have done and provided through this repository.

    I'm interested in building upon your work for egocentric action recognition, and my first step was to replicate the action recognition results quoted in your paper for the EK100 dataset. If I understand correctly, using the provided omnivore_swinB_epic checkpoint, I should be able to obtain an action recognition accuracy close to 50% on the validation subset of the data. However, when I tried, I only got an accuracy of 35.91% (for top-1 action).

    This is a similar problem to the one listed in #20. However, that issue was closed without a clear resolution. I hope you can help me find what I might be missing. Here are more details about what I am doing:

    Loading the data

    I adapted the EPIC-Kitchens data loader found here to do the following (sketched in code after the list below):

    For each action listed in the Epic-Kitchens validation csv:

    • find the center frame
    • load 32 sequential frames (separated with a stride of 2), starting at the center frame. The resulting shape is [1, 3, 256, 456]
    • divide all loaded frame values by 255
    • subtract the omnivore mean = [0.485, 0.456, 0.406] from each channel
    • divide by the omnivore std = [0.229, 0.224, 0.225]
    • crop the center [256x256] pixels in the image to yield the shape [1,3,256,256]
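
    Roughly, the preprocessing looks like this sketch (frame loading is omitted, and the helper below is hypothetical; shapes follow the list above):

    import torch

    MEAN = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1)
    STD = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1)

    def preprocess(frames):
        # frames: uint8 tensor of shape T x 3 x 256 x 456 (32 frames, stride 2, starting at the center frame)
        frames = frames.float() / 255.0                       # scale to [0, 1]
        frames = (frames - MEAN) / STD                        # per-channel normalization
        h, w = frames.shape[-2:]                              # center-crop to 256 x 256
        top, left = (h - 256) // 2, (w - 256) // 2
        frames = frames[..., top:top + 256, left:left + 256]
        return frames.permute(1, 0, 2, 3).unsqueeze(0)        # B x C x T x H x W for the model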

    I'm tracking my work in this repository: https://github.com/iranroman/ego_actrecog_analysis

    Any hints will be greatly appreciated. Thank you very much in advance!

    opened by iranroman 6
  • Extract video features

    Hi. Thank you for your amazing code. May I ask how to use the omnimae model to extract features from videos instead of outputting the keywords? Thank you very much.

    opened by Vincent6896 1
  • Fix inference_tutorial notebook

    Hi.

    • fixed ModuleNotFoundError while running the inference_tutorial.ipynb notebook
    • removed unnecessary imports os, torch.nn.functional, matplotlib.image
    • fixed references to CONTRIBUTING, LICENSE, NOTICE, CODE_OF_CONDUCT inside {omnivore,omnimae}/README.md

    P.S.: during the import from torchvision.transforms._transforms_video import NormalizeVideo a UserWarning is raised, so it's probably better to fix this import in the future.

    P.P.S.: tested in a Colab environment; "Run all cells" completes without any errors.

    CLA Signed 
    opened by amrzv 0
  • Fine-tuning parameter on SUN RGBD and Kinetics400

    Hi.

    Thank you very much for your excellent work and for sharing the repository!

    I was wondering if you could provide more details on the hyperparameters for fine-tuning on SUN RGBD and Kinetics400 (in Table 2).

    I think I would use the ImageNet-1k pre-trained model (ImageSwin) and fine-tune the parameters according to Supplement A, right? Also, is the performance of the Omnivore model in Table 2 obtained without using a pre-trained model?

    opened by ryosuke-yamada 2