Overview

Omnivore: A Single Model for Many Visual Modalities


[paper][website]

OMNIVORE is a single vision model for many different visual modalities. It learns to construct representations that are aligned across visual modalities, without requiring training data that specifies correspondences between those modalities. Using OMNIVORE's shared visual representation, we successfully identify the nearest neighbors of a query image (ImageNet-1K validation set) in vision datasets that contain depth maps (ImageNet-1K training set), single-view 3D images (ImageNet-1K training set), and videos (Kinetics-400 validation set).

This repo contains the code to run inference with a pretrained model on an image, video or RGBD image.

Usage

Setup and Installation

conda create --name omnivore python=3.8
conda activate omnivore
conda install pytorch=1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch -c nvidia
conda install -c conda-forge -c pytorch -c defaults apex
conda install pytorchvideo
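
A quick sanity check along these lines (not part of the official setup, just a convenience) confirms that the key packages import and that CUDA is visible:

import torch
import torchvision
import pytorchvideo  # imported only to confirm the install succeeded

print("torch:", torch.__version__)              # expected: 1.9.0
print("torchvision:", torchvision.__version__)  # expected: 0.10.0
print("CUDA available:", torch.cuda.is_available())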

To run the notebook, you may also need to install the following:

conda install jupyter nb_conda ipykernel
python -m ipykernel install --user --name omnivore

Run Inference

Follow the inference_tutorial.ipynb tutorial locally, or open it in Colab, for step-by-step instructions on how to run inference on an image, a video, and an RGBD image.

Model Zoo

Name | IN1K Top-1 (%) | Kinetics-400 Top-1 (%) | SUN RGB-D Top-1 (%) | Model
Omnivore Swin T | 81.2 | 78.9 | 62.3 | weights
Omnivore Swin S | 83.4 | 82.2 | 64.6 | weights
Omnivore Swin B | 84.0 | 83.3 | 65.4 | weights
Omnivore Swin B (IN21k) | 85.3 | 84.0 | 67.2 | weights
Omnivore Swin L (IN21k) | 86.0 | 84.1 | 67.1 | weights

Numbers are based on Tables 2 and 4 of the Omnivore paper.

Torch Hub

Models can be loaded via Torch Hub, e.g.

model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
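
For example, here is a minimal end-to-end sketch of image inference, mirroring the flow in inference_tutorial.ipynb (example.jpg is a placeholder path, and the preprocessing shown is the standard ImageNet recipe rather than copied from the notebook, so check the tutorial for the exact transform):

from PIL import Image
import torch
import torchvision.transforms as T

# Load a pretrained Omnivore model from Torch Hub and switch to eval mode.
model = torch.hub.load("facebookresearch/omnivore", model="omnivore_swinB")
model = model.eval()

# Standard ImageNet-style preprocessing (assumed here; see the notebook for the exact transform).
transform = T.Compose([
    T.Resize(224),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# "example.jpg" is a placeholder; Omnivore treats images as single-frame videos,
# so add batch and time dimensions -> shape [1, 3, 1, H, W].
image = transform(Image.open("example.jpg").convert("RGB"))
image = image[None, :, None, ...]

with torch.no_grad():
    prediction = model(image, input_type="image")  # input_type is "image", "video", or "rgbd"
    top5 = prediction.topk(k=5).indices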

The class mappings for the datasets can be downloaded as follows:

wget https://s3.amazonaws.com/deep-learning-models/image-models/imagenet_class_index.json 
wget https://dl.fbaipublicfiles.com/pyslowfast/dataset/class_names/kinetics_classnames.json 
wget https://dl.fbaipublicfiles.com/omnivore/sunrgbd_classnames.json
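
As an illustration, the ImageNet mapping can turn the top-5 indices from the sketch above into human-readable labels (a minimal sketch; the Kinetics and SUN RGB-D files use their own formats, which the tutorial notebook handles):

import json

# imagenet_class_index.json maps "<index>" -> [wordnet_id, class_name].
with open("imagenet_class_index.json") as f:
    imagenet_classes = {int(idx): name for idx, (_, name) in json.load(f).items()}

# `top5` comes from the image-inference sketch above.
print([imagenet_classes[int(i)] for i in top5[0]])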

Citation

If this work is helpful in your research, please consider starring us and citing:

@article{girdhar2022omnivore,
  title={{Omnivore: A Single Model for Many Visual Modalities}},
  author={Girdhar, Rohit and Singh, Mannat and Ravi, Nikhila and van der Maaten, Laurens and Joulin, Armand and Misra, Ishan},
  journal={arXiv preprint arXiv:2201.08377},
  year={2022}
}

Contributing

We welcome your pull requests! Please see CONTRIBUTING and CODE_OF_CONDUCT for more information.

License

Omnivore is released under the CC-BY-NC 4.0 license. See LICENSE for additional details. However, the Swin Transformer implementation is additionally licensed under the Apache 2.0 license (see NOTICE for additional details).

Issues
  • About use_seg

    I noticed a forward_seg function in the BasicLayer class of swin_transformer_3d.

    What is this function used for? Should I switch from forward to forward_seg when fine-tuning for the semantic segmentation task?

    Thanks!

    opened by ZhangYuanhan-AI 8
  • urllib.error.HTTPError: HTTP Error 403: Forbidden

    I have been getting urllib.error.HTTPError: HTTP Error 403: Forbidden

    using

    model = torch.hub.load("facebookresearch/omnivore:main", model=model_name)

    opened by AK391 8
  • Unable to load the model

    I tried running the inference notebook. It breaks at model = torch.hub.load("facebookresearch/omnivore:main", model=model_name) complaining:

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/hub/facebookresearch_omnivore_main/hubconf.py'

    I am using Google Colab and the torch version is: 1.11.0+cu113. I have tried with the following too but it doesn't help:

    model_name = "omnivore_swinB"
    model = torch.hub.load("facebookresearch/omnivore", model=model_name)
    

    @imisra @mannatsingh

    opened by sayakpaul 4
  • Getting poor accuracy on EPIC-KITCHENS-100

    Dear authors, I used the given pretrained model to run inference on the EPIC-KITCHENS validation set, but I get a very poor result: {"action_top1_acc": "35.19", "action_top5_acc": "57.09"}. Could you please tell me where I made a mistake?

    opened by realgump 4
  • Add Docker environment & web demo

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.com/facebookresearch/omnivore. We enable selecting different models for inference, and you can find the docker file under the tab ‘run model with docker’.

    We have added some examples to the page, but please do claim the page so you can own it, customise the example gallery as you like, and push any future updates to the web demo; we'll feature it on our website and tweet about it too.

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. We got frustrated that we couldn't run all the really interesting ML work being done. So, we're going round implementing models we like. 😊

    CLA Signed 
    opened by chenxwh 4
  • code to convert depth to disparity for SUN RGB-D

    Hi

    I'm curious whether the SUN RGB-D data with disparity can be released, or whether the code to convert depth to disparity for SUN RGB-D can be released.

    Thanks

    opened by liyunsheng13 3
  • Tab 3 and Tab 7 NYUv2 mIoU

    Hello,

    Thanks for the amazing work. I am curious about the performance difference between Table 3 and Table 7 on NYUv2 segmentation.

    Can you confirm that the difference is due to the different datasets used in pre-training? In Table 3, is Omnivore Swin-B pre-trained on IN1K? In Table 7, is it pre-trained on IN21K, IN1K, K400, and SUN RGB-D?

    Thanks

    opened by Zongwei97 3
  • SUN RGB-D 19 scene classification labels

    Hi,

    Thanks for the really cool work, and for sharing the repository! I was wondering if you could provide more details on obtaining the 19 scene classification labels for the SUN RGB-D dataset. When I downloaded the data directly from https://rgbd.cs.princeton.edu/ (SUN RGBD V1) and looked in the scene.txt file for each image, it seemed like there were more than 19 scene labels (I saw 44 different ones).

    Thanks!

    opened by mjkleinman 2
  • loss function

    Hello, I am very interested in your code! Could you publish the code for the loss function used during training? I would also like to know whether the labels of the 3 datasets are constrained separately or jointly, i.e., whether training is treated as 3 classification tasks or a single one.

    opened by 184446223 2
  • Is it "pip install pytorchvideo" or "pip install torchvideo"?

    I got an error saying: ERROR: Could not find a version that satisfies the requirement pytorchvideo (from versions: none) ERROR: No matching distribution found for pytorchvideo

    opened by dspcad 2
  • About the top-1 accuracy on the SUN RGB-D dataset

    Hi, thanks for the great work. I'd like to confirm whether you compute the average accuracy over all samples or the average accuracy over all categories on the SUN RGB-D dataset.

    opened by yangjiangeyjg 1
Owner
Meta Research