TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

Overview

TorchMultimodal (Alpha Release)

Introduction

TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. It provides:

  • A repository of modular and composable building blocks (models, fusion layers, loss functions, datasets and utilities).
  • A repository of examples that show how to combine these building blocks with components and common infrastructure from across the PyTorch Ecosystem to replicate state-of-the-art models published in the literature. These examples should serve as baselines for ongoing research in the field, as well as a starting point for future work.

As a first open source example, TorchMultimodal lets researchers train and extend FLAVA.

Installation

TorchMultimodal requires Python >= 3.8. The library can be installed with or without CUDA support.

Building from Source

  1. Create conda environment
    conda create -n torch-multimodal python=<python_version>
    conda activate torch-multimodal
    
  2. Install pytorch, torchvision, and torchtext. See the PyTorch documentation. For now we only support the Linux platform.
    conda install pytorch torchvision torchtext cudatoolkit=11.3 -c pytorch-nightly -c nvidia
    
    # For CPU-only install
    conda install pytorch torchvision torchtext cpuonly -c pytorch-nightly
    
  3. Download and install TorchMultimodal and remaining requirements.
    git clone --recursive https://github.com/facebookresearch/multimodal.git torchmultimodal
    cd torchmultimodal
    
    pip install -e .
    
    For developers please follow the development installation.
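
  4. Optionally verify the installation with a quick import check. This is a minimal sketch that assumes the editable install above succeeded and that the package imports under the name torchmultimodal:
    # Run inside the torch-multimodal conda environment.
    import torch
    import torchmultimodal  # should import without errors after `pip install -e .`
    print(torch.__version__)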

Documentation

The library builds on the following concepts:

  • Architectures: These are general and composable classes that capture the core logic associated with a family of models. In most cases these take modules as inputs instead of flat arguments (see Models below). Examples include the LateFusionArchitecture, FLAVA and CLIPArchitecture. Users should either reuse an existing architecture or contribute a new one (see the sketch after this list). We avoid inheritance as much as possible.

  • Models: These are specific instantiations of a given architecture implemented using builder functions. The builder functions take as input all of the parameters for constructing the modules needed to instantiate the architecture. See cnn_lstm.py for an example.

  • Modules: These are self-contained components that can be stitched together in various ways to build an architecture. See lstm_encoder.py as an example.
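
To make the split between these concepts concrete, here is a minimal, illustrative sketch. The class and builder names below are stand-ins, not the library's actual classes or signatures; real architectures such as LateFusionArchitecture have richer interfaces.

    import torch
    from torch import nn

    # "Architecture": general and composable; takes modules as inputs, not flat arguments.
    class ToyLateFusion(nn.Module):
        def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module, fusion: nn.Module):
            super().__init__()
            self.image_encoder = image_encoder
            self.text_encoder = text_encoder
            self.fusion = fusion

        def forward(self, image: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
            fused = torch.cat([self.image_encoder(image), self.text_encoder(text)], dim=-1)
            return self.fusion(fused)

    # "Model": a builder function that takes flat parameters, constructs the modules,
    # and instantiates the architecture (in the spirit of cnn_lstm.py).
    def toy_late_fusion_classifier(image_dim: int, text_dim: int, hidden_dim: int, num_classes: int) -> ToyLateFusion:
        return ToyLateFusion(
            image_encoder=nn.Linear(image_dim, hidden_dim),  # stand-in "module"
            text_encoder=nn.Linear(text_dim, hidden_dim),    # stand-in "module"
            fusion=nn.Sequential(nn.ReLU(), nn.Linear(2 * hidden_dim, num_classes)),
        )

    model = toy_late_fusion_classifier(image_dim=512, text_dim=300, hidden_dim=256, num_classes=2)
    logits = model(torch.randn(4, 512), torch.randn(4, 300))  # shape: (4, 2)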

Contributing

See the CONTRIBUTING file for how to help out.

License

TorchMultimodal is BSD licensed, as found in the LICENSE file.

Comments
  • [MDETR] Phrase grounding evaluation

    [MDETR] Phrase grounding evaluation

    Stack from ghstack (oldest at bottom):

    • -> #110

    This PR adds support for the MDETR phrase grounding evaluation task. For now we use a main training loop (so no Lightning trainer or module) with a simple Lightning data module. We also add evaluator classes, dataset classes, transforms, and various utils as needed. Checkpoint loading utils will be removed after our MDETR checkpoint is on AWS.
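
    As a rough illustration of that setup (names below are hypothetical, not this PR's actual classes), evaluation is driven by a plain loop over a simple Lightning data module rather than a Lightning Trainer:

    import pytorch_lightning as pl
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    class ToyGroundingDataModule(pl.LightningDataModule):  # hypothetical, with stand-in data
        def setup(self, stage=None):
            self.val_dataset = TensorDataset(torch.randn(16, 3, 224, 224))

        def val_dataloader(self):
            return DataLoader(self.val_dataset, batch_size=4)

    datamodule = ToyGroundingDataModule()
    datamodule.setup()
    for (images,) in datamodule.val_dataloader():  # plain eval loop, no pl.Trainer
        pass  # run the model forward and feed its outputs to the evaluator here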

    Test plan:

    python -m torch.distributed.launch --nproc_per_node=2 --use_env phrase_grounding.py \
        --resume /data/home/ebs/data/mdetr/pretrained_resnet101_checkpoint.pth?download=1 \
        --ema --eval \
        --dataset_config /data/home/ebs/torchmultimodal/examples/mdetr/phrase_grounding.json

    Test: Total time: 0:02:39 (0.1280 s / it)
    Averaged stats:
    +-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    | Recall@k    | all                | animals            | bodyparts          | clothing           | instruments        | other              | people             | scene              | vehicles           |
    +-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
    | Recall@1    | 0.8228365551167464 | 0.9292543021032504 | 0.6229205175600739 | 0.884796573875803  | 0.8258064516129032 | 0.6866626065773447 | 0.8890418028556684 | 0.79191128506197   | 0.8550295857988166 |
    | Recall@5    | 0.9283586226009839 | 0.97131931166348   | 0.8207024029574861 | 0.9601713062098501 | 0.9419354838709677 | 0.8590133982947625 | 0.9666265267503871 | 0.9021526418786693 | 0.9349112426035503 |
    | Recall@10   | 0.9482436083974226 | 0.9770554493307839 | 0.8576709796672828 | 0.9738758029978587 | 0.9548387096774194 | 0.8946406820950061 | 0.9778083605711336 | 0.9315068493150684 | 0.9497041420118343 |
    | Upper_bound | 0.9852421533984619 | 0.9980879541108987 | 0.9297597042513863 | 0.9944325481798715 | 0.9806451612903225 | 0.9658952496954933 | 0.9958713228969551 | 0.9863013698630136 | 0.9940828402366864 |
    +-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

    These results match what we see when running the same command from the MDETR repo.

    Differential Revision: D37390040

    CLA Signed 
    opened by ebsmothers 13
  • Add Flickr postprocessing transform for phrase grounding

    Add Flickr postprocessing transform for phrase grounding

    Stack from ghstack (oldest at bottom):

    • #110
    • -> #109

    Test plan: Added a unit test under examples/mdetr

    python -m pytest examples/mdetr/test/test_transforms.py
    ======================================= test session starts ========================================
    platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
    rootdir: /data/home/ebs/torchmultimodal
    collected 4 items

    examples/mdetr/test/test_transforms.py .... [100%]

    ======================================== 4 passed in 3.56s =========================================

    Differential Revision: D37390043

    CLA Signed 
    opened by ebsmothers 13
  • Add MDETR transformer and model class

    Add MDETR transformer and model class

    Stack from ghstack (oldest at bottom):

    • #110
    • #109
    • -> #77

    This PR adds the multimodal transformer and main model class for MDETR. Similar to the previous PRs, this is still an initial version. The transformer closely follows the original implementation, but without the intermediate caching of encoder outputs. The model class has been decoupled from the losses and takes in all encoders, transformers, and various embedding or projection modules and returns classification logits and their corresponding bounding boxes in its forward.

    Rather than writing a unit test for the class, I've added a notebook that demonstrates how to load weights from the pretrained model, call forward, and check that the results match.
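
    As a hedged sketch of that decoupling (illustrative code only, not the actual MDETR classes added in this PR), the model just wires injected modules together and returns logits plus boxes, while matching and losses live outside the model:

    from typing import NamedTuple

    import torch
    from torch import nn

    class ToyDetectionOutput(NamedTuple):
        logits: torch.Tensor      # (batch, num_preds, num_classes)
        pred_boxes: torch.Tensor  # (batch, num_preds, 4), normalized to [0, 1]

    class ToyMDETRStyleModel(nn.Module):
        """Takes encoders/transformer/heads as modules; computes no losses itself."""

        def __init__(self, image_encoder, text_encoder, transformer, class_head, bbox_head):
            super().__init__()
            self.image_encoder = image_encoder
            self.text_encoder = text_encoder
            self.transformer = transformer
            self.class_head = class_head
            self.bbox_head = bbox_head

        def forward(self, images, text_tokens):
            memory = torch.cat([self.image_encoder(images), self.text_encoder(text_tokens)], dim=1)
            hs = self.transformer(memory)
            return ToyDetectionOutput(self.class_head(hs), self.bbox_head(hs).sigmoid())

    # Stand-in modules just to show the wiring; losses (e.g. Hungarian matching plus
    # classification/box losses) would consume the returned logits and boxes separately.
    model = ToyMDETRStyleModel(
        image_encoder=nn.Linear(64, 32),
        text_encoder=nn.Linear(16, 32),
        transformer=nn.Identity(),
        class_head=nn.Linear(32, 5),
        bbox_head=nn.Linear(32, 4),
    )
    out = model(torch.randn(2, 10, 64), torch.randn(2, 7, 16))  # logits: (2, 17, 5), boxes: (2, 17, 4)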

    Differential Revision: D37390042

    CLA Signed 
    opened by ebsmothers 11
  • [refactor,flava] Data file into separate files and add requirements

    [refactor,flava] Data file into separate files and add requirements

    Stack from ghstack (oldest at bottom):

    • -> #9

    This PR refactors the data file into multiple modules for better management as the codebase gets more complex.

    Specifically:

    • A datamodules file which hosts all of the datamodules
    • Definitions for HFdatasets and torchvision datasets
    • MultiTasking classes
    • Rest of the utils

    This PR also adds requirements.txt required for running this project.

    Test Plan:

    Tested locally with finetuning

    Differential Revision: D35362848

    CLA Signed 
    opened by apsdehal 11
  • [feat] Add classification fine-tuning utilities

    [feat] Add classification fine-tuning utilities

    Stack from ghstack (oldest at bottom):

    • #10
    • #9
    • -> #8
    • This PR aims at adding starter classification utils to the flava examples.

    As of now the PR adds the following:

    • Finetuning trainer
    • Classification FLAVA
    • TorchVisionDataModule for easy composability of datasets from torchvision
    • Some changes to MLP module for more generalization
    • Some improvements/bug fixes to original FLAVA code
    • Splits the datamodules to better serve their individual concerns.

    TODOs:

    • Add support for the rest of the datasets. This involves leveraging the existing datamodules that we created in this PR along with support for seamlessly plugging in different datasets
    • Add command line overriding on top
    • Add support for retrieval, zero-shot and other downstream tasks in an easily accessible form
    • Expose more things from the model other than just the loss

    Test Plan:

    The code is not in a 100% working state yet. I have tested only the changes in my PR. I expect everything to be stable by the end of the stack.

    Differential Revision: D35361821

    CLA Signed 
    opened by apsdehal 11
  • [MUGEN] Add MultimodalGPT Module

    [MUGEN] Add MultimodalGPT Module

    Stack from ghstack:

    • -> #257
    • #264

    Summary:

    • Defines the model architecture for the full multimodal GPT as the basis for the builder
    • Defines the API for integration with generation utility
    • Added latent_shape to reshape the token ids for decoding back to the real data (see the sketch after this list)
    • Pulled token embedding layers out of MultimodalTransformerDecoder and put them in MultimodalGPT.
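
    A rough illustration of the latent_shape idea (hypothetical shapes, not this PR's exact code): generated token ids come out as a flat sequence and are reshaped to the tokenizer's latent grid before being decoded back to real data.

    import torch

    batch_size, latent_shape = 2, (4, 8, 8)  # e.g. time x height x width of video latents
    flat_token_ids = torch.randint(0, 1024, (batch_size, 4 * 8 * 8))
    grid_token_ids = flat_token_ids.view(batch_size, *latent_shape)
    # grid_token_ids is now shaped for the tokenizer's decode step back to pixels/audio.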

    Test Plan:

    $ python -m pytest --cov=torchmultimodal/models/ test/models/test_gpt.py -vv
    ================================================= test session starts ==================================================
    platform darwin -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/langong/local/miniconda3/envs/t2v/bin/python
    cachedir: .pytest_cache
    rootdir: /Users/langong/gpt_attention, configfile: pyproject.toml
    plugins: mock-3.8.2, cov-3.0.0
    collected 20 items
    
    test/models/test_gpt.py::TestMultimodalGPT::test_tokenizers_missing_methods PASSED       [  3%]
    test/models/test_gpt.py::TestMultimodalGPT::test_encode_invalid_modality PASSED          [  7%]
    test/models/test_gpt.py::TestMultimodalGPT::test_decode_tokens_wrong_shape PASSED        [ 11%]
    test/models/test_gpt.py::TestMultimodalGPT::test_decode_tokens_reshape PASSED            [ 15%]
    test/models/test_gpt.py::TestMultimodalGPT::test_lookup_invalid_modality PASSED          [ 19%]
    test/models/test_gpt.py::TestMultimodalGPT::test_lookup_in_modality PASSED               [ 23%]
    test/models/test_gpt.py::TestMultimodalGPT::test_lookup_out_modality PASSED              [ 26%]
    test/models/test_gpt.py::TestMultimodalGPT::test_fwd_bad_input PASSED                    [ 30%]
    test/models/test_gpt.py::TestMultimodalGPT::test_fwd_for_generation PASSED               [ 34%]
    test/models/test_gpt.py::TestMultimodalGPT::test_forward PASSED                          [ 38%]
    test/models/test_gpt.py::TestMultimodalGPT::test_forward_logits_mask PASSED              [ 42%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_bad_input PASSED         [ 46%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_in_modality PASSED [ 50%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_out_modality PASSED [ 53%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_two_modality PASSED [ 57%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_eval_right_shift_on PASSED [ 61%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_eval_right_shift_off PASSED [ 65%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_bad_pos_ids PASSED       [ 69%]
    test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_optional_pos_ids PASSED  [ 73%]
    test/models/test_gpt.py::TestTransformerDecoder::test_forward PASSED                     [ 76%]
    test/models/test_gpt.py::TestTransformerDecoder::test_forward_additional_output PASSED   [ 80%]
    test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward PASSED                [ 84%]
    test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward_masked PASSED         [ 88%]
    test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward_additional_output PASSED [ 92%]
    test/models/test_gpt.py::test_sigmoid_linear_unit PASSED                                 [ 96%]
    test/models/test_gpt.py::test_right_shift PASSED                                         [100%]
    
    ---------- coverage: platform darwin, python 3.8.13-final-0 ----------
    Name                                            Stmts   Miss  Cover
    -------------------------------------------------------------------
    torchmultimodal/models/gpt.py                     181      4    98%
    
    
    ==== 26 passed in 1.80s =======
    

    Differential Revision: D38642048

    CLA Signed 
    opened by langong347 10
  • [FLAVA]Change some initialization orders and corresponding tests

    [FLAVA]Change some initialization orders and corresponding tests

    • Currently the projections are part of the contrastive loss, which means we need to use "flava for pretraining" for zero shot. This is weird since zero shot should just involve the core model (and not the pretraining model)
    • The next PR in this stack tried to fix it but broke the tests because it changed the initialization order of several components
    • So I am splitting that PR into 2 to make sure my logic changes are not actually breaking anything:
      1. This PR, which simply changes the initialization order of the codebook and contrastive loss and updates the test assert values
      2. The next PR, which makes the projections part of the flava model and doesn't touch the tests

    Test plan: pytest

    Stack from ghstack (oldest at bottom):

    • #195
    • #132
    • #131
    • #106
    • -> #105

    Differential Revision: D37466221

    CLA Signed 
    opened by ankitade 10
  • Generalize CLIPArchitecture

    Generalize CLIPArchitecture

    Summary: Generalize CLIPArchitecture to allow two encoders of any modalities and add a test suite for CLIPArchitecture. Ultimately, the goal is to support multimodal models beyond image/text, like MUGEN, which uses audio/text/video.
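
    A minimal sketch of what "two encoders of any modalities" means in practice (stand-in encoders and names, not the PR's exact implementation): any two modules producing fixed-size embeddings can be paired and trained contrastively.

    import torch
    import torch.nn.functional as F
    from torch import nn

    class ToyTwoTowerArchitecture(nn.Module):
        def __init__(self, encoder_a: nn.Module, encoder_b: nn.Module):
            super().__init__()
            self.encoder_a = encoder_a
            self.encoder_b = encoder_b

        def forward(self, input_a: torch.Tensor, input_b: torch.Tensor):
            # L2-normalized embeddings, ready for a contrastive (InfoNCE-style) loss.
            emb_a = F.normalize(self.encoder_a(input_a), dim=-1)
            emb_b = F.normalize(self.encoder_b(input_b), dim=-1)
            return emb_a, emb_b

    # Audio/text pairing with simple linear layers as stand-in encoders.
    model = ToyTwoTowerArchitecture(nn.Linear(128, 64), nn.Linear(300, 64))
    audio_emb, text_emb = model(torch.randn(8, 128), torch.randn(8, 300))
    similarity = audio_emb @ text_emb.t()  # (8, 8) pairwise cosine similarities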

    Test plan: Run pytest --cov=torchmultimodal/architectures/ test/architectures/test_clip.py::TestCLIPArchitecture -vv to run the unit test included in this PR.

    CLA Signed 
    opened by sophiazhi 10
  • [feat] FLAVA: Zero-Shot validation, support for pretrained models

    [feat] FLAVA: Zero-Shot validation, support for pretrained models

    Stack from ghstack (oldest at bottom):

    • #10
    • #9
    • #8
    • -> #6
    • This PR adds support for ImageNet zero-shot on the FLAVA model.
    • Also adds a mixin to easily support loading pretrained models with a key and torch hub.
    • Currently, the zero-shot evaluations run at the start of validation.
    • Multiple other features and bug fixes.

    Differential Revision: D35232320

    CLA Signed 
    opened by apsdehal 10
  • Can this model be used for duplicate detection from both image and text?

    Can this model be used for duplicate detection from both image and text?

    🚀 The feature, motivation and pitch

    A model for near duplicate detection from both image and text.

    Given two input pairs, each composed of an image and text, determine whether they are semantically duplicate or not.

    inputA = (imageA, textA)
    inputB = (imageB, textB)
    

    Determine whether inputA and inputB are near duplicates or not.

    Alternatives

    No response

    Additional context

    No response

    opened by smith-co 9
  • [FLAVA] Make projections part of the core model

    [FLAVA] Make projections part of the core model

    Move projections from the contrastive loss to the core model. This will allow users to use the model (instead of the pretraining model) for doing zero shot. Also moved to using the translated checkpoint.

    Test plan

    1. pytest
    2. python -m flava.train config=flava/configs/pretraining/debug.yaml
    3. python -m flava.finetune config=flava/configs/finetuning/qnli.yaml

    Stack from ghstack (oldest at bottom):

    • #195
    • #132
    • #131
    • -> #106
    • #105

    Differential Revision: D37481127

    CLA Signed 
    opened by ankitade 9
  • Incremental addition of the new modality

    Incremental addition of the new modality

    🚀 The feature, motivation and pitch

    🤗 Hello! Thank you for your work!

    I see model configurations working with certain modalities in this repo, which is great.

    I have a question though: what if I have a pretrained encoder for another modality (e.g. audio) and data for training (audio-text pairs and audio-image pairs)?

    • How can I train a model which will be able to solve tasks with my new modality?
    • In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?

    Alternatives

    No response

    Additional context

    It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fusion model and train it to solve cross-modal tasks, or add the new modality to an existing model.

    opened by averkij 2
  • ALBEF: Train from scratch

    ALBEF: Train from scratch

    🚀 The feature, motivation and pitch

    Hi, thanks for your great efforts on this excellent work! I want to train ALBEF from scratch, but I can only find the fine-tuning code. In the ALBEF paper, they use a pre-trained ViT, and also use BERT to initialize the weights for the text encoder and the multimodal encoder (except the cross-attention modules). But I didn't find these initializations in this code. Could you please let me know where you do that initialization?

    Many thanks!

    Alternatives

    No response

    Additional context

    No response

    opened by XinhaoMei 2
  • Use CLIP models with pretrained weights

    Use CLIP models with pretrained weights

    Issue description

    Hi, I wanted to ask if it is possible to load openai/clip-vit-base-patch16 weights into the torchmultimodal.models.clip.model.CLIP model provided by the library.

    opened by konradkalita 1
  • Clip model sample training code

    Clip model sample training code

    🚀 The feature, motivation and pitch

    Hello, I wonder if you are going to provide sample training code (like the ones you have in the "/examples" folder) for the CLIP model?

    Alternatives

    No response

    Additional context

    No response

    opened by ShahabMokari 3
  • Image transform results between HF and our version do not line up

    Image transform results between HF and our version do not line up

    Issue description

    Image transform results between HF and our version do not line up

    Code example

    A minimal repro is here: https://colab.research.google.com/drive/1tcghYqhPjy2G1sbkzy2UUbOmbzrQTkG5#scrollTo=wdCanLBZC2w8 (if you look at the last few cells, the text outputs match but the image outputs don't).

    A possible discrepancy is that the HF version has a center crop which is missing in our transform (see the illustrative snippet after the links below):

    • https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/models/flava/feature_extraction_flava.py#L326
    • https://github.com/facebookresearch/multimodal/blob/main/examples/flava/data/transforms.py#L339
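
    For illustration only (torchvision transforms used as stand-ins, not the exact HF or TorchMultimodal pipelines), the effect of the center crop can be reproduced like this:

    import numpy as np
    import torch
    import torchvision.transforms as T
    from PIL import Image

    # HF-style: resize, then center crop; the variant below resizes directly with no crop.
    with_center_crop = T.Compose([T.Resize(224), T.CenterCrop(224), T.ToTensor()])
    without_center_crop = T.Compose([T.Resize((224, 224)), T.ToTensor()])

    img = Image.fromarray(np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8))
    a, b = with_center_crop(img), without_center_crop(img)
    print(a.shape, b.shape, torch.allclose(a, b))  # same (3, 224, 224) shape, but the tensors differ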

    Need eyes from @apsdehal to move forward

    opened by ankitade 1