TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale.

Meta Research

Last update: Jan 6, 2023

Related tags

Deep Learning multimodal

Overview

TorchMultimodal (Alpha Release)

Introduction

TorchMultimodal is a PyTorch library for training state-of-the-art multimodal multi-task models at scale. It provides:

A repository of modular and composable building blocks (models, fusion layers, loss functions, datasets and utilities).
A repository of examples that show how to combine these building blocks with components and common infrastructure from across the PyTorch Ecosystem to replicate state-of-the-art models published in the literature. These examples should serve as baselines for ongoing research in the field, as well as a starting point for future work.

As a first open source example, researchers will be able to train and extend FLAVA using TorchMultimodal.

Installation

TorchMultimodal requires Python >= 3.8. The library can be installed with or without CUDA support.

Building from Source

Create conda environment

conda create -n torch-multimodal python=<python_version>
conda activate torch-multimodal

Install PyTorch, torchvision, and torchtext (see the PyTorch documentation). Only Linux is supported for now.

conda install pytorch torchvision torchtext cudatoolkit=11.3 -c pytorch-nightly -c nvidia

# For CPU-only install
conda install pytorch torchvision torchtext cpuonly -c pytorch-nightly

Download and install TorchMultimodal and remaining requirements.

git clone --recursive https://github.com/facebookresearch/multimodal.git torchmultimodal
cd torchmultimodal

pip install -e .
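
After installation, a quick sanity check can confirm that the environment matches the requirements stated above. The following is a minimal sketch using only the standard library. The helper name check_environment is hypothetical and is not part of TorchMultimodal, and the import names are assumed to match the package names used in the commands above.

```python
import importlib.util
import sys

# Import names assumed from the installation commands above; the helper
# itself is an illustrative sketch, not part of TorchMultimodal.
REQUIRED = ("torch", "torchvision", "torchtext", "torchmultimodal")


def check_environment(min_python=(3, 8), packages=REQUIRED):
    """Return a dict mapping each requirement to True or False."""
    report = {"python": sys.version_info[:2] >= tuple(min_python)}
    for name in packages:
        # find_spec returns None when a top-level package is not importable.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for requirement, ok in check_environment().items():
        print(f"{requirement}: {'ok' if ok else 'MISSING'}")
```

Any entry reported as MISSING suggests that the corresponding install step did not complete in the active conda environment.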

Developers should follow the development installation instructions.

Documentation

The library builds on the following concepts:

Architectures: These are general and composable classes that capture the core logic associated with a family of models. In most cases these take modules as inputs instead of flat arguments (see Models below). Examples include the LateFusionArchitecture, FLAVA and CLIPArchitecture. Users should either reuse an existing architecture or contribute a new one. We avoid inheritance as much as possible.
Models: These are specific instantiations of a given architecture implemented using builder functions. The builder functions take as input all of the parameters for constructing the modules needed to instantiate the architecture. See cnn_lstm.py for an example.
Modules: These are self-contained components that can be stitched together in various ways to build an architecture. See lstm_encoder.py for an example.
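
The relationship between the three concepts can be sketched without any framework. The example below is a toy illustration under stated assumptions: in the library these components would be PyTorch modules, and the names LateFusion, ConcatFusion, scale_encoder and build_toy_model are hypothetical stand-ins rather than the library's classes. It shows an architecture that receives components instead of flat arguments, and a builder function that turns flat parameters into those components.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Encoder = Callable[[Sequence[float]], Vector]


class ConcatFusion:
    """Module: joins per-modality embeddings in a fixed (sorted) key order."""

    def __call__(self, embeddings: Dict[str, Vector]) -> Vector:
        fused: Vector = []
        for key in sorted(embeddings):
            fused.extend(embeddings[key])
        return fused


class LateFusion:
    """Architecture: receives ready-made components, not flat arguments."""

    def __init__(self, encoders: Dict[str, Encoder], fusion, head):
        self.encoders = encoders
        self.fusion = fusion
        self.head = head

    def __call__(self, inputs: Dict[str, Sequence[float]]) -> Vector:
        # Encode each modality separately, then fuse and apply the head.
        embeddings = {k: self.encoders[k](v) for k, v in inputs.items()}
        return self.head(self.fusion(embeddings))


def scale_encoder(factor: float) -> Encoder:
    """Module: a toy encoder that scales every input feature."""
    return lambda xs: [factor * x for x in xs]


def build_toy_model(image_scale: float, text_scale: float) -> LateFusion:
    """Model: a builder that turns flat parameters into components."""
    return LateFusion(
        encoders={
            "image": scale_encoder(image_scale),
            "text": scale_encoder(text_scale),
        },
        fusion=ConcatFusion(),
        head=lambda fused: [sum(fused)],
    )


model = build_toy_model(image_scale=2.0, text_scale=0.5)
print(model({"image": [1.0, 2.0], "text": [4.0]}))  # [8.0]
```

Because the architecture is assembled by composition, swapping an encoder or the fusion step only requires passing a different component to the builder, with no subclassing involved.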

Contributing

See the CONTRIBUTING file for how to help out.

License

TorchMultimodal is BSD licensed, as found in the LICENSE file.

Comments

[MDETR] Phrase grounding evaluation
Stack from ghstack (oldest at bottom):

-> #110

This PR adds support for the MDETR phrase grounding evaluation task. For now we use a main training loop (so no Lightning trainer or module) with a simple Lightning data module. We also add evaluator classes, dataset classes, transforms, and various utils as needed. Checkpoint loading utils will be removed after our MDETR checkpoint is on AWS.

Test plan:

python -m torch.distributed.launch --nproc_per_node=2 --use_env phrase_grounding.py --resume /data/home/ebs/data/mdetr/pretrained_resnet101_checkpoint.pth?download=1 --ema --eval --dataset_config /data/home/ebs/torchmultimodal/examples/mdetr/phrase_grounding.json

Test: Total time: 0:02:39 (0.1280 s / it)
Averaged stats:
+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Recall@k    | all                | animals            | bodyparts          | clothing           | instruments        | other              | people             | scene              | vehicles           |
+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| Recall@1    | 0.8228365551167464 | 0.9292543021032504 | 0.6229205175600739 | 0.884796573875803  | 0.8258064516129032 | 0.6866626065773447 | 0.8890418028556684 | 0.79191128506197   | 0.8550295857988166 |
| Recall@5    | 0.9283586226009839 | 0.97131931166348   | 0.8207024029574861 | 0.9601713062098501 | 0.9419354838709677 | 0.8590133982947625 | 0.9666265267503871 | 0.9021526418786693 | 0.9349112426035503 |
| Recall@10   | 0.9482436083974226 | 0.9770554493307839 | 0.8576709796672828 | 0.9738758029978587 | 0.9548387096774194 | 0.8946406820950061 | 0.9778083605711336 | 0.9315068493150684 | 0.9497041420118343 |
| Upper_bound | 0.9852421533984619 | 0.9980879541108987 | 0.9297597042513863 | 0.9944325481798715 | 0.9806451612903225 | 0.9658952496954933 | 0.9958713228969551 | 0.9863013698630136 | 0.9940828402366864 |
+-------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

These results match what we see when running the same command from the MDETR repo.
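
For readers unfamiliar with the metric, the Recall@k rows above can be read as the fraction of phrases for which at least one of the k highest-ranked predicted boxes overlaps the ground-truth box sufficiently. The sketch below illustrates that idea and is not the evaluator added in this PR. The function names, the single target box per phrase, and the 0.5 IoU threshold are assumptions made for the example.

```python
from typing import Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_k(
    ranked_predictions: Sequence[Sequence[Box]],
    targets: Sequence[Box],
    k: Optional[int],
    iou_threshold: float = 0.5,  # assumed threshold, not taken from the PR
) -> float:
    """Fraction of phrases with a hit among the k highest-ranked boxes.

    ranked_predictions[i] holds the boxes for phrase i, best first;
    targets[i] is the ground-truth box for the same phrase.
    Passing k=None considers every predicted box.
    """
    if not targets:
        return 0.0
    hits = 0
    for boxes, target in zip(ranked_predictions, targets):
        if any(box_iou(box, target) >= iou_threshold for box in boxes[:k]):
            hits += 1
    return hits / len(targets)


predictions = [
    [(0, 0, 10, 10), (20, 20, 30, 30)],  # phrase 0: best box matches
    [(50, 50, 60, 60), (0, 0, 10, 10)],  # phrase 1: only the second box matches
]
targets = [(0, 0, 10, 10), (1, 1, 10, 10)]
print(recall_at_k(predictions, targets, k=1))  # 0.5
print(recall_at_k(predictions, targets, k=2))  # 1.0
```

Under this reading, the Upper_bound row would correspond to considering every predicted box (k=None in the sketch), although that interpretation is an assumption here.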

Differential Revision: D37390040
CLA Signed
opened by ebsmothers 13
Add Flickr postprocessing transform for phrase grounding
Stack from ghstack (oldest at bottom):

#110

-> #109

Test plan: Added a unit test under examples/mdetr

python -m pytest examples/mdetr/test/test_transforms.py
======================================= test session starts ========================================
platform linux -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0
rootdir: /data/home/ebs/torchmultimodal
collected 4 items

examples/mdetr/test/test_transforms.py .... [100%]

======================================== 4 passed in 3.56s =========================================

Differential Revision: D37390043
CLA Signed
opened by ebsmothers 13
Add MDETR transformer and model class
Stack from ghstack (oldest at bottom):

#110

#109

-> #77

This PR adds the multimodal transformer and main model class for MDETR. Similar to the previous PRs, this is still an initial version. The transformer closely follows the original implementation, but without the intermediate caching of encoder outputs. The model class has been decoupled from the losses and takes in all encoders, transformers, and various embedding or projection modules and returns classification logits and their corresponding bounding boxes in its forward.

Rather than writing a unit test for the class, I've added a notebook that demonstrates how to load weights from the pretrained model, call forward, and check that the results match.

Differential Revision: D37390042
CLA Signed
opened by ebsmothers 11
[refactor,flava] Data file into separate files and add requirements
Stack from ghstack (oldest at bottom):

-> #9

This PR refactors the data file into multiple modules for better management as the codebase gets more complex.

Specifically:

A datamodules file which hosts all of the datamodules

Definitions for HFdatasets and torchvision datasets

MultiTasking classes

Rest of the utils

This PR also adds requirements.txt required for running this project.

Test Plan:

Tested locally with finetuning

Differential Revision: D35362848
CLA Signed
opened by apsdehal 11
[feat] Add classification fine-tuning utilities
Stack from ghstack (oldest at bottom):

#10

#9

-> #8

This PR aims to add starter classification utils to the FLAVA examples.

As of now the PR adds the following:

Finetuning trainer

Classification FLAVA

TorchVisionDataModule for easy composability of datasets from torchvision

Some changes to MLP module for more generalization

Some improvements/bug fixes to original FLAVA code

Splits the datamodules to better service their individual concerns.

TODOs:

Add support for the rest of the datasets. This involves leveraging the existing datamodules that we created in this PR, along with support for seamlessly plugging in different datasets

Add command line overriding on top

Add support for retrieval, zero-shot and other downstream tasks in an easily accessible form

Expose more things from the model other than just the loss

Test Plan:

The code is not yet fully working. I have tested only the changes in this PR. I expect everything to be stable by the end of the stack.

Differential Revision: D35361821
CLA Signed
opened by apsdehal 11

[MUGEN] Add MultimodalGPT Module

Stack from ghstack:

-> #257
#264

Summary:

Defines the model architecture for the full multimodal GPT as the basis for the builder
Defines the API for integration with generation utility
Added latent_shape to reshape the token ids for decoding back to the real data
Pulled token embedding layers out of MultimodalTransformerDecoder and put in MultimodalGPT.

Test Plan:

$ python -m pytest --cov=torchmultimodal/models/ test/models/test_gpt.py -vv
================================================= test session starts ==================================================
platform darwin -- Python 3.8.13, pytest-7.1.2, pluggy-1.0.0 -- /Users/langong/local/miniconda3/envs/t2v/bin/python
cachedir: .pytest_cache
rootdir: /Users/langong/gpt_attention, configfile: pyproject.toml
plugins: mock-3.8.2, cov-3.0.0
collected 20 items

test/models/test_gpt.py::TestMultimodalGPT::test_tokenizers_missing_methods PASSED       [  3%]
test/models/test_gpt.py::TestMultimodalGPT::test_encode_invalid_modality PASSED          [  7%]
test/models/test_gpt.py::TestMultimodalGPT::test_decode_tokens_wrong_shape PASSED        [ 11%]
test/models/test_gpt.py::TestMultimodalGPT::test_decode_tokens_reshape PASSED            [ 15%]
test/models/test_gpt.py::TestMultimodalGPT::test_lookup_invalid_modality PASSED          [ 19%]
test/models/test_gpt.py::TestMultimodalGPT::test_lookup_in_modality PASSED               [ 23%]
test/models/test_gpt.py::TestMultimodalGPT::test_lookup_out_modality PASSED              [ 26%]
test/models/test_gpt.py::TestMultimodalGPT::test_fwd_bad_input PASSED                    [ 30%]
test/models/test_gpt.py::TestMultimodalGPT::test_fwd_for_generation PASSED               [ 34%]
test/models/test_gpt.py::TestMultimodalGPT::test_forward PASSED                          [ 38%]
test/models/test_gpt.py::TestMultimodalGPT::test_forward_logits_mask PASSED              [ 42%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_bad_input PASSED         [ 46%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_in_modality PASSED [ 50%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_out_modality PASSED [ 53%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_two_modality PASSED [ 57%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_eval_right_shift_on PASSED [ 61%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_forward_eval_right_shift_off PASSED [ 65%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_bad_pos_ids PASSED       [ 69%]
test/models/test_gpt.py::TestMultimodalTransformerDecoder::test_optional_pos_ids PASSED  [ 73%]
test/models/test_gpt.py::TestTransformerDecoder::test_forward PASSED                     [ 76%]
test/models/test_gpt.py::TestTransformerDecoder::test_forward_additional_output PASSED   [ 80%]
test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward PASSED                [ 84%]
test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward_masked PASSED         [ 88%]
test/models/test_gpt.py::TestTransformerDecoderLayer::test_forward_additional_output PASSED [ 92%]
test/models/test_gpt.py::test_sigmoid_linear_unit PASSED                                 [ 96%]
test/models/test_gpt.py::test_right_shift PASSED                                         [100%]

---------- coverage: platform darwin, python 3.8.13-final-0 ----------
Name                                            Stmts   Miss  Cover
-------------------------------------------------------------------
torchmultimodal/models/gpt.py                     181      4    98%


==== 26 passed in 1.80s =======

Differential Revision: D38642048

CLA Signed

opened by langong347 10

[FLAVA] Change some initialization orders and corresponding tests
Currently the projections are part of the contrastive loss, which means we need to use "flava for pretraining" for zero-shot. This is odd, since zero-shot should involve only the core model (and not the pretraining model).

The next PR in this stack tried to fix this but broke the tests, because it changed the initialization order of several components.

So I am splitting that PR in two to make sure my logic changes are not actually breaking anything:

This PR, which simply changes the initialization order of the codebook and contrastive loss and updates the test assert values.

The next PR, which makes the projections part of the FLAVA model and doesn't touch the tests.

Test plan pytest

Stack from ghstack (oldest at bottom):

#195

#132

#131

#106

-> #105

Differential Revision: D37466221
CLA Signed
opened by ankitade 10
Generalize CLIPArchitecture

Summary: Generalize CLIPArchitecture to allow two encoders of any modalities and added a test suite for CLIPArchitecture. Ultimately, the goal is to support multimodal models beyond image/text, like MUGEN which uses audio/text/video.

Test plan: Run command pytest --cov=torchmultimodal/architectures/ test/architectures/test_clip.py::TestCLIPArchitecture -vv to run the unit test included in this PR.
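
The generalization described in the summary can be pictured as a two-tower model that makes no assumption about which modalities its encoders handle. The following framework-free sketch is illustrative only: TwoTowerSketch and the toy encoders are hypothetical names, and the L2 normalization step is an assumption about typical CLIP-style models rather than a description of this PR's code.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[float]], Vector]


def l2_normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


class TwoTowerSketch:
    """Holds two arbitrary encoders and returns unit-length embeddings."""

    def __init__(self, encoder_a: Encoder, encoder_b: Encoder):
        self.encoder_a = encoder_a
        self.encoder_b = encoder_b

    def __call__(
        self, features_a: Sequence[float], features_b: Sequence[float]
    ) -> Tuple[Vector, Vector]:
        return (
            l2_normalize(self.encoder_a(features_a)),
            l2_normalize(self.encoder_b(features_b)),
        )


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


# Toy stand-ins for, say, an audio encoder and a text encoder.
def audio_encoder(xs: Sequence[float]) -> Vector:
    return [xs[0], xs[1]]


def text_encoder(xs: Sequence[float]) -> Vector:
    return [2.0 * xs[0], 2.0 * xs[1]]


model = TwoTowerSketch(audio_encoder, text_encoder)
emb_a, emb_b = model([3.0, 4.0], [3.0, 4.0])
print(round(cosine_similarity(emb_a, emb_b), 6))  # 1.0
```

Nothing in the class refers to images or text, so pairing any two encoders (audio and text, video and text, and so on) only changes the arguments passed to the constructor.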
CLA Signed

opened by sophiazhi 10
[feat] FLAVA: Zero-Shot validation, support for pretrained models
Stack from ghstack (oldest at bottom):

#10

#9

#8

-> #6

This PR adds support for ImageNet zero-shot evaluation on the FLAVA model.

It also adds a mixin to easily support loading pretrained models with a key and torch hub.

Currently, the zero-shot evaluations run at the start of validation.

It also includes multiple other features and bug fixes.

Differential Revision: D35232320
CLA Signed
opened by apsdehal 10
Can this model be used for duplicate detection from both image and text?
🚀 The feature, motivation and pitch

A model for near duplicate detection from both image and text.

Given two pairs of input composed of image and text, determine whether they are semantically duplicate or not.

inputA = (imageA, textA) inputB = (imageB, textB)

Determine whether inputA and inputB are near duplicate or not?

Alternatives

No response

Additional context

No response
opened by smith-co 9
[FLAVA] Make projections part of the core model
Move projections from the contrastive loss to the core model. This will allow users to use the model (instead of the pretraining model) for zero-shot. Also moved to using the translated checkpoint.

Test plan

pytest

python -m flava.train config=flava/configs/pretraining/debug.yaml

python -m flava.finetune config=flava/configs/finetuning/qnli.yaml

Stack from ghstack (oldest at bottom):

#195

#132

#131

-> #106

#105

Differential Revision: D37481127
CLA Signed
opened by ankitade 9
Incremental addition of the new modality
🚀 The feature, motivation and pitch

🤗 Hello! Thank you for your work!

I see model configurations that work with certain modalities in this repo, and that is great.

I have a question, though: what if I have a pretrained encoder for another modality (e.g. audio) and data for training (audio-text pairs and audio-image pairs)?

How can I train a model that will be able to solve tasks with my new modality?

In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?

Alternatives

No response

Additional context

It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
opened by averkij 2
ALBEF: Train from scratch

🚀 The feature, motivation and pitch

Hi, thanks for your great efforts on this excellent work! I want to train ALBEF from scratch, but I can only find the fine-tuning code. In the ALBEF paper, they use a pre-trained ViT, and also use BERT to initialize the weights for the text encoder and the multimodal encoder (except the cross-attention modules). But I didn't find these initializations in this code. Could you please let me know where that initialization is done?

Many thanks!

Alternatives

No response

Additional context

No response

opened by XinhaoMei 2
Use CLIP models with pretrained weights

Issue description

Hi, I wanted to ask if it is possible to load the openai/clip-vit-base-patch16 weights into the torchmultimodal.models.clip.model.CLIP model provided by the library.

opened by konradkalita 1
Clip model sample training code

🚀 The feature, motivation and pitch

Hello, I wonder if you are going to provide sample training code (like the examples in the "/examples" folder) for the CLIP model?

Alternatives

No response

Additional context

No response

opened by ShahabMokari 3
Image transform results between HF and our version do not line up
Issue description

Image transform results between HF and our version do not line up

Code example

A minimal repro is here: https://colab.research.google.com/drive/1tcghYqhPjy2G1sbkzy2UUbOmbzrQTkG5#scrollTo=wdCanLBZC2w8 If you look at the last few cells, the text outputs match but the image outputs don't.

A possible discrepancy is that the HF version has a center crop, which is missing in our transform:

https://github.com/huggingface/transformers/blob/v4.24.0/src/transformers/models/flava/feature_extraction_flava.py#L326

https://github.com/facebookresearch/multimodal/blob/main/examples/flava/data/transforms.py#L339
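
To make the suspected discrepancy concrete, the sketch below contrasts two preprocessing routes using only image geometry. It is an illustration under assumptions, not code from either library: the 224-pixel target size, the 640x480 input, and both helper names are hypothetical. One route squashes the image straight to a square, while the other resizes the shorter side and then takes a center crop, so the two routes feed different pixels to the model.

```python
from typing import Tuple


def center_crop_box(width: int, height: int, crop_size: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of a centered square crop."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must fit inside the image")
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


def resize_shorter_side(width: int, height: int, target: int) -> Tuple[int, int]:
    """Resize so the shorter side equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


# Route A: squash straight to 224x224 (no crop, aspect ratio changes).
# Route B: resize the shorter side to 224, then center-crop 224x224.
w, h = resize_shorter_side(640, 480, 224)
print((w, h))                      # (299, 224)
print(center_crop_box(w, h, 224))  # (37, 0, 261, 224)
```

In route B, 37 columns on the left and 38 on the right of the resized image are discarded, whereas route A keeps them but distorts the aspect ratio. A difference of this kind would change image outputs while leaving text outputs untouched.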

Need eyes from @apsdehal to move forward
opened by ankitade 1



Quickly comparing your image classification models with the state-of-the-art models (such as DenseNet, ResNet, ...)

This repository contains the official implementation code of the paper Improving Multimodal Fusion with Hierarchical Mutual Information Maximization for Multimodal Sentiment Analysis, accepted at EMNLP 2021.

QuickAI is a Python library that makes it extremely easy to experiment with state-of-the-art Machine Learning models.

LWCC: A LightWeight Crowd Counting library for Python that includes several pretrained state-of-the-art models.

LaneDet is an open source lane detection toolbox based on PyTorch that aims to pull together a wide variety of state-of-the-art lane detection models

We evaluate our method on different datasets (including ShapeNet, CUB-200-2011, and Pascal3D+) and achieve state-of-the-art results, outperforming all the other supervised and unsupervised methods and 3D representations, all in terms of performance, accuracy, and training time.

A complete, self-contained example for training ImageNet at state-of-the-art speed with FFCV

PaddleViT: State-of-the-art Visual Transformer and MLP Models for PaddlePaddle 2.0+

PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.

Code and pre-trained models for MultiMAE: Multi-modal Multi-task Masked Autoencoders

Deepparse is a state-of-the-art library for parsing multinational street addresses using deep learning

tsai is an open-source deep learning package built on top of Pytorch & fastai focused on state-of-the-art techniques for time series classification, regression and forecasting.

State-of-the-art data augmentation search algorithms in PyTorch

😇A pyTorch implementation of the DeepMoji model: state-of-the-art deep learning model for analyzing sentiment, emotion, sarcasm etc

🤗 Transformers: State-of-the-art Natural Language Processing for Pytorch, TensorFlow, and JAX.

deep-table implements various state-of-the-art deep learning and self-supervised learning algorithms for tabular data using PyTorch.

Implementation of NÜWA, state of the art attention network for text to video synthesis, in Pytorch

Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of Deepmind, in Pytorch

Implementation of ETSformer, state of the art time-series Transformer, in Pytorch