multimodal transformer

Overview

This repo holds the code to perform experiments with the multimodal autoregressive probabilistic model Transflower.

Overview of the repo

The repo is structured into folders that hold the code or assets for the different parts of the workflow. I am using a general framework inspired by the CycleGAN project, modified to work with PyTorch Lightning.

  • training holds the training code. The entry point file is train.py
    • datasets holds the PyTorch datasets, as well as convenience utilities to create the datasets and dataloaders. A dataset is defined as a class inheriting from BaseDataset, with a class name of the form "DatasetNameDataset"; "DatasetName" (but lowercased) is the name used to refer to the dataset in the options.
    • experiments holds the results of training runs, in folders given by the experiment id. Its contents are .gitignored
    • options holds utilities to set up the general argument options, as well as to collect the options specific to the model or dataset specified for the experiment
    • hparams holds the configuration files with hyperparameters that I'm using to run different experiments
  • models defines the models as Lightning modules (actually they inherit from BaseModel, which adds some convenient common functions). A model is defined as a class inheriting from BaseModel, with a class name of the form "ModelNameModel"; "ModelName" (but lowercased) is the name used to refer to the model in the options. Both naming conventions are illustrated in the sketch after this list.
  • inference holds the code and results for testing the model at inference time. In our case, this means generation, as we are working with generative models. Note that this is just a skeleton, and it relies on a generate function defined on each model.
  • feature_extraction holds code to extract features. At the moment functions to process features for audio and motion data are available
  • data holds the actual data, and a couple of scripts to download data
  • analysis holds code to analyze and explore the data, as well as to visualize results. Beware of files named "sandbox": I use them interactively with Hydrogen, so to an outside observer they will look like a jumbled mess of code (which they are). They are literally sandboxes of code to play around in.
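To make the naming convention above concrete, here is a hypothetical sketch (these classes are not in the repo, and the import paths are assumptions; BaseDataset and BaseModel are the base classes mentioned above):

from training.datasets.base_dataset import BaseDataset  # assumed import path
from models.base_model import BaseModel                 # assumed import path

# Hypothetical example of the naming convention; not actual repo code.
class MynewDataset(BaseDataset):  # selected with --dataset_name=mynew
    ...

class MynewModel(BaseModel):      # selected with --model=mynew
    ...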

How to run training and inference

The scripts in the root folder run the common high-level tasks. script_generate.sh generates samples from the model of a particular experiment. script_train.sh trains a model on some data, as specified by a particular hyperparameters file, though any of the hparams can be overridden as argparse arguments. You can add --help to get an explanation of what the different options/hparams do. Note that the options parser (in training/options/base_options.py) gathers the set of available options by aggregating the base options with the options defined by the particular model and dataset you have specified in the model and dataset_name options.
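To illustrate the aggregation pattern (a hedged sketch only: the method name modify_commandline_options is borrowed from the original CycleGAN codebase this framework was inspired by, and may not match what base_options.py actually does):

import argparse

# Sketch of how base options could be aggregated with model-specific options.
# The method name and the --dhid option are illustrative assumptions.
class ExampleModel:
    @staticmethod
    def modify_commandline_options(parser):
        parser.add_argument("--dhid", type=int, default=512)  # hypothetical model-specific hparam
        return parser

def gather_options(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", default="example")            # base option
    parser.add_argument("--dataset_name", default="multimodal")  # base option
    parser = ExampleModel.modify_commandline_options(parser)     # add the chosen model's options
    return parser.parse_args(argv)

print(gather_options([]))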

You can run ./script_train.sh transflower_expmap to train using the hparams file in training/hparams/dance_combined/transflower_expmap.yaml, which is the configuration we used in the paper. This assumes the data is inside data/dance_combined3, but that can be changed with the --data_dir argument.

You can run ./script_generate.sh transflower_expmap the_basement --generate_bvh --data_dir data/dance_combined3 --generate_video to generate using the model trained in the experiment transflower_expmap, on the sequence with sequence id the_basement. This sequence_id is just the basename of the original sound file.

Training on your own data

First install requirements with pip install -r requirements.txt

Then, I assume you have a folder, say data/myData, which contains bvh files and wav files that are paired and time-synced. In particular, paired files should have the same basename; for example, sequence1.bvh and sequence1.wav would be paired. With that, you can run ./feature_extraction/audio_feature_extraction.sh data/myData and ./feature_extraction/motion_feature_extraction.sh data/myData to extract the audio and motion features. You can pass the argument --replace_existing if you want to recompute features with the same names. This will create files with names like sequence1.audio_feats_scaled_20.npy and sequence1.expmap_cr_scaled_20.npy. The audio_feats_scaled_20 and expmap_cr_scaled_20 parts are the "feature names", which should match the ones given in the input_modalities and output_modalities arguments when training (see the different hparams files). There are different audio and motion feature options available, which can be seen in the scripts called by audio_feature_extraction.sh and motion_feature_extraction.sh.
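If you want a quick sanity check of the folder layout, a small hypothetical helper like the following (not part of the repo) can verify that every wav file has a matching bvh file and that the expected feature files exist after extraction:

from pathlib import Path

# Hypothetical sanity check for a folder like data/myData (not repo code):
# every .wav should have a .bvh with the same basename, and after feature
# extraction the corresponding .npy feature files should exist.
def check_data_folder(data_dir, feature_names=("audio_feats_scaled_20", "expmap_cr_scaled_20")):
    data_dir = Path(data_dir)
    for wav in sorted(data_dir.glob("*.wav")):
        base = wav.stem
        if not (data_dir / (base + ".bvh")).exists():
            print("missing motion file for", base)
        for feat in feature_names:
            if not (data_dir / (base + "." + feat + ".npy")).exists():
                print("missing feature", feat, "for", base)

check_data_folder("data/myData")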

Once this is done, you can call ./script_train.sh transflower_expmap --data_dir=data/myData to train, as mentioned above. The checkpoints and results will be found in training/experiments/transflower_expmap. You can use tensorboard --logdir training/experiments to track progress with TensorBoard.

More options: You can specify a different experiment name which will look for a different hparams file inside training/hparams/dance_combined. You can also directly pass arguments to ./script_train.sh. You can see some of them by typing python training/train.py -h, but this is not a complete list, as some arguments are specific to datasets and models. You can see the model and dataset specific arguments by looking at their class definitions. The dataset definition is in training/datasets/multimodal_dataset.py and the model is in models/transflower_model.py.

Generating a dance

To give a more concrete example, and following the exact workflow in the provided Google Colab (which I recommend using to try the model), assume we have a file called myCoolSong.wav. After installing dependencies, we create a folder to store it, as well as one for the seeds and pretrained models, and download them. This assumes you have gsutil installed.

pip install -r requirements.txt
apt-get install ffmpeg
mkdir songs
mkdir training/experiments
gsutil -m cp -r gs://metagen/models/transflower_expmap_old training/experiments/
gsutil -m cp -r gs://metagen/models/transflower_expmap_finetune2_old training/experiments/
gsutil -m cp -r gs://metagen/scalers/* songs/
gsutil -m cp -r gs://metagen/seeds/* songs/

We then extract the music features as follows

chmod +x ./feature_extraction/audio_feature_extraction_test.sh
chmod +x ./feature_extraction/script_to_list_filenames
./feature_extraction/audio_feature_extraction_test.sh songs/

Finally, we can generate as follows

./script_generate.sh transflower_expmap_old myCoolSong --generate_bvh --generate_video --data_dir=songs

and the result will be inside inference/generated/transflower_expmap_old/videos.


Next we explain in more detail the structure and options of the general dataset format we use, as well as the available models and their main options.

Multimodal dataset

The multimodal dataset in training/datasets is designed to be a general-purpose dataset that works for all the experiments within the scope of this repository. It expects numpy arrays corresponding to time-synced, same-length sequences of a number of "input modalities" and a number of "output modalities", found in the folder given by the --data_dir argument. These should be numpy arrays of rank 2, where the first dimension is the sequence dimension and the second one is the feature dimension. The files in the folder should be named [sequence_id].[feature_name].npy. The dataset reads which sequence ids to use for training from a file named base_filenames_train.txt in the same folder as the data, and then picks up the desired input and output modalities, as specified (comma-separated) in the --input_modalities and --output_modalities arguments. It constructs individual data points from these sequences according to the window length you specify for each input modality (comma-separated) in --input_lengths and for each output modality (comma-separated) in --output_lengths; you can also provide offsets for each modality in --input_time_offsets and --output_time_offsets, both zero-indexed.

In the image below you see an illustration of what a data point would look like with two input modalities (e.g. music and movement) and one output modality (e.g. movement), with the specified lengths and offsets. This way of specifying data points is quite general and can cover a variety of tasks, including the multimodal autoregressive generation tasks we are studying here.

multimodal dataset example

Note that although it is not a technical requirement for the above dataset to work, for most tasks we expect the sequences to be sampled at the same frequency and to have the same length (the latter requirement could be relaxed, since we take subwindows of different sizes, but at the moment it is what the dataset expects).
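To make the windowing above concrete, here is a minimal stand-alone sketch of how one data point could be sliced from per-modality arrays given lengths and offsets (the real implementation is training/datasets/multimodal_dataset.py; the names and feature dimensions below are illustrative):

import numpy as np

# Minimal sketch of the data point construction described above; not the
# actual dataset code.
def make_data_point(features, input_mods, output_mods,
                    input_lengths, output_lengths,
                    input_offsets, output_offsets, t):
    # features: dict mapping feature_name -> array of shape (T, d); t is the
    # frame index at which this data point starts.
    inputs = [features[m][t + off : t + off + l]
              for m, l, off in zip(input_mods, input_lengths, input_offsets)]
    outputs = [features[m][t + off : t + off + l]
               for m, l, off in zip(output_mods, output_lengths, output_offsets)]
    return inputs, outputs

# Two input modalities (music and motion) and one output modality (future
# motion), with arbitrary feature dimensions.
T = 100
feats = {"audio_feats_scaled_20": np.random.randn(T, 80),
         "expmap_cr_scaled_20": np.random.randn(T, 67)}
ins, outs = make_data_point(
    feats,
    input_mods=["audio_feats_scaled_20", "expmap_cr_scaled_20"],
    output_mods=["expmap_cr_scaled_20"],
    input_lengths=[10, 5], output_lengths=[2],
    input_offsets=[0, 0], output_offsets=[5], t=0)
print([x.shape for x in ins], [y.shape for y in outs])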

Multimodal models

Most of the models studied here are designed to work with the above multimodal dataset, and so they will build a model to fit the corresponding data point structure (number, and lengths, of the input and output modalities). The modality dimensions (number of features) are specified in the --dins and --douts arguments.

Transformer model

The basic (multimodal) transformer model is a generalization of the model proposed in this paper, to work with any number of input or output modalities.

Here's an example of how it would look with the input and output modalities described above:

multimodal transformer example

Note:

  • The "output" of the model is taken to be the first N outputs of the cross modal transformer, starting from the first, where N is the desired output_length. Note that this limits the output length to be at most the sum of input lengths. One could extend the model to relax this
  • If there were multiple output modalities, the input-modality transformers would be shared, but a new "cross-modal" head would be added per output modality.
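Here is a rough PyTorch sketch of the structure just described (per-modality transformers feeding a cross-modal transformer, with the first N positions taken as the output). This is not the repo's actual transflower_model.py; dimensions and layer counts are arbitrary.

import torch
import torch.nn as nn

# Rough sketch of the multimodal transformer described above; not repo code.
# Each input modality has its own transformer; their outputs are concatenated
# along the time axis and fed to a cross-modal transformer, and the first
# output_length positions are taken as the output.
class ToyMultimodalTransformer(nn.Module):
    def __init__(self, dins, dhid, dout, output_length, nhead=4, nlayers=2):
        super().__init__()
        def block():
            layer = nn.TransformerEncoderLayer(dhid, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=nlayers)
        self.projs = nn.ModuleList([nn.Linear(d, dhid) for d in dins])
        self.mod_transformers = nn.ModuleList([block() for _ in dins])
        self.cross_transformer = block()
        self.out = nn.Linear(dhid, dout)
        self.output_length = output_length

    def forward(self, xs):  # xs: list of (batch, length_i, dins[i]) tensors
        hs = [trf(proj(x)) for trf, proj, x in zip(self.mod_transformers, self.projs, xs)]
        h = self.cross_transformer(torch.cat(hs, dim=1))   # concatenate along time
        return self.out(h[:, :self.output_length])         # keep only the first N outputs

model = ToyMultimodalTransformer(dins=[80, 67], dhid=64, dout=67, output_length=2)
y = model([torch.randn(1, 10, 80), torch.randn(1, 5, 67)])
print(y.shape)  # torch.Size([1, 2, 67])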

Transflower model(s)

The idea of using autoregressive normalizing flows (NFs), in particular for probabilistic modelling of motion, was inspired by MoGlow. The idea of the Transflower model is to combine that with the multimodal transformer defined above. This can be done simply by feeding the output of the multimodal transformer (MT) as conditioning for the normalizing flow, through the flow's coupling layers, as illustrated below.

transflower example
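A minimal sketch of this conditioning mechanism: an affine coupling layer whose coupling network also receives a context vector (e.g. the MT output). This is a simplified stand-in, not the Flow++-style coupling with a Transformer coupling network used in the actual model.

import torch
import torch.nn as nn

# Simplified affine coupling layer conditioned on a context vector; this only
# illustrates the conditioning idea, not the actual Transflower coupling.
class ConditionalAffineCoupling(nn.Module):
    def __init__(self, dim, context_dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)))

    def forward(self, x, context):   # x -> z, plus sum of log|det J|
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(torch.cat([x1, context], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)     # keep the scales well-behaved
        z2 = x2 * torch.exp(log_s) + t
        return torch.cat([x1, z2], dim=1), log_s.sum(dim=1)

    def inverse(self, z, context):    # z -> x, used when sampling
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(torch.cat([z1, context], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([z1, (z2 - t) * torch.exp(-log_s)], dim=1)

layer = ConditionalAffineCoupling(dim=67, context_dim=64)
z, sldj = layer(torch.randn(8, 67), torch.randn(8, 64))
print(z.shape, sldj.shape)  # (8, 67) and (8,)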

Within the normalizing flow there are a number of possible architectural choices, some of which correspond to the model variations in the models folder.

  • The Transflower model itself uses the Flow++ architecture, but with hyperparameter choices that simplify the model, since we are not working with images. Furthermore, the coupling neural network has been replaced with a Transformer network.
  • The Transformerflow model is an earlier iteration which is the same, except that the coupling neural network is left as it was in Flow++.
  • The Transglower model is an attempt at combining the MT with the original MoGlow model (which feeds an LSTM into the coupling layers of the Glow model, which is very similar to the Flow++ one we are using, given our simplified hyperparameters).
  • There's also an implementation of the original MoGlow model itself, to be used as a baseline
  • The ResidualFlower model is a new experiment I'm trying now, where the mean of the output is predicted by a (deterministic) MT, and the NF models only the deviation (residual) from the mean.

You can see some more details in this post

Normalizing flows

A very quick summary of normalizing flows. Normalizing flows are implicit generative models which also allow computation of the exact probability (so they are explicit at the same time? I find the nomenclature confusing). This is possible because the function mapping the stochastic latent to the output is reversible. We usually define the following:

  • The forward flow is a function mapping a vector/tensor in the "output domain" to a vector/tensor of the same size in the "latent domain". This flow has to be reversible, which in particular implies that its Jacobian is non-singular
  • The reverse flow is the inverse of the above function. This is the flow that allows sampling from the model, by sampling a latent vector from the latent distribution and sending it through the flow.

The model is trained by exact likelihood maximization. We can compute the likelihood of the data using the change of variables formula: P(X) equals the (absolute value of the) determinant of the Jacobian of the forward flow times Prior(f(X)), where Prior is the latent distribution (usually a standard Gaussian) and f(X) is the Z to which X is mapped by the forward flow. By the chain rule, the Jacobian determinant of the flow can be decomposed into a product of the Jacobian determinants of the atomic operations in the flow. There are tricks to make these Jacobians more efficient to compute. I think sldj in the code stands for "sum of log determinants of Jacobians", as the product becomes a sum after taking the log.
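In code, the training objective boils down to something like the following schematic sketch (not the repo's actual training code; the toy flow at the end is only there so the snippet runs):

import torch

# Schematic likelihood computation for a normalizing flow.
# flow_forward maps x -> z and also returns the "sum of log determinants of
# Jacobians" (sldj); then log p(x) = log Prior(f(x)) + sldj.
def log_likelihood(flow_forward, x):
    z, sldj = flow_forward(x)
    prior = torch.distributions.Normal(0.0, 1.0)   # standard Gaussian prior
    return prior.log_prob(z).sum(dim=1) + sldj

# Toy flow: x -> (x - b) * exp(-a) elementwise, whose log|det Jacobian| is
# -sum(a) for every sample.
a, b = torch.zeros(3), torch.zeros(3)
def toy_flow(x):
    return (x - b) * torch.exp(-a), (-a.sum()) * torch.ones(x.shape[0])

x = torch.randn(8, 3)
loss = -log_likelihood(toy_flow, x).mean()   # negative log-likelihood to minimize
print(loss)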

Most of the ingenuity in designing normalizing flows has gone into designing expressive reversible functions. The Real NVP paper introduced reversible "coupling layers", which are a very flexible (and, I think, clever) class of reversible transformations that also have simple Jacobians. I recommend the Real NVP paper as a good introduction to NFs. The Glow paper introduced reversible 1x1 convolutions, which are also often used.

MoGlow made use of the coupling layers to make the NF conditional on the recent history, thus making the model autoregressive.

As an interesting recent development in NFs, there's FFJORD, which exploits the fact that ODEs are reversible, so Neural ODEs can be used to build NFs!

There's a recent review of different types of probabilistic deep generative models. However, it only focuses on images, so it is unclear how much its conclusions transfer to other domains. I think the question of which generative models work well for different tasks is a very interesting and valuable one, as these models are so fundamental and useful for real-world applications, where stochasticity and uncertainty mean that we need to model probability distributions!

As a very quick summary, from that paper + my experience, the main advantages of NFs are:

  • They model the distribution very faithfully, avoiding problems such as mode collapse. This typically means they are better at producing diverse solutions than models that suffer from mode collapse.
  • They do exact inference, meaning that if your model is flexible enough and your optimizer is reasonable, you shouldn't have the stability problems often found in GANs.

The main disadvantages are:

  • They are slower at training and inference. I am not sure how much this is the case quantitatively, and I think it may also depend on different factors, like how good your optimizer is. But generally, a well-tempered GAN can train faster, and will also be faster at inference time. I think the slower inference of NFs is mostly because they are less parameter efficient (you need larger models to reach similar performance). Continuous NFs (like FFJORD) seem to be a lot more parameter efficient, but they are slow for other reasons, namely that Neural ODEs are slow.
  • Because they are so good at modelling the distribution of the data, they can also model errors/flukes in the data, which then show up in the output (with the frequency at which they appeared in the data), while some less powerful models may actually be unaffected by these outliers.

Overall, at least in the image domain (it's less clear for other modalities), they produce better quality results than VAEs, but worse than GANs. However, they may be more stable and thus easier to train than GANs. I think GANs are worth giving a try, though. Other models that I think are worth trying are VQ-VAEs and energy-based models, which have different strengths and weaknesses, but have both shown very good performance at modelling complex distributions (again, mostly on images). VQ-VAEs are also state of the art at music modelling and close to SOTA at image modelling, so they are worth looking at, I think. I am giving NFs a try mostly because they are currently SOTA for motion modelling, and they have nice theoretical properties (thanks to exact inference).

Data

We are compiling the dataset, which comprises several sources, and will release it soon.

Some parts of the dataset are already publicly available. The AIST++ dataset is available here, and most of the VR dances are available from several playlists, including this one, this one, this one and this channel. These dances can all be visualized in 3D in the VRChat world Papa Dance by lox9973. The VR dance videos can be converted to bvh (and retargeted to any humanoid skeleton) using this Unity tool, and the features processed as explained in the code. However, we will release the full set of features soon, so that it will be easier to replicate and build on this dataset :)

Cite

If you use models or code from this repo, cite us using

@article{vallepérez2021transflower,
      title={Transflower: probabilistic autoregressive dance generation with multimodal attention}, 
      author={Guillermo Valle-Pérez and Gustav Eje Henter and Jonas Beskow and André Holzapfel and Pierre-Yves Oudeyer and Simon Alexanderson},
      year={2021},
      eprint={2106.13871},
      archivePrefix={arXiv},
      primaryClass={cs.SD}
}

Community

I'll be continuing some extensions of this work, with the aim of developing open source tools and datasets. You can join the community Discord, where we discuss progress on this project, which is part of an umbrella project to explore and develop the intersection between VR and AI :). See http://metagen.ai
