VideoGPT: Video Generation using VQ-VAE and Transformers

Wilson Yan

Last update: Dec 30, 2022


Overview

We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural videos from UCF-101 and the Tumblr GIF Dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models.

Installation

Change the cudatoolkit version to one that is compatible with your machine.

$ conda install --yes -c pytorch pytorch=1.7.1 torchvision cudatoolkit=11.0
$ pip install git+https://github.com/wilson1yan/VideoGPT.git

Sparse Attention (Optional)

For limited compute scenarios, it may be beneficial to use sparse attention.

$ sudo apt-get install llvm-9-dev
$ DS_BUILD_SPARSE_ATTN=1 pip install deepspeed

After installing deepspeed, you can train a sparse transformer by setting the flag --attn_type sparse in scripts/train_videogpt.py. The default supported sparsity configuration is an N-d strided sparsity layout; however, you can write your own arbitrary layouts to use.
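
As a rough illustration of what a strided layout means, the sketch below builds a 1-D causal strided attention mask in plain Python, in the style of the Sparse Transformer. It is not the layout code used by this repo or by deepspeed; strided_causal_mask and its parameters are hypothetical names chosen for the example, and the N-d case is assumed to apply the same idea along each axis.

```python
def strided_causal_mask(seq_len, stride):
    """Return mask[i][j] == True where query i may attend to key j.

    Illustrative 1-D strided pattern (hypothetical helper, not the repo's API):
    each position sees the previous `stride` positions (a local window)
    plus every earlier position that is a multiple of `stride` steps back.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for j in range(i + 1):  # causal: only keys at or before the query
            local = (i - j) < stride
            strided = (i - j) % stride == 0
            mask[i][j] = local or strided
    return mask


mask = strided_causal_mask(seq_len=8, stride=4)
for row in mask:
    # 'x' marks an allowed (query, key) pair, '.' a masked one
    print("".join("x" if allowed else "." for allowed in row))
```

Each row of the printed grid has at most roughly 2 * stride allowed entries for short sequences, which is where the compute savings over full attention come from.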

Dataset

The default code accepts data either as an HDF5 file in the format specified in videogpt/data.py, or as a directory with the following structure:

video_dataset/
    train/
        class_0/
            video1.mp4
            video2.mp4
            ...
        class_1/
            video1.mp4
            ...
        ...
        class_n/
            ...
    test/
        class_0/
            video1.mp4
            video2.mp4
            ...
        class_1/
            video1.mp4
            ...
        ...
        class_n/
            ...

An example of such a dataset can be constructed from UCF-101 data by running the script

sh scripts/preprocess/create_ucf_dataset.sh datasets/ucf101

You may need to install unrar and unzip for the code to work correctly.

If you do not care about classes, the class folders are not necessary and the dataset file structure can be collapsed into train and test directories of just videos.
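
To make the two accepted layouts concrete, here is a minimal sketch that walks a dataset directory and pairs each video with its class folder, or with None when the class folders have been collapsed. The helper list_videos is hypothetical and written only for illustration; the repo's actual loader lives in videogpt/data.py and may behave differently.

```python
import tempfile
from pathlib import Path


def list_videos(root, split):
    """Collect (video_path, class_name) pairs for one split ('train' or 'test').

    Hypothetical helper, not part of the videogpt package.
    """
    split_dir = Path(root) / split
    samples = []
    for path in sorted(split_dir.rglob("*.mp4")):
        # A video sitting directly under train/ or test/ has no class folder.
        class_name = path.parent.name if path.parent != split_dir else None
        samples.append((path, class_name))
    return samples


# Build a tiny throwaway dataset to show both layouts.
with tempfile.TemporaryDirectory() as tmp:
    for rel in ["train/class_0/video1.mp4",
                "train/class_1/video1.mp4",
                "test/video1.mp4"]:
        f = Path(tmp) / rel
        f.parent.mkdir(parents=True, exist_ok=True)
        f.touch()
    train = list_videos(tmp, "train")
    test = list_videos(tmp, "test")

print([(p.name, c) for p, c in train])
print([(p.name, c) for p, c in test])
```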

Using Pretrained VQ-VAEs

There are four available pre-trained VQ-VAE models. All strides listed with each model are downsampling amounts across THW for the encoders.

bair_stride4x2x2: trained on 16 frame 64 x 64 videos from the BAIR Robot Pushing dataset
ucf101_stride4x4x4: trained on 16 frame 128 x 128 videos from UCF-101
kinetics_stride4x4x4: trained on 16 frame 128 x 128 videos from Kinetics-600
kinetics_stride2x4x4: trained on 16 frame 128 x 128 videos from Kinetics-600, with 2x larger temporal latent codes (achieves slightly better reconstruction)

import torch
from torchvision.io import read_video
from videogpt import load_vqvae
from videogpt.data import preprocess

video_filename = 'path/to/video_file.mp4'
sequence_length = 16
resolution = 128
device = torch.device('cuda')

# Load the pretrained model and move it to the same device as the input
vqvae = load_vqvae('kinetics_stride2x4x4').to(device)
# read_video returns (video, audio, info); keep only the video frames
video = read_video(video_filename, pts_unit='sec')[0]
video = preprocess(video, resolution, sequence_length).unsqueeze(0).to(device)

# Encode to discrete latent codes, then decode back to a video
encodings = vqvae.encode(video)
video_recon = vqvae.decode(encodings)
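
The strides in the model names above determine the size of the latent grid. The sketch below works that arithmetic out, assuming each encoder axis is downsampled by exact integer division; latent_shape is a hypothetical helper, not a function of the videogpt package.

```python
def latent_shape(sequence_length, resolution, stride):
    """Latent grid (T, H, W) for a clip, given per-axis encoder strides.

    Hypothetical helper; assumes exact division along every axis.
    """
    t, h, w = stride
    return (sequence_length // t, resolution // h, resolution // w)


# (T, H, W) strides and training resolution, read off the model list above.
models = {
    "bair_stride4x2x2": ((4, 2, 2), 64),
    "ucf101_stride4x4x4": ((4, 4, 4), 128),
    "kinetics_stride4x4x4": ((4, 4, 4), 128),
    "kinetics_stride2x4x4": ((2, 4, 4), 128),
}

shapes = {name: latent_shape(16, resolution, stride)
          for name, (stride, resolution) in models.items()}
for name, shape in shapes.items():
    print(name, shape)
```

Under this assumption, kinetics_stride2x4x4 yields 8 latent time steps instead of 4, which is the "2x larger temporal latent codes" noted in the list.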

Training VQ-VAE

Use the scripts/train_vqvae.py script to train a VQ-VAE. Execute python scripts/train_vqvae.py -h for information on all available training settings. A subset of more relevant settings are listed below, along with default values.

VQ-VAE Specific Settings

--embedding_dim: number of dimensions for codebook embeddings
--n_codes 2048: number of codes in the codebook
--n_hiddens 240: number of hidden features in the residual blocks
--n_res_layers 4: number of residual blocks
--downsample 4 4 4: T H W downsampling stride of the encoder
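
To show how --n_codes and --embedding_dim relate, the following is a minimal sketch of the nearest-neighbour lookup at the core of vector quantization, written in plain Python with a toy codebook. The function quantize is hypothetical and is not the implementation in videogpt/vqvae.py.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Hypothetical helper: `codebook` is a list of n_codes vectors,
    each of length embedding_dim.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]


# Toy codebook: n_codes = 4, embedding_dim = 2 (far smaller than real settings).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
index, code = quantize([0.9, 0.2], codebook)
print(index, code)
```

The encoder output at each latent position is replaced by its nearest code, so the discrete latent that the transformer later models is just the integer index.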

Training Settings

--gpus 2: number of gpus for distributed training
--sync_batchnorm: uses SyncBatchNorm instead of BatchNorm3d when using > 1 gpu
--gradient_clip_val 1: gradient clipping threshold for training
--batch_size 16: batch size per gpu
--num_workers 8: number of workers for each DataLoader

Dataset Settings

--data_path : path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
--resolution 128: spatial resolution to train on
--sequence_length 16: temporal resolution, or video clip length

Training VideoGPT

You can download a pretrained VQ-VAE, or train your own. Afterwards, use the scripts/train_videogpt.py script to train a VideoGPT model for sampling. Execute python scripts/train_videogpt.py -h for information on all available training settings. A subset of the more relevant settings is listed below, along with default values.

VideoGPT Specific Settings

--vqvae kinetics_stride4x4x4: path to a vqvae checkpoint file, OR a pretrained model name to download. Available pretrained models are: bair_stride4x2x2, ucf101_stride4x4x4, kinetics_stride4x4x4, kinetics_stride2x4x4. BAIR was trained on 64 x 64 videos, and the rest on 128 x 128 videos
--n_cond_frames 0: number of frames to condition on. 0 represents a non-frame conditioned model
--class_cond: trains a class conditional model if activated
--hidden_dim 576: number of transformer hidden features
--heads 4: number of heads for multihead attention
--layers 8: number of transformer layers
--dropout 0.2: dropout probability applied to features after attention and positionwise feedforward layers
--attn_type full: full or sparse attention. Refer to the Installation section for how to install sparse attention
--attn_dropout 0.3: dropout probability applied to the attention weight matrix

Training Settings

--gpus 2: number of gpus for distributed training
--sync_batchnorm: uses SyncBatchNorm instead of BatchNorm3d when using > 1 gpu
--gradient_clip_val 1: gradient clipping threshold for training
--batch_size 16: batch size per gpu
--num_workers 8: number of workers for each DataLoader

Dataset Settings

--data_path : path to an hdf5 file or a folder containing train and test folders with subdirectories of videos
--resolution 128: spatial resolution to train on
--sequence_length 16: temporal resolution, or video clip length

Sampling VideoGPT

After training, the VideoGPT model can be sampled using the scripts/sample_videogpt.py script. You may need to install ffmpeg: sudo apt-get install ffmpeg
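
Conceptually, sampling draws the discrete latent codes one at a time and then decodes the resulting grid with the VQ-VAE. The sketch below shows only the autoregressive loop, assuming a raster-scan order over (T, H, W) and using a stand-in uniform distribution in place of the trained transformer. Both sample_latents and uniform_prior are hypothetical names, not the repo's API.

```python
import random


def sample_latents(shape, n_codes, next_code_probs, seed=0):
    """Sample a (T, H, W) grid of code indices one position at a time.

    `next_code_probs(prefix)` stands in for the trained model: it takes the
    codes sampled so far and returns n_codes weights for the next code.
    """
    rng = random.Random(seed)
    t, h, w = shape
    codes = []
    for _ in range(t * h * w):
        probs = next_code_probs(codes)
        codes.append(rng.choices(range(n_codes), weights=probs, k=1)[0])
    # Reshape the flat raster-scan sequence into nested T x H x W lists.
    return [[codes[(i * h + j) * w:(i * h + j + 1) * w] for j in range(h)]
            for i in range(t)]


def uniform_prior(prefix):
    """Placeholder model: every one of 8 codes is equally likely."""
    return [1.0] * 8


latents = sample_latents((2, 2, 3), n_codes=8, next_code_probs=uniform_prior)
print(latents)
```

In the real pipeline the sampled indices would be passed to the VQ-VAE decoder to produce video frames.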

Reproducing Paper Results

Note that this repo is designed primarily for simplicity and for extending our method. The full paper results can be reproduced using code found in a separate repo; be aware, however, that that code is not as clean.

Citation

Please consider using the following citation when using our code:

@misc{yan2021videogpt,
      title={VideoGPT: Video Generation using VQ-VAE and Transformers}, 
      author={Wilson Yan and Yunzhi Zhang and Pieter Abbeel and Aravind Srinivas},
      year={2021},
      eprint={2104.10157},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Comments

Could this be used as a next frame predictor?

Do you have a minimal example of feeding the model with N frames and then reconstructing N + M frames or M frames only which however begin at the last frame of N? I would want to use it for next-frame-predictor.

opened by radi-cho 9
unknown error

when i try to import the packages, i get this error:

ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/opt/conda/lib/python3.7/site-packages/torchmetrics/utilities/data.py)

i'm running the program on the kaggle laboratory. can anyone help me? thanks

opened by BoccheseGiacomo 7
Generating longer sequences

Hi! Thank you for your wonderful work and for releasing the code publicly! Could you please tell whether there is an easy way (from the implementation perspective) to generate videos longer than 16 frames with your model trained on 16-frames videos?

My initial attempt was to simply edit the sample(...) method to run over the latent shape of (8, 32, 32) instead of (4, 32, 32) (for UCF-101), but that didn't work due to caching and positional embeddings being constrained to the (4, 32, 32) shape. So I suppose that I need to change other places as well to discard old context, but it's not clear what these places should be.

opened by universome 6
Some questions on VideoGPT.

Hi. Thanks for this great work. I have some questions on the general status of video generative modeling and the implementation sides, which I would appreciate answers from authors and other people who might be interested in making them.

Is this model worth being called "a video GPT" when compared to the community's consensus on the achievable encoding/decoding capacities and qualities of publicly available state-of-the-art video generative models?

After looking through the implementation, I quickly noticed that VideoGPT consumes video files that are composed of around 16 consecutive frames from 25+ fps raw videos (e.g., UCF-101). That is, modeling video clips that run much less than one second is the problem of interest in this model.

I am not fully aware of the current state of video generative modeling, but I am curious whether even these short video models are significantly harder to design and train than the VQ-VAE family for still images. (I am not sure "short" is the right word, but since most YouTube video clips are, I think, at least 10 seconds long, I am using it.)

The first impression I had from the name "VideoGPT" was like it can model videos of tens of seconds, just like the GPT models which can generate paragraph-length sentences. But I think, in the current form of VideoGPT, there may be lots of spaces for further improvements to actually achieve similar generative capabilities of the GPT from field of natural language generation.

Why are codebook vectors first initialized with very-specific initializers like randn, zeros, then re-initialized with latent states of a training batch?

Referring to lines 126-156 at https://github.com/wilson1yan/VideoGPT/blob/master/videogpt/vqvae.py.

I am just wondering if there are some hacky reasons behind doing this. It is obviously simpler to just initialize the codebook for just one time with those calculated latents of a training batch.

Why are Transformer and VQ-VAE separately (not jointly) trained?

I think there would be mixed reasons for this (e.g., memory consumption, training instabilities, etc.). But what is the major & most important reason not to perform joint training?

opened by kami93 4
Number of Layers for Prior Model

Will there be a significant performance drop with 8 attention layer or 4 attention layer prior models on the bair pushing dataset? I also found paper used a 16 attention layer prior model but the default setting in the code base is 8 attention layer.

opened by xinbowu2 4
vqvae finetuning procedures

Hi, I'm attempting to finetune the pretrained ucf vqvae on another dataset. But the reconstructions deteriorate very quickly into something blurry with watery motion (~400 training steps; batch size=32).

I was wondering if there are any procedures that I need to follow for finetuning to work (e.g. setting codebook._need_init = False)?

Thanks in advance! (sorry to bother you with such minute details :'( )

opened by Goulustis 4
How many frames (seconds) are there in each video sample used in the training process?

How many frames (seconds) are there in each video sample used in the training process? What's the video length in the dataset? Did you directly use the complete video or slice the video?

opened by 962858249 3
Issues about the python version

The Python version I use is 3.6.13. When I ran "pip install git+https://github.com/wilson1yan/VideoGPT.git", I found that "p~=0.8.1" in the requirements.txt cannot be installed. Could you tell me the Python version required to run the code?

opened by baiyuting 3
Question about data dependent initialization

Hi, thanks for sharing the nice repo. I have a question about the codebook initialization. If you re-initialize the codebook only once, using the training-batch encoder output at the beginning of the training stage, will you get a very low commit loss after that? Will it affect generalization to testing data? In my case, I want to borrow the data-dependent re-initialization technique and apply it to my own project. I find that it improves code usage and prevents codebook collapse but fails to generalize to testing data (there is a high reconstruction and commit loss on testing data). Is there any insight about it? Thank you!

opened by sukun1045 3
deepspeed version

Dear authors,

which version of deepspeed have you used? I am using deepspeed = 0.5.10 and get error with from deepspeed.ops.sparse_attention import MatMul, Softmax

I import the module with from deepspeed.ops.sparse_attention.matmul import MatMul. However, IndexError happens in the forward.

best

opened by yrcong 2
the previous dependency is only one time step during inference?

https://github.com/wilson1yan/VideoGPT/blob/d157da51b3b9766648eb1e54a1008ff965e26b65/videogpt/gpt.py#L97-L107

hi, @wilson1yan! In these lines, it seems that the iterative generation of the next code only depends on the one previous time step? The shape of embeddings_slice is always [bs, 1, 1, 1, embed_dim].

opened by PanXiebit 2
Questions about downsample and codebook

Hello, in videoGPT, if I use the ucf101 dataset for training and change the model stride (such as 2x16x16). Will the dimensions of the codebook change accordingly? If it changes, do I need to set any hyperparameters during training, or just set --downsample 2 16 16?

opened by qklee-lz 0
Extending your code for poses
This is a "support" request rather than a bug or a feature.

Idea

I have a "video" sequence that is represented as skeletal poses rather than video frames. Each pose was extracted from a video frame, and is now a tensor of shape [frames, keypoints, dimensions] such that a [100,137,2] tensor would be a 2D pose, of 137 keypoints, over 100 frames.

As there is no consistent spatial information between strides, we can imagine the dimensions to be equivalent to channels, and apply a full-size convolution for in_channels=2, out_channels=C, kernel_size=(F, 137), stride=S. (Where C, S, and F are hyperparameters, and can run for multiple layers)

After multiple layers of convolution, these representations will then be quantized, and be decoded in the reverse way.

Why fork from your library?

It seems stable

It works for the same sample rate - unlike audio based models

Support request:

While I can make the data loading model load these tensors and perform the necessary data augmentation, etc., I'm having some trouble understanding how to properly implement the convolutional encoder and decoder. (This is different, as it is not downsampling over a spatially/temporally consistent input.)

Could you please offer some guidance?

Thanks!
opened by AmitMY 1
Tensor Size Mismatch in attention.py

Getting the following runtime error when running train_videogpt.py on a custom dataset. As for the training args, I trained the vqvae with a batch size of 16, and doing so the same for videogpt. Attached is the traceback. New to pytorch, so any pointers as to where I could go to fix this? Currently only using 1 GPU as I was getting DDP errors from lightning using two, but will be going back later once I can make sure the training can run properly. Any idea what this tensor has in dim3? The dataset I'm using is simply a train/test set of mp4s.

Used this command with basic arguments. python3 scripts/train_videogpt.py --data_path <custom_data_path> --gpus 1 --batch_size 16 --vqvae <custom_data_path/epoch=0-step=11702.ckpt> --max_steps 200000

Traceback (most recent call last):
  File "scripts/train_videogpt.py", line 43, in <module>
    main()
  File "scripts/train_videogpt.py", line 39, in main
    trainer.fit(model, data)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in validation_step
    return self.lightning_module.validation_step(*args, **kwargs)
  File "/home/humdaan/Documents/school/AME494/videoGPT/VideoGPT/videogpt/gpt.py", line 158, in validation_step
    loss = self.training_step(batch, batch_idx)
  File "/home/humdaan/Documents/school/AME494/videoGPT/VideoGPT/videogpt/gpt.py", line 154, in training_step
    loss, _ = self(x, targets, cond)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/humdaan/Documents/school/AME494/videoGPT/VideoGPT/videogpt/gpt.py", line 131, in forward
    h = self.attn_stack(h, cond, decode_step, decode_idx)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/humdaan/Documents/school/AME494/videoGPT/VideoGPT/videogpt/attention.py", line 55, in forward
    x = self.pos_embd(x, decode_step, decode_idx)
  File "/home/humdaan/miniconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/humdaan/Documents/school/AME494/videoGPT/VideoGPT/videogpt/attention.py", line 493, in forward
    return x + embs
RuntimeError: The size of tensor a (32) must match the size of tensor b (16) at non-singleton dimension 3

opened by humishum 3
Issue with memory only with your code
I got the following error while running train_vqvae.py:

OSError: [WinError 1455] The paging file is too small for this operation to complete. Error loading "C:\Users\sebdb\Anaconda3\envs\pythonProject\lib\site-packages\torch\lib\cusolver64_10.dll" or one of its dependencies

This is an issue related to the memory (I have 32GB of ram with a GTX 1080TI (11GB )).

I just tried loading UCF101 with data = VideoData(args). This works, but then the error appears with:

data.train_dataloader()
data.test_dataloader()

Those are the parameters I used:

parser.add_argument('--sequence_length', type=int, default=8)
parser.add_argument('--resolution', type=int, default=64)
parser.add_argument('--batch_size', type=int, default=1)
parser.add_argument('--num_workers', type=int, default=0)
# parser.add_argument('--num_workers', type=int, default=1)  # also gives the same error

Here my paging file size: 98GB...
opened by Scienceseb 1

Running on 4 V100s, but epoch stays at 0

I'm running the model on 4 V100 GPUs using SLURM and the following sbatch script:

#!/bin/bash
#SBATCH -p gpu
#SBATCH --gres=gpu:4
#SBATCH -p reserved --reservation=slerman-20210821 -t 2-00:00:00 
#SBATCH -t 5-00:00:00 -o ./vgpt.log -J vgpt
#SBATCH --mem=50gb 
#SBATCH -C V100
source /scratch/slerman/miniconda/bin/activate vid
python3 train_videogpt.py --max_steps 200000 --vqvae ucf101_stride4x4x4 --data_path ./datasets/ucf101/ --gpus 4

However, after a full day, the logs still show the model stuck at epoch 0.

Do you know what's going wrong?

Thank you.

opened by slerman12 6

FVD score not conform to what's reported in the paper

Hi, I tested the Bair pre-trained VideoGPT model, but your evaluation script reported FVD to be 1000+, however FVD* was around 100, probably there's a mistake with your evaluation script?

opened by Gabriel-Huang 5

Owner

Wilson Yan

1st year PhD interested in unsupervised learning and reinforcement learning

