PyTorch package for the discrete VAE used for DALL·E.

OpenAI

Last update: Jan 5, 2023

Related tags

Deep Learning DALL-E

Overview

[Blog] [Paper] [Model Card] [Usage]

This is the official PyTorch package for the discrete VAE used for DALL·E.

Installation

Before running the example notebook, you will need to install the package using

pip install git+https://github.com/openai/DALL-E.git
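A quick way to confirm that the install succeeded is to check that the module can be located. This is a minimal sketch: the import name `dall_e` is inferred from the traceback paths quoted further down this page, and `is_installed` is a hypothetical helper used only for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package installs under the import name "dall_e".
print("dall_e installed:", is_installed("dall_e"))
```

If this prints `False`, the `pip install` command above was probably run in a different Python environment from the one used by the notebook kernel.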

Comments

How to sample or generate a new image?

Hi, this is great work! But I am a little confused about how to generate a new image. Should I provide the sentence tokens and then use them to predict the image tokens? And where is the noise injected? I would very much appreciate it if you could answer these questions. Thank you!
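One possible reading of this question, sketched under assumptions: in an autoregressive sampler, the text tokens form a prefix, image tokens are drawn one at a time, and the randomness typically comes from sampling each token from the predicted categorical distribution rather than from a separate noise vector. The prior model is not part of this package, so `toy_next_token_logits` below is a hypothetical stand-in for it, and the vocabulary size of 16 is chosen only to keep the demo small.

```python
import math
import random


def sample_categorical(logits, rng, temperature=1.0):
    """Draw one index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def toy_next_token_logits(prefix, vocab_size):
    """Hypothetical stand-in for a trained prior: favors tokens near last + 1."""
    target = (prefix[-1] + 1) % vocab_size if prefix else 0
    return [-0.5 * abs(target - token) for token in range(vocab_size)]


def sample_image_tokens(text_tokens, num_image_tokens, vocab_size, seed=0):
    """Append exactly num_image_tokens sampled tokens after the text prefix."""
    rng = random.Random(seed)
    sequence = list(text_tokens)
    for _ in range(num_image_tokens):
        logits = toy_next_token_logits(sequence, vocab_size)
        sequence.append(sample_categorical(logits, rng))
    return sequence[len(text_tokens):]


image_tokens = sample_image_tokens([3, 1, 4], num_image_tokens=32 * 32, vocab_size=16)
print(len(image_tokens))
```

Because the loop runs a fixed number of times, the number of generated image tokens is set by the caller rather than by the model.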

opened by JohnDreamer 36

Error on executing usage.ipynb notebook on a cuda:0 device

I changed this line as suggested to use the GPU:

# This can be changed to a GPU, e.g. 'cuda:0'.
dev = torch.device('cuda:0')

And I tried to execute the notebook. I got the following error message:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/tmp/ipykernel_11/3257249919.py in <module>
      1 import torch.nn.functional as F
      2 
----> 3 z_logits = enc(x)
      4 z = torch.argmax(z_logits, axis=1)
      5 z = F.one_hot(z, num_classes=enc.vocab_size).permute(0, 3, 1, 2).float()

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.8/site-packages/dall_e/encoder.py in forward(self, x)
     91                         raise ValueError('input must have dtype torch.float32')
     92 
---> 93                 return self.blocks(x)

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/container.py in forward(self, input)
    139     def forward(self, input):
    140         for module in self:
--> 141             input = module(input)
    142         return input
    143 

/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
   1108         if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1109                 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1110             return forward_call(*input, **kwargs)
   1111         # Do not call functions when jit is used
   1112         full_backward_hooks, non_full_backward_hooks = [], []

/opt/conda/lib/python3.8/site-packages/dall_e/utils.py in forward(self, x)
     41                         w, b = self.w, self.b
     42 
---> 43                 return F.conv2d(x, w, b, padding=(self.kw - 1) // 2)
     44 
     45 def map_pixels(x: torch.Tensor) -> torch.Tensor:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cpu and cuda:0! (when checking argument for argument weight in method wrapper___slow_conv2d_forward)
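This error message usually means that the model weights and the input tensor live on different devices. A likely cause here, stated as an assumption, is that the model was loaded onto `cuda:0` while the input `x` was still on the CPU. The sketch below reproduces the pattern with a plain `nn.Conv2d` as a stand-in for `enc` and moves the input to the model's device before the forward pass; it falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn

# Use the GPU if one is present; otherwise stay on the CPU.
dev = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Stand-in for the encoder: any module whose weights live on `dev`.
model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(dev)

# Tensors are created on the CPU by default.
x = torch.rand(1, 3, 32, 32)

# Move the input to the same device as the weights before calling the model.
x = x.to(dev)

with torch.no_grad():
    y = model(x)

print(tuple(y.shape), y.device)
```

In the notebook, the equivalent change would be to call `x.to(dev)` on the preprocessed image before `enc(x)`.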

opened by esparig 4

KL Loss

I am having trouble getting the dVAE to train properly if I include the KL loss term with a uniform prior over the visual tokens. Has anyone here had similar experiences or problems? The paper mentions an increasing schedule for the KL weight factor, but I can't get it to work properly, and results are always better if I set the KL loss to zero altogether.

Maybe someone can help?
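For reference, the KL divergence from a categorical distribution q over K codes to a uniform prior reduces to log K minus the entropy of q. The sketch below computes that term for a single position and pairs it with a weight that ramps up over training. The linear ramp, the warm-up length, and the maximum weight are assumptions chosen for illustration, not the paper's settings.

```python
import math


def kl_to_uniform(probs):
    """KL(q || uniform) in nats for a categorical q over len(probs) codes."""
    num_codes = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return math.log(num_codes) - entropy


def kl_weight(step, warmup_steps, max_weight):
    """Linearly increase the KL weight from 0 to max_weight (assumed schedule)."""
    return max_weight * min(1.0, step / warmup_steps)


uniform_q = [0.25, 0.25, 0.25, 0.25]
peaked_q = [1.0, 0.0, 0.0, 0.0]
print(kl_to_uniform(uniform_q))  # zero when q is already uniform
print(kl_to_uniform(peaked_q))   # log(4) when q is one-hot
print(kl_weight(step=50, warmup_steps=100, max_weight=2.0))
```

The term is largest for one-hot encoder outputs, so a large weight early in training pushes the encoder toward uniform, uninformative codes, which is one possible reason a gradual ramp is used.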

opened by CDitzel 4

Hyperparameters of the bottleneck

Thanks for releasing the paper as well as the code!

Could you give some hints on the hyperparameters of the bottleneck that might affect the performance?

The downsampling ratio. The original VQ-VAE paper uses only 4x downsampling (compared to 8x in DALL-E), and its generated images seem to lack global structure (I assume also because they didn't use a powerful prior model). Does a higher downsampling rate preserve the global structure better, or make it easier for the prior model to learn?

The codebook size is set to 2^13. Have you tried a smaller codebook? Presumably, as the codebook shrinks, the VAE struggles to reconstruct the image. What does a reconstruction with a small codebook look like? Is the texture still preserved but the global structure distorted, or something else?

I also have a question about the inference stage of the model. Are the image tokens sampled from the prior transformer autoregressively, or with some other search technique? Also, how do you ensure that exactly 32 * 32 image tokens are generated?
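The trade-off behind the downsampling question can be made concrete with a little arithmetic. The sketch below assumes a 256x256 input (inferred from the 32 * 32 token grid and the 8x factor mentioned above) and compares the token count and the total code length for 4x and 8x downsampling with a 2^13 codebook. `token_grid` is a hypothetical helper.

```python
import math


def token_grid(image_size, downsample_factor, codebook_size):
    """Return (grid side, number of tokens, total bits) for a square image."""
    side = image_size // downsample_factor
    num_tokens = side * side
    bits_per_token = math.log2(codebook_size)
    return side, num_tokens, num_tokens * bits_per_token


for factor in (4, 8):
    side, num_tokens, total_bits = token_grid(256, factor, 2 ** 13)
    print(f"{factor}x: {side}x{side} grid, {num_tokens} tokens, {total_bits:.0f} bits")
```

Under these assumptions, 8x downsampling gives the prior a sequence four times shorter than 4x downsampling, at the cost of compressing each image into a quarter of the bits.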
opened by cdjhz 4

Implementation Doubts

Although this codebase covers only the VAE part, I would appreciate help understanding a few components of the transformer part as well. In the paper and the blog post, you mention that you use the Child et al. paper. Can you elaborate on what you use as the block size (8, 16, or 32)? If we use a block size of 16, for example, how do you implement the convolutional kernel? It has gaps of only 1 block, but with sparse blocks of size 16 that doesn't seem to make sense.

Also, when training the GPT-style model, the loss and perplexity decrease, but how do you determine when the perplexity or loss of the 1.2B-parameter model is good enough? For example, is a loss of 4 or 5 good, or should it be below 1?
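One way to put such loss values in context is to compare them with a uniform-guess baseline. The sketch below assumes the loss is the average cross-entropy per token in nats and considers only the image-token vocabulary of 2^13 mentioned elsewhere on this page; both are assumptions, and the function names are hypothetical.

```python
import math


def perplexity(loss_nats):
    """Perplexity corresponding to an average cross-entropy given in nats."""
    return math.exp(loss_nats)


def uniform_baseline_loss(vocab_size):
    """Cross-entropy in nats of guessing uniformly over vocab_size tokens."""
    return math.log(vocab_size)


baseline = uniform_baseline_loss(2 ** 13)
print(f"uniform baseline: {baseline:.2f} nats per token")
for loss in (1.0, 4.0, 5.0):
    print(f"loss {loss}: perplexity {perplexity(loss):.1f}")
```

Under these assumptions, a loss of 4 or 5 is already well below the uniform baseline of about 9 nats, so a raw threshold such as "below 1" is not meaningful on its own.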

opened by shubhamag97 3

Help

Hello, I am brand new to the GitHub community and to coding. I have no idea how to install this, but I'm an artist, and it would be an excellent resource for non-copyrighted images. I know it's a lot to ask, but can someone please tell me how to install this code? I made my account specifically for this.

opened by DandelionBones 3

questions on notebook

I just downloaded the repo to my local file system, started Jupyter Notebook, and then opened and ran the notebook. I also downloaded the encoder and decoder to the same folder for ease of loading. It says that 'preprocess' is not defined, but it seems to be. Admittedly, I am a bit rusty. I am running Python 3.9 on macOS. Also, I may be way out of line with respect to the purpose, but I was expecting to see code that took natural language input (e.g. "Show me a penguin on snow") and then had DALL·E return an image.

Originally posted by @metaphorz in https://github.com/openai/DALL-E/issues/5#issuecomment-787310649

opened by metaphorz 3

an TypeError

PyTorch 1.7.1, torchvision 0.8.2. When I run the code, something seems wrong:

Traceback (most recent call last):
  File "E:/github/DALL-E-master/test.py", line 46, in <module>
    display(T.ToPILImage(mode='RGB')(x[0]))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torchvision\transforms\transforms.py", line 185, in __call__
    return F.to_pil_image(pic, self.mode)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torchvision\transforms\functional.py", line 202, in to_pil_image
    'not {}'.format(type(npimg)))
TypeError: Input pic must be a torch.Tensor or NumPy ndarray, not <class 'numpy.ndarray'>
Original image:

opened by shen51000 2

Why do we need logit_laplace_eps in utils.py?

What is the purpose of logit_laplace_eps here, given that both the input and the output are tensors in [0, 1]? https://github.com/openai/DALL-E/blob/5be4b236bc3ade6943662354117a0e83752cc322/dall_e/utils.py#L51
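A plausible explanation, offered as a sketch rather than a statement of the authors' intent: a logit-Laplace likelihood needs the logit of each pixel value, and the logit diverges at exactly 0 and 1. Shrinking the pixel range to [eps, 1 - eps] keeps it finite. The scalar functions below mirror what `map_pixels` and `unmap_pixels` appear to do on tensors; the value 0.1 for the epsilon and the exact formulas are assumptions recalled from the repository, not quoted from this page.

```python
import math

LOGIT_LAPLACE_EPS = 0.1  # assumed value of logit_laplace_eps


def map_pixels(x, eps=LOGIT_LAPLACE_EPS):
    """Map a pixel value from [0, 1] into [eps, 1 - eps]."""
    return (1.0 - 2.0 * eps) * x + eps


def unmap_pixels(x, eps=LOGIT_LAPLACE_EPS):
    """Invert map_pixels and clamp the result back into [0, 1]."""
    return min(1.0, max(0.0, (x - eps) / (1.0 - 2.0 * eps)))


def logit(p):
    """Log-odds of p; raises ValueError at p == 0 and ZeroDivisionError at p == 1."""
    return math.log(p / (1.0 - p))


for pixel in (0.0, 0.5, 1.0):
    mapped = map_pixels(pixel)
    print(pixel, mapped, logit(mapped), unmap_pixels(mapped))
```

Without the mapping, a pure black or pure white pixel would make `logit` fail, which is the case the epsilon guards against.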

opened by cientgu 2

Why the output dimension of the decoder is 2 * output_channels which is 6, not 3 (RGB)?

Hi, thanks for the code!

It's a simple question: I found that the output dimension of the decoder is set to 2 * self.output_channels, which is 6. I expected the output dimension to be 3 (RGB). Could you kindly explain the reason?

Thank you in advance!

https://github.com/openai/DALL-E/blob/3381ae9a10bafe3cb1c7c9fff554565ad7751e7f/dall_e/decoder.py#L82
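A possible explanation, which is an assumption based on the logit-Laplace likelihood described in the paper and not confirmed on this page: the decoder predicts two parameters per color channel, a location and a log-scale, so only the first three channels are needed to form the reconstructed image. The sketch below shows that split for a single pixel in plain Python; the function names and the epsilon of 0.1 are illustrative.

```python
import math


def sigmoid(value):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def split_decoder_stats(stats, output_channels=3):
    """Split 2 * output_channels values into (location, log-scale) halves."""
    if len(stats) != 2 * output_channels:
        raise ValueError("expected 2 * output_channels values")
    return stats[:output_channels], stats[output_channels:]


def reconstruct_pixel(stats, eps=0.1):
    """Turn the location half into RGB values in [0, 1]; the scale half is unused."""
    location, _log_scale = split_decoder_stats(stats)
    return [
        min(1.0, max(0.0, (sigmoid(m) - eps) / (1.0 - 2.0 * eps)))
        for m in location
    ]


pixel_stats = [0.0, 1.0, -1.0, 0.5, 0.5, 0.5]  # made-up decoder output for one pixel
print(reconstruct_pixel(pixel_stats))
```

Under this reading, the extra three channels matter for the training loss and can be ignored when only the reconstructed image is needed.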

opened by nashory 2

Hi! It works? I for ex cant launch it. Why? Look!

screens / installation

https://disk.yandex.ru/i/kCSzh6LodRkWlQ

Can you explain what I should do here and how to launch the app?

Thank you in advance.

opened by mazzzai 2

Owner

OpenAI

GitHub

PyTorch Autoencoders - Implementing a Variational Autoencoder (VAE) Series in Pytorch.

PyTorch Autoencoders Implementing a Variational Autoencoder (VAE) Series in Pytorch. Inspired by this repository Model List check model paper conferen

8 Nov 21, 2022

Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

DALL-E in Pytorch Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch. It will also contain CLIP for ranking the ge

5k Jan 4, 2023

Collection of generative models, e.g. GAN, VAE in Pytorch and Tensorflow.

Generative Models Collection of generative models, e.g. GAN, VAE in Pytorch and Tensorflow. Also present here are RBM and Helmholtz Machine. Note: Gen

7k Jan 2, 2023

Annotated, understandable, and visually interpretable PyTorch implementations of: VAE, BIRVAE, NSGAN, MMGAN, WGAN, WGANGP, LSGAN, DRAGAN, BEGAN, RaGAN, InfoGAN, fGAN, FisherGAN

Overview PyTorch 0.4.1 | Python 3.6.5 Annotated implementations with comparative introductions for minimax, non-saturating, wasserstein, wasserstein g

471 Dec 16, 2022

Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021

ACTOR Official Pytorch implementation of the paper "Action-Conditioned 3D Human Motion Synthesis with Transformer VAE", ICCV 2021. Please visit our we

248 Dec 23, 2022

Open-AI's DALL-E for large scale training in mesh-tensorflow.

DALL-E in Mesh-Tensorflow [WIP] Open-AI's DALL-E in Mesh-Tensorflow. If this is similarly efficient to GPT-Neo, this repo should be able to train mode

432 Dec 16, 2022

RuDOLPH: One Hyper-Modal Transformer can be creative as DALL-E and smart as CLIP

[Paper] [Хабр] [Model Card] [Colab] [Kaggle] RuDOLPH ☃️ One Hyper-Modal Tr

230 Dec 31, 2022

CVPR 2021: "Generating Diverse Structure for Image Inpainting With Hierarchical VQ-VAE"

Diverse Structure Inpainting ArXiv | Papar | Supplementary Material | BibTex This repository is for the CVPR 2021 paper, "Generating Diverse Structure

152 Nov 4, 2022

VideoGPT: Video Generation using VQ-VAE and Transformers

VideoGPT: Video Generation using VQ-VAE and Transformers [Paper][Website][Colab][Gradio Demo] We present VideoGPT: a conceptually simple architecture

470 Dec 30, 2022

Official code for "End-to-End Optimization of Scene Layout" -- including VAE, Diff Render, SPADE for colorization (CVPR 2020 Oral)

End-to-End Optimization of Scene Layout Code release for: End-to-End Optimization of Scene Layout CVPR 2020 (Oral) Project site, Bibtex For help conta

41 Dec 9, 2022

The official implementation of VAENAR-TTS, a VAE based non-autoregressive TTS model.

VAENAR-TTS This repo contains code accompanying the paper "VAENAR-TTS: Variational Auto-Encoder based Non-AutoRegressive Text-to-Speech Synthesis". Sa

138 Oct 28, 2022

A library built upon PyTorch for building embeddings on discrete event sequences using self-supervision

pytorch-lifestream a library built upon PyTorch for building embeddings on discrete event sequences using self-supervision. It can process terabyte-si

103 Dec 17, 2022

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation

DCT-Mask: Discrete Cosine Transform Mask Representation for Instance Segmentation This project hosts the code for implementing the DCT-MASK algorithms

57 Nov 27, 2022

Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

ResDAVEnet-VQ Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech What is in this repo? M

21 Aug 23, 2022

This is 2nd term discrete maths project done by UCU students that uses backtracking to solve various problems.

Backtracking Project Sponsors This is a project made by UCU students: Olha Liuba - crossword solver implementation Hanna Yershova - sudoku solver impl

4 Oct 17, 2021

An official reimplementation of the method described in the INTERSPEECH 2021 paper - Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

253 Jan 6, 2023
