Implementation / replication of DALL-E, OpenAI's Text to Image Transformer, in Pytorch

Overview

DALL-E in Pytorch

Implementation / replication of DALL-E (paper), OpenAI's Text to Image Transformer, in Pytorch. It will also contain CLIP for ranking the generations.

Sid, Ben, and Aran over at Eleuther AI are working on DALL-E for Mesh Tensorflow! Please lend them a hand if you would like to see DALL-E trained on TPUs.

Yannic Kilcher's video

Before we replicate this, we can settle for Deep Daze or Big Sleep

Open In Colab Train in Colab

Status

  • Hannu has managed to train a small 6 layer DALL-E on a dataset of just 2000 landscape images! (2048 visual tokens)

  • Kobiso, a research engineer from Naver, has trained on both the CUBS and COCO dataset here, using full and deepspeed sparse attention
  • afiaka87 has managed one epoch using a 32 layer reversible DALL-E here

Install

$ pip install dalle-pytorch

Usage

Train VAE

import torch
from dalle_pytorch import DiscreteVAE

vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,           # number of downsamples - ex. 256 / (2 ** 3) = (32 x 32 feature map)
    num_tokens = 8192,        # number of visual tokens. in the paper, they used 8192, but could be smaller for downsized projects
    codebook_dim = 512,       # codebook dimension
    hidden_dim = 64,          # hidden dimension
    num_resnet_blocks = 1,    # number of resnet blocks
    temperature = 0.9,        # gumbel softmax temperature, the lower this is, the harder the discretization
    straight_through = False, # straight-through for gumbel softmax. unclear if it is better one way or the other
)

images = torch.randn(4, 3, 256, 256)

loss = vae(images, return_loss = True)
loss.backward()

# train with a lot of data to learn a good codebook

Train DALL-E with pretrained VAE from above

import torch
from dalle_pytorch import DiscreteVAE, DALLE

vae = DiscreteVAE(
    image_size = 256,
    num_layers = 3,
    num_tokens = 8192,
    codebook_dim = 1024,
    hidden_dim = 64,
    num_resnet_blocks = 1,
    temperature = 0.9
)

dalle = DALLE(
    dim = 1024,
    vae = vae,                  # automatically infer (1) image sequence length and (2) number of image tokens
    num_text_tokens = 10000,    # vocab size for text
    text_seq_len = 256,         # text sequence length
    depth = 12,                 # should aim to be 64
    heads = 16,                 # attention heads
    dim_head = 64,              # attention head dimension
    attn_dropout = 0.1,         # attention dropout
    ff_dropout = 0.1            # feedforward dropout
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = dalle(text, images, mask = mask, return_loss = True)
loss.backward()

# do the above for a long time with a lot of data ... then

images = dalle.generate_images(text, mask = mask)
images.shape # (4, 3, 256, 256)

OpenAI's Pretrained VAE

You can also skip the training of the VAE altogether, using the pretrained model released by OpenAI! The wrapper class should take care of downloading and caching the model for you auto-magically.

import torch
from dalle_pytorch import OpenAIDiscreteVAE, DALLE

vae = OpenAIDiscreteVAE()       # loads pretrained OpenAI VAE

dalle = DALLE(
    dim = 1024,
    vae = vae,                  # automatically infer (1) image sequence length and (2) number of image tokens
    num_text_tokens = 10000,    # vocab size for text
    text_seq_len = 256,         # text sequence length
    depth = 1,                  # should aim to be 64
    heads = 16,                 # attention heads
    dim_head = 64,              # attention head dimension
    attn_dropout = 0.1,         # attention dropout
    ff_dropout = 0.1            # feedforward dropout
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = dalle(text, images, mask = mask, return_loss = True)
loss.backward()

Taming Transformer's Pretrained VQGAN VAE

You can also use the pretrained VAE offered by the authors of Taming Transformers! Currently only the VAE with a codebook size of 1024 is offered, with the hope that it may train a little faster than OpenAI's, which has a size of 8192.

In contrast to OpenAI's VAE, it also has an extra layer of downsampling, so the image sequence length is 256 instead of 1024 (this will lead to a 16 reduction in training costs, when you do the math). Whether it will generalize as well as the original DALL-E is up to the citizen scientists out there to discover.

from dalle_pytorch import VQGanVAE1024

vae = VQGanVAE1024()

# the rest is the same as the above example

Ranking the generations

Train CLIP

import torch
from dalle_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 10000,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    num_visual_tokens = 512,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
)

text = torch.randint(0, 10000, (4, 256))
images = torch.randn(4, 3, 256, 256)
mask = torch.ones_like(text).bool()

loss = clip(text, images, text_mask = mask, return_loss = True)
loss.backward()

To get the similarity scores from your trained Clipper, just do

images, scores = dalle.generate_images(text, mask = mask, clip = clip)

scores.shape # (2,)
images.shape # (2, 3, 256, 256)

# do your topk here, in paper they sampled 512 and chose top 32

Or you can just use the official CLIP model to rank the images from DALL-E

Scaling depth

In the blog post, they used 64 layers to achieve their results. I added reversible networks, from the Reformer paper, in order for users to attempt to scale depth at the cost of compute. Reversible networks allow you to scale to any depth at no memory cost, but a little over 2x compute cost (each layer is rerun on the backward pass).

Simply set the reversible keyword to True for the DALLE class

dalle = DALLE(
    dim = 1024,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,
    heads = 16,
    reversible = True  # <-- reversible networks https://arxiv.org/abs/2001.04451
)

Sparse Attention

The blogpost alluded to a mixture of different types of sparse attention, used mainly on the image (while the text presumably had full causal attention). I have done my best to replicate these types of sparse attention, on the scant details released. Primarily, it seems as though they are doing causal axial row / column attention, combined with a causal convolution-like attention.

By default DALLE will use full attention for all layers, but you can specify the attention type per layer as follows.

  • full full attention

  • axial_row axial attention, along the rows of the image feature map

  • axial_col axial attention, along the columns of the image feature map

  • conv_like convolution-like attention, for the image feature map

The sparse attention only applies to the image. Text will always receive full attention, as said in the blogpost.

dalle = DALLE(
    dim = 1024,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,
    heads = 16,
    reversible = True,
    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')  # cycles between these four types of attention
)

Deepspeed Sparse Attention

You can also train with Microsoft Deepspeed's Sparse Attention, with any combination of dense and sparse attention that you'd like. However, you will have to endure the installation process.

First, you need to install Deepspeed with Sparse Attention

$ sh install_deepspeed.sh

Next, you need to install the pip package triton

$ pip install triton

If both of the above succeeded, now you can train with Sparse Attention!

dalle = DALLE(
    dim = 512,
    vae = vae,
    num_text_tokens = 10000,
    text_seq_len = 256,
    depth = 64,
    heads = 8,
    attn_types = ('full', 'sparse')  # interleave sparse and dense attention for 64 layers
)

Training

This section will outline how to train the discrete variational autoencoder as well as the final multi-modal transformer (DALL-E). We are going to use Weights & Biases for all the experiment tracking.

(You can also do everything in this section in a Google Colab, link below)

Open In Colab Train in Colab

$ pip install wandb

Followed by

$ wandb login

VAE

To train the VAE, you just need to run

$ python train_vae.py --image_folder /path/to/your/images

If you installed everything correctly, a link to the experiments page should show up in your terminal. You can follow your link there and customize your experiment, like the example layout below.

You can of course open up the training script at ./train_vae.py, where you can modify the constants, what is passed to Weights & Biases, or any other tricks you know to make the VAE learn better.

Model will be saved periodically to ./vae.pt

In the experiment tracker, you will have to monitor the hard reconstruction, as we are essentially teaching the network to compress images into discrete visual tokens for use in the transformer as a visual vocabulary.

Weights and Biases will allow you to monitor the temperature annealing, image reconstructions (encoder and decoder working properly), as well as to watch out for codebook collapse (where the network decides to only use a few tokens out of what you provide it).

Once you have trained a decent VAE to your satisfaction, you can move on to the next step with your model weights at ./vae.pt.

DALL-E

Now you just have to invoke the ./train_dalle.py script, indicating which VAE model you would like to use, as well as the path to your folder if images and text.

The dataset I am currently working with contains a folder of images and text files, arbitraily nested in subfolders, where text file name corresponds with the image name, and where each text file contains multiple descriptions, delimited by newlines. The script will find and pair all the image and text files with the same names, and randomly select one of the textual descriptions during batch creation.

ex.

📂image-and-text-data
 ┣ 📜cat.png
 ┣ 📜cat.txt
 ┣ 📜dog.jpg
 ┣ 📜dog.txt
 ┣ 📜turtle.jpeg
 ┗ 📜turtle.txt

ex. cat.txt

A black and white cat curled up next to the fireplace
A fireplace, with a cat sleeping next to it
A black cat with a red collar napping

If you have a dataset with its own directory structure for tying together image and text descriptions, do let me know in the issues, and I'll see if I can accommodate it in the script.

$ python train_dalle.py --vae_path ./vae.pt --image_text_folder /path/to/data

You likely will not finish DALL-E training as quickly as you did your Discrete VAE. To resume from where you left off, just run the same script, but with the path to your DALL-E checkpoints.

$ python train_dalle.py --dalle_path ./dalle.pt --image_text_folder /path/to/data

DALL-E with OpenAI's VAE

You can now also train DALL-E without having to train the Discrete VAE at all, courtesy to their open-sourcing their model. You simply have to invoke the train_dalle.py script without specifying the --vae_path

$ python train_dalle.py --image_text_folder /path/to/coco/dataset

Generation

Once you have successfully trained DALL-E, you can then used the saved model for generation!

$ python generate.py --dalle_path ./dalle.pt --text 'fireflies in a field under a full moon'

You should see your images saved as ./outputs/{your prompt}/{image number}.jpg

Citations

@misc{ramesh2021zeroshot,
    title   = {Zero-Shot Text-to-Image Generation}, 
    author  = {Aditya Ramesh and Mikhail Pavlov and Gabriel Goh and Scott Gray and Chelsea Voss and Alec Radford and Mark Chen and Ilya Sutskever},
    year    = {2021},
    eprint  = {2102.12092},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal},
    year   = {2021}
}
@misc{kitaev2020reformer,
    title   = {Reformer: The Efficient Transformer},
    author  = {Nikita Kitaev and Łukasz Kaiser and Anselm Levskaya},
    year    = {2020},
    eprint  = {2001.04451},
    archivePrefix = {arXiv},
    primaryClass = {cs.LG}
}
@misc{esser2021taming,
    title   = {Taming Transformers for High-Resolution Image Synthesis},
    author  = {Patrick Esser and Robin Rombach and Björn Ommer},
    year    = {2021},
    eprint  = {2012.09841},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

Those who do not want to imitate anything, produce nothing. - Dali

Comments
  • VQGanVAE1024

    VQGanVAE1024 "vae must be an instance of DiscreteVAE"

    @lucidrains I believe the relevant line is here:

    https://github.com/lucidrains/DALLE-pytorch/blob/2268864941d8eef2ba73a4488fe05673d447d493/dalle_pytorch/dalle_pytorch.py#L306

    I tried adding it in myself, but it needs the taming imports and I'm not familiar with those.

    opened by afiaka87 64
  • More

    More "OpenAI Blog Post" Training | Depth 32 | Heads 8 | LR 5e-4

    Edit: Moved to discussions: https://github.com/lucidrains/DALLE-pytorch/discussions/106

    Hey, all. Some of you might know I'm practicing and learning about machine learning with dalle-pytorch and a dataset consisting of the images OpenAI presented in the DALLE blog post. I honestly dont have the money to train this whole dataset,

    edit: this is no longer true. Using the 1024 VQGAN from the "Taming Transformers" research, it's now quite possible to train a full dataset of 1,000,000 image-text pairs and i'm doing just that. I hope to have it finished in about a week. I assume someone else will release a dalle-pytorch trained properly on COCO and other image sets before then, but if they dont, check here for updates.

    Anway, it ran for ~36000 steps. As you can see it...still really likes mannequins. I'm considering removing them from the dataset. But also, you'll notice that the network has actually got a decent idea of the sort of general colors that belong in types of prompts.

    Some Samples from Near the End of Training

    results

    Every Text-Image Reconstruction

    https://wandb.ai/afiaka87/dalle_pytorch_live_training/reports/dalle-pytorch-Test-Run-2--Vmlldzo1MzM5MjQ

    Deliverables (my train_dalle.py)

    https://gist.github.com/afiaka87/850fb3cc48edde8a7ed4cb1ce53b6bd2

    This has some code in it that actually manages to deal with truncated images via Try Catch. Apparently detecting a corrupted PNG is harder than P vs NP. PIL's imverify() function doesnt catch all of them. Python's built in imghdr library doesn't catch all of them either. So you just sort of catch OSError and return an item further along. Works well enough.

    Parameters

    SHUFFLE = True
    EPOCHS = 28 # This wound up being less than a single epoch, of course. 
    BATCH_SIZE = 16
    LEARNING_RATE = 0.0005 # I found this learning rate to be more suitable than 0.0003 in my hyperparameter sweep post
    GRAD_CLIP_NORM = 0.5
    DEPTH = 32
    HEADS = 8
    MODEL_DIM = 512
    TEXT_SEQ_LEN = 256
    DIM_HEAD = 64
    REVERSIBLE = True,
    ATTN_TYPES = ('full')
    

    Dataset Description

    https://github.com/lucidrains/DALLE-pytorch/issues/61#issuecomment-796663342

    Just for more info on the dataset itself, it is roughly 1,100,000 256x256 image-text pairs that were generated by OpenAI's DALL-E. They presented roughly ~30k unique text prompts of which they posted the top 32 of 512 generations on https://openai.com/blog/dall-e/. Many images were corrupt, and not every prompt has a full 32 examples, but the total number of images winds up being about 1.1 million. If you look at many of the examples on that page, you'll see that DALL-E (in that form at least), can and will make mistakes. These mistakes are also in this dataset. Anyway I'm just messing around having fun training and what not. This is definitely not going to produce a good model or anything.

    There are also a large number of images in the dataset which are intended to be used with the "mask" feature. I don't know if that's possible yet in DALLE-pytorch though. Anyway, that can't be helping much.

    opened by afiaka87 31
  • Huge GPU memory consumption when using DeepSpeed

    Huge GPU memory consumption when using DeepSpeed

    So I decided to try the recently introduced DeepSpeed training on 4x V100 GPUs. Previously I was training on a single T4 or V100 (both have 16GB RAM) with certain model parameters (1024 model dim, 16 layers, 8 heads, 64 head dim if that's important). I was able to use a batch size of 2 with this configuration (256px images and no texts).

    I tried to launch distributed training with DeepSpeed with the same model parameters, but to my surprise, it gave an OOM error. Using binary search I found that it's possible to train the model using DeepSpeed but only when I set the number of layers to 4 (4x reduction) and use only a single sample per batch per GPU (so, 2x reduction).

    Am I missing something? I'm using the latest master branch with some minor changes that are very unlikely to cause such behavior.

    Any help/suggestion is very much appreciated!

    opened by ex4sperans 29
  • Add tensorboard support.

    Add tensorboard support.

    Weights and biases is cool - but they don't really support the offline usecase as well as tensorboard does currently. This fix simply adds tensorboard as a fallback option - W&B will still run if it's available.

    opened by afiaka87 22
  • Trained for 17K iters on COCO2014, OpenImages and OpenAI's Blog Images

    Trained for 17K iters on COCO2014, OpenImages and OpenAI's Blog Images

    In case you haven't read my usual disclaimer: this data set is weird. The repetition in the OpenAI images causes those to be highly overfit (mannequins) and the remainder of the dataset is much more diverse, which dalle-pytorch doesnt manage to capture very well here. Also, keep in mind - this isn't even a full epoch. Just having fun. Try not to evaluate this as representative of dalle-pytorch's current capabilities.

    closetotheend

    Hey everyone. @lucidrains got the the new, lighter pretrained VAE from the taming-transformers group recently. It uses substantially less memory and compute. I decided to take all the datasets ive collected thus far, put them in a single folder on an A100, and train dalle-pytorch for several hours.

    Here are the results:

    https://wandb.ai/afiaka87/OpenImagesV6/reports/Training-on-COCO-OpenImage-Blogpost--Vmlldzo1NDE3NjU

    I'm exhausted so that's all for now, but please click the link and have a look at the thousands of reconstructions it made (and the horrible captions from the "Localized Narratives" dataset I got from Google). I'll be updating this post with more info throughout the day.

    opened by afiaka87 19
  • New error using the new update.

    New error using the new update.

    [2021-09-13 11:39:11,114] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.5.1, git-hash=unknown, git-branch=unknown [2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed groups [2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed model parallel group with size 1 [2021-09-13 11:39:11,216] [INFO] [logging.py:68:log_dist] [Rank 0] initializing deepspeed expert parallel group with size 1 [2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert data parallel process group with ranks: [0] [2021-09-13 11:39:11,217] [INFO] [logging.py:68:log_dist] [Rank 0] creating expert parallel process group with ranks: [0] [2021-09-13 11:39:11,240] [INFO] [engine.py:198:init] DeepSpeed Flops Profiler Enabled: False Traceback (most recent call last): File "train_dalle.py", line 497, in config_params=deepspeed_config, File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/distributed_backend.py", line 152, in distribute **kwargs, File "/home/valterjordan/DALLE-pytorch/dalle_pytorch/distributed_backends/deepspeed_backend.py", line 162, in _distribute **kwargs, File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/init.py", line 141, in initialize config_params=config_params) File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 204, in init self.training_dataloader = self.deepspeed_io(training_data) File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 1188, in deepspeed_io data_parallel_rank=data_parallel_rank) File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/deepspeed/runtime/dataloader.py", line 52, in init rank=data_parallel_rank) File "/home/valterjordan/miniconda3/envs/dalle_env/lib/python3.7/site-packages/torch/utils/data/distributed.py", line 87, in init self.num_samples = math.ceil(len(self.dataset) / self.num_replicas) # type: ignore TypeError: object of type 'Processor' has no len()

    opened by jaoeded 18
  • Ideas for datasets to uses? (WIT just came out)

    Ideas for datasets to uses? (WIT just came out)

    Hey all,

    I'm compiling a list of the various datasets we'll need and how to download them:

    Keep in mind, not all of these datasets ship with captions. However many of them do ship with a class descriptor of some type. I've only done mild testing with this, but usually you can just generate labels by doing something like "an image of {class_name}". Not sure what the best way to go about that would be though.

    https://github.com/lucidrains/DALLE-pytorch/discussions/109

    As it stands, this is turning out to be humongous. I just added the new Wikipedia dataset (11 million images).

    Does anyone know of other captioned datasets we could use?

    opened by afiaka87 17
  • refactor

    refactor

    We need a refactor. Maybe a redesign (breaking backwards compatibility). The codebase has accrued just a little too much tech debt. Particularly in the train_dalle.py and README.md

    Training loop

    • The training loop needs to be refactored into a function in the style of pytorch lightning. This dramatically improves readability and makes early returns possible which is a cleaner solution than raising an Exception.

    Argument parsing

    • All arg-parsing need to be well-consolidated into a single file. Perhaps another file at the root of the repo as it's still something people are going to want to open up quickly.

    Style guide and automatic formatting in CI

    • This one's easy - we need a consistent target format for multiple contributors to target. I like PEP8 but honestly if we just pick one I'll be happy we're using a linter/formatter at all. Alternatives are black/yapf. I could integrate this into Github Actions so that code is automatically formatted upon merge to main.

    Configuration:

    • we should provide reasonable defaults deepspeed_config.json in a folder at the root called "config" and encourage people to update that file rather than update the deepspeed_config variable in Python which is a bit of a hassle to get to and find each time. A comment at the top of each file with a links to:
    • https://deepspeed.readthedocs.io/en/latest/schedulers.html # Various (otherwise) undocumented schedulers from DeepSpeed
    • https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training # Options specific to zero.

    README.md

    • The readme is a bit long; but that's alright. I prefer the idea of a single source of truth with documentation; especially one that doesn't require the internet after you've downloaded it e.g. the wiki
    • It just needs a table of contents and a "quick start" at the top. I do like the idea of continuing to give progress updates on the current best at the top as well. So let's keep that. taming-transformers has a wonderfully clean README.md with a table of contents which could be used as a basis.

    Feel free to criticize the current design (please be constructive and polite - we're all lazy coders at the end of the day) so we can decide if breaking backwards compatibility for a new design would be worthwhile.

    opened by afiaka87 12
  •  DALL-E Image Embedding

    DALL-E Image Embedding

    A token is any symbol from a discrete vocabulary; for humans, each English letter is a token from a 26-letter alphabet. DALL·E’s vocabulary has tokens for both text and image concepts. Specifically, each image caption is represented using a maximum of 256 BPE-encoded tokens with a vocabulary size of 16384, and the image is represented using 1024 tokens with a vocabulary size of 8192. The images are preprocessed to 256x256 resolution during training. Similar to VQVAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation. We found that training using the relaxation obviates the need for an explicit codebook, EMA loss, or tricks like dead code revival, and can scale up to large vocabulary sizes.

    We can use openAI CLIP implementation to filter the good samples, but I would assume they didn*t used it to create the embedding. So therefore we could assume they used some kind of VQ-VAE? For example https://github.com/openai/vdvae or https://github.com/NVlabs/NVAE ?

    So this GIT should have 2-step Training Step 1 - Pretrained a autoencoder to tokenize the images. We could go small first and do it with a 16x16 Embedding and a relatively low vocab size. (2k-4k?) Step 2 - Train the Decoder-Transformer. Here we should have a preprocessing step to convert the image-text pairs to tokens. Some Huggingface tokenizer for Text and the encoder of VQ-VAE for the image.

    We hope that someone will offer a pretrained model weights for CLIP to remove bad samples during Inference. If it was trained on something like the Microsoft Dataset, then it should be general enough for most usecases.

    Some Open Questions:

    • They use Sparse Attention for the Image Part. We could just use full-attention for the whole network for now or go full sparse?
    • If its not a VQ-VAE, which GANs work well with discrete latent values?
    • If its VQ-VAE, its some kind of Hierarchical one. Does DALL-E model the first latent value and the rest is just randomly sampled during reconstructions?
    opened by adrian-spataru 12
  • stable_softmax, wanb_entity, visible discord, replace buggy colab

    stable_softmax, wanb_entity, visible discord, replace buggy colab

    edit: alright rom1504 is being awesome and implementing things the proper modular way for us. I'm gonna focus this PR on a few outstanding issues

    Seems the CompVis team hasn't updated their PyPi because their latest pip wheel still doesn't contain the necessary GumbelVQ class. I've had to install this as a submodule to taming-transformers to get it to work which doesnt feel quite right.

    opened by afiaka87 11
  • Out of memory errors no matter what parameters with deep speed

    Out of memory errors no matter what parameters with deep speed

    Using these fairly lightweight parameters:

    BATCH_SIZE = 8
    LEARNING_RATE = 3e-4
    
    MODEL_DIM = 512
    TEXT_SEQ_LEN = 128
    DEPTH = 4
    HEADS = 4
    DIM_HEAD = 64
    REVERSIBLE = True
    LOSS_IMG_WEIGHT = 7
    

    A single V100 GPU only needs 6356MB of RAM.

    [0] Tesla V100-SXM2-16GB | 57'C, 81 % | 6356 / 16160 MB |

    When run with deepspeed - memory usage immediately balloons to filling up each GPU's 16 GiB of RAM until finally running out of memory before a single iteration completes.

    Aside - please dont take these personal ha - we have pinned versions and what not - just trying to be thorough so I can come back and try to fix them myself.

    Traceback (most recent call last): File "train_dalle.py", line 271, in loss = distr_dalle(text, images, mask = mask, return_loss = True) File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/opt/conda/lib/python3.7/site-packages/deepspeed/runtime/engine.py", line 914, in forward loss = self.module(*inputs, **kwargs) File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/root/DALLE-pytorch/dalle_pytorch/dalle_pytorch.py", line 495, in forward loss_img = F.cross_entropy(logits[:, :, self.text_seq_len:], labels[:, self.text_seq_len:], ignore_index=0) File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 2422, in cross_entropy return nll_loss(log_softmax(input, 1), target, weight, None, ignore_index, None, reduction) File "/opt/conda/lib/python3.7/site-packages/torch/nn/functional.py", line 1591, in log_softmax ret = input.log_softmax(dim) RuntimeError: CUDA out of memory. Tried to allocate 394.00 MiB (GPU 0; 15.78 GiB total capacity; 1.80 GiB already allocated; 178.75

    opened by afiaka87 11
  • Cant run example models in colab due to lightning error

    Cant run example models in colab due to lightning error

    Hi, I've tried a few of the notebooks you provided for the examples but I run in to the same error across all of them in colab with the following error:

    ImportError                               Traceback (most recent call last)
    [<ipython-input-6-548ac97a7512>](https://localhost:8080/#) in <module>
         15 # dalle classes
         16 
    ---> 17 from dalle_pytorch import DiscreteVAE
         18 
         19 # constants
    
    4 frames
    [/usr/local/lib/python3.7/dist-packages/taming/main.py](https://localhost:8080/#) in <module>
         10 from pytorch_lightning.trainer import Trainer
         11 from pytorch_lightning.callbacks import ModelCheckpoint, Callback, LearningRateMonitor
    ---> 12 from pytorch_lightning.utilities.distributed import rank_zero_only
         13 
         14 def get_obj_from_str(string, reload=False):
    
    ImportError: cannot import name 'rank_zero_only' from 'pytorch_lightning.utilities.distributed' (/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/distributed.py)
    
    

    I think the version of lightning that is installed by the script might be incorrect now?

    Any help getting this fixed is greatly appreciated!

    Thanks

    opened by neramas1221 0
  • DALLE trained on FashionGen Dataset RESULTS 💯

    DALLE trained on FashionGen Dataset RESULTS 💯

    DALLE on FashionGen

    • I trained Dall-E + VQGAN on the FashionGen dataset (https://arxiv.org/abs/1806.08317) on Google Colab and got decent results.
    • Without the VQGAN training on the FashionGen dataset, DALLE is really bad at generating faces which makes clothing generations looking extremely strange.

    Text to image generation and re-ranking by CLIP

    Best 16 of 48 generations ranked by CLIP

    Generations from the training set (Including their Groundtruths)

    Download (5) Download (6) Download (7) Download (8) Download (4)

    Generations based on custom prompts (withouttheir Groundtruths)

    Download (1) Download (2) Download (3) Download (9) Download

    Model specifications

    VAE Trained VQGAN for 1 epoch on Fashion-Gen dataset Embeddings: 1024 Batch size: 5

    DALLE Trained DALLE for 1 epoch on Fashion-Gen dataset dim = 312 text_seq_len = 80 depth = 36 heads = 12 dim_head = 64 reversible = 0 attn_types =('full', 'axial_row', 'axial_col', 'conv_like')

    Optimization Optimizer: Adam Learning rate: 4.5e-4 Gradient Clipping: 0.5 Batch size: 7

    image

    opened by alexriedel1 8
  • Text transformers

    Text transformers

    Hi again :) Is there any way to change the transformer architecture easily as in x-clip ? I would like to use my own ( which is pretrained ) :) Thanks !

    opened by ethancohen123 1
  • dvae training resulting in an irregular latent space

    dvae training resulting in an irregular latent space

    Hello! I'm not sure whether this should be raised as an issue or it is a fault completely on my side. But I've reached a point where I can't seem to figure it out on my own, so I hope someone could enlighten me.

    I'm trying to train the DiscreteVAE with some custom dataset, but the trained model seems to fail in learning a regular latent space.

    For instance, when I generate from a codebook index decoded from one of my dataset images, the output image seems fine, but when I try to interpolate between two indices, the latent space between two indices result in completely unrecognizable images.

    I am told that the kl loss value have something to do with the regularizing of the latent space, but according to some issues raised before, this does not seem to be a usable option.

    Is there a known reason for this kind of irregularity in the latent space? Or rather, has anyone succeeded in smooth latent interpolation while training DiscreteVAE model? It would be really helpful if someone has succeeded and can tell me about the relevant parameters.

    opened by hlp-pls 0
Releases(1.6.4)
Owner
Phil Wang
Working with Attention. It's all we need.
Phil Wang
WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

WIT (Wikipedia-based Image Text) Dataset is a large multimodal multilingual dataset comprising 37M+ image-text sets with 11M+ unique images across 100+ languages.

Google Research Datasets 740 Dec 24, 2022
Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks

Official PyTorch code for ClipBERT, an efficient framework for end-to-end learning on image-text and video-text tasks. It takes raw videos/images + text as inputs, and outputs task predictions. ClipBERT is designed based on 2D CNNs and transformers, and uses a sparse sampling strategy to enable efficient end-to-end video-and-language learning.

Jie Lei 雷杰 612 Jan 4, 2023
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Google Research 4.6k Jan 1, 2023
Code for the paper "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer"

T5: Text-To-Text Transfer Transformer The t5 library serves primarily as code for reproducing the experiments in Exploring the Limits of Transfer Lear

Google Research 3.2k Feb 17, 2021
Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer

MT5_paddle Use PaddlePaddle to reproduce the paper:mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer English | 简体中文 mT5: A Massively

null 2 Oct 17, 2021
PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer

Cross-Covariance Image Transformer (XCiT) PyTorch implementation and pretrained models for XCiT models. See XCiT: Cross-Covariance Image Transformer L

Facebook Research 605 Jan 2, 2023
PyTorch implementation of Microsoft's text-to-speech system FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

Chung-Ming Chien 1k Dec 30, 2022
Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

Pytorch-NLU,一个中文文本分类、序列标注工具包,支持中文长文本、短文本的多类、多标签分类任务,支持中文命名实体识别、词性标注、分词等序列标注任务。 Ptorch NLU, a Chinese text classification and sequence annotation toolkit, supports multi class and multi label classification tasks of Chinese long text and short text, and supports sequence annotation tasks such as Chinese named entity recognition, part of speech tagging and word segmentation.

null 186 Dec 24, 2022
Code for Text Prior Guided Scene Text Image Super-Resolution

Code for Text Prior Guided Scene Text Image Super-Resolution

null 82 Dec 26, 2022
Unofficial Implementation of Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration This repo contains only model Implementation of Zero-Shot Text-to-Speech for Text

Rishikesh (ऋषिकेश) 33 Sep 22, 2022
A fast and easy implementation of Transformer with PyTorch.

FasySeq FasySeq is a shorthand as a Fast and easy sequential modeling toolkit. It aims to provide a seq2seq model to researchers and developers, which

宁羽 7 Jul 18, 2022
Implementation of Fast Transformer in Pytorch

Fast Transformer - Pytorch Implementation of Fast Transformer in Pytorch. This only work as an encoder. Yannic video AI Epiphany Install $ pip install

Phil Wang 167 Dec 27, 2022
A PyTorch implementation of the Transformer model in "Attention is All You Need".

Attention is all you need: A Pytorch Implementation This is a PyTorch implementation of the Transformer model in "Attention is All You Need" (Ashish V

Yu-Hsiang Huang 7.1k Jan 5, 2023
Transformer-based Text Auto-encoder (T-TA) using TensorFlow 2.

T-TA (Transformer-based Text Auto-encoder) This repository contains codes for Transformer-based Text Auto-encoder (T-TA, paper: Fast and Accurate Deep

Jeong Ukjae 13 Dec 13, 2022
Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch

COCO LM Pretraining (wip) Implementation of COCO-LM, Correcting and Contrasting Text Sequences for Language Model Pretraining, in Pytorch. They were a

Phil Wang 44 Jul 28, 2022
A PyTorch Implementation of End-to-End Models for Speech-to-Text

speech Speech is an open-source package to build end-to-end models for automatic speech recognition. Sequence-to-sequence models with attention, Conne

Awni Hannun 647 Dec 25, 2022
A PyTorch implementation of paper "Learning Shared Semantic Space for Speech-to-Text Translation", ACL (Findings) 2021

Chimera: Learning Shared Semantic Space for Speech-to-Text Translation This is a Pytorch implementation for the "Chimera" paper Learning Shared Semant

Chi Han 43 Dec 28, 2022
PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation

StyleSpeech - PyTorch Implementation PyTorch Implementation of Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. Status (2021.06.09

Keon Lee 142 Jan 6, 2023