Vector Quantized Diffusion Model for Text-to-Image Synthesis

Due to company policy, I have to set microsoft/VQ-Diffusion to private for now, so I provide the same code here.

Overview

This is the official repo for the paper: Vector Quantized Diffusion Model for Text-to-Image Synthesis.

VQ-Diffusion is based on a VQ-VAE whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). It produces significantly better text-to-image generation results than autoregressive models with a similar number of parameters. Compared with previous GAN-based methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin.
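
To make the two-stage pipeline concrete, the minimal sketch below walks through generation end to end: the VQ-VAE encodes images into grids of discrete token indices, the diffusion model iteratively denoises a fully noised token grid conditioned on the text, and the VQ-VAE decoder maps the final tokens back to pixels. All names in the sketch (text_encoder, diffusion_transformer, vqvae_decoder, mask_token_id, and so on) are illustrative stand-ins, not the repo's actual API.

import torch

def generate(text_encoder, diffusion_transformer, vqvae_decoder,
             prompt, num_timesteps, mask_token_id, grid_size):
    """Schematic of VQ-Diffusion sampling; every name here is an illustrative
    stand-in and does not match the repo's actual API."""
    cond = text_encoder(prompt)  # text condition, e.g. CLIP text features
    # x_T: the fully noised state; in VQ-Diffusion this is an all-[MASK] token grid.
    x_t = torch.full((1, grid_size * grid_size), mask_token_id, dtype=torch.long)
    for t in reversed(range(num_timesteps)):
        # The transformer predicts the clean tokens x_0 given x_t and the text;
        # one reverse step x_t -> x_{t-1} is then sampled from the posterior.
        step_logits = diffusion_transformer(x_t, t, cond)
        x_t = torch.distributions.Categorical(logits=step_logits).sample()
    return vqvae_decoder(x_t)  # discrete tokens -> image pixels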

Framework

Requirements

We suggest using Docker. Alternatively, you may run:

bash install_req.sh

Data Preparation

Microsoft COCO

│MSCOCO_Caption/
├──annotations/
│  ├── captions_train2014.json
│  ├── captions_val2014.json
├──train2014/
│  ├── train2014/
│  │   ├── COCO_train2014_000000000009.jpg
│  │   ├── ......
├──val2014/
│  ├── val2014/
│  │   ├── COCO_val2014_000000000042.jpg
│  │   ├── ......
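
As a quick sanity check on this layout, here is a minimal sketch (assuming only the directory names shown above) that pairs each training image with its captions using the standard json module:

import json
import os
from collections import defaultdict

data_root = "MSCOCO_Caption"  # adjust to wherever you placed the dataset

with open(os.path.join(data_root, "annotations", "captions_train2014.json")) as f:
    ann = json.load(f)

# image id -> list of captions
captions = defaultdict(list)
for a in ann["annotations"]:
    captions[a["image_id"]].append(a["caption"])

# image id -> full path, following the nested train2014/train2014/ layout above
paths = {img["id"]: os.path.join(data_root, "train2014", "train2014", img["file_name"])
         for img in ann["images"]}

sample_id = next(iter(paths))
print(paths[sample_id], captions[sample_id][:1])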

CUB-200

│CUB-200/
├──images/
│  ├── 001.Black_footed_Albatross/
│  ├── 002.Laysan_Albatross
│  ├── ......
├──text/
│  ├── text/
│  │   ├── 001.Black_footed_Albatross/
│  │   ├── 002.Laysan_Albatross
│  │   ├── ......
├──train/
│  ├── filenames.pickle
├──test/
│  ├── filenames.pickle
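
A minimal sketch for checking this layout, assuming filenames.pickle holds a list of per-image names keyed by class folder and that each image has a matching caption file under the doubled text/text/ directory (the path convention the dataset code uses):

import os
import pickle

data_root = "CUB-200"  # adjust to wherever you placed the dataset

with open(os.path.join(data_root, "train", "filenames.pickle"), "rb") as f:
    names = pickle.load(f)  # entries like "001.Black_footed_Albatross/<image_name>"

for name in names[:3]:
    image_path = os.path.join(data_root, "images", name + ".jpg")
    # one caption .txt file per image, under the doubled text/text/ directory
    text_path = os.path.join(data_root, "text", "text", name + ".txt")
    with open(text_path) as f:
        caption_lines = [line.strip() for line in f if line.strip()]
    print(image_path, caption_lines[:1])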

ImageNet

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
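
This synset-per-folder layout is exactly what torchvision's ImageFolder expects, so a quick sketch like the following can verify that the data is in place (the transform here is only an example):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("imagenet/train", transform=transform)
val_set = datasets.ImageFolder("imagenet/val", transform=transform)
print(len(train_set.classes), "classes,", len(train_set), "training images,", len(val_set), "validation images")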

Pretrained Model

We release four text-to-image pretrained models, trained on the Conceptual Captions, MSCOCO, CUB-200, and LAION-human datasets. We also release the ImageNet pretrained model and provide the CLIP pretrained model for convenience. These should be put under OUTPUT/pretrained_model/. The pretrained model files may be large because they are training checkpoints, which contain gradient information, optimizer state, the EMA model, and more.
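
If you only need inference, such a checkpoint can be slimmed down by dropping the training-only parts. The sketch below is only a starting point: the key names 'model' and 'ema' are assumptions, so inspect state.keys() first and check that the slimmed file still loads with the inference code before relying on it.

import torch

# Load the full training checkpoint on CPU to avoid GPU memory pressure.
state = torch.load("OUTPUT/pretrained_model/human_pretrained.pth", map_location="cpu")
print(state.keys())  # inspect what the checkpoint actually contains

# Keep only what inference needs; the key names below are assumptions.
slim = {key: state[key] for key in ("model", "ema") if key in state}
torch.save(slim, "OUTPUT/pretrained_model/human_pretrained_slim.pth")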

In addition, we provide VQ-VAE models trained on the FFHQ, OpenImages, and ImageNet datasets. These models come from Taming Transformers; we include them here for convenience. Please put them under OUTPUT/pretrained_model/taming_dvae/.

Inference

To generate image from given text:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_condition("a beautiful smiling woman", truncation_rate=0.85, save_root="RESULT", batch_size=4)
VQ_Diffusion_model.inference_generate_sample_with_condition("a woman in yellow dress", truncation_rate=0.85, save_root="RESULT", batch_size=4, fast=2)  # for fast inference

You may change human_pretrained.pth to another pretrained model to test prompts from the other datasets.
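
To generate images for several prompts with a single model instance, a simple loop over the same call works:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
prompts = ["a beautiful smiling woman", "a woman in yellow dress"]
for prompt in prompts:
    VQ_Diffusion_model.inference_generate_sample_with_condition(prompt, truncation_rate=0.85, save_root="RESULT", batch_size=4)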

To generate image from given ImageNet class label:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)
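
Assuming class indices follow the standard ImageNet-1k ordering, several classes can be sampled in one go:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
for class_label in [207, 407, 980]:  # indices in the standard ImageNet-1k ordering
    VQ_Diffusion_model.inference_generate_sample_with_class(class_label, truncation_rate=0.86, save_root="RESULT", batch_size=4)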

Training

First, change data_root to the correct path in configs/coco.yaml or the other configs.
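
If you prefer to patch the path programmatically, here is a hedged sketch using PyYAML; where exactly data_root sits inside the config is not assumed, so the helper simply replaces every data_root key it finds:

import yaml

config_path = "configs/coco.yaml"
with open(config_path) as f:
    cfg = yaml.safe_load(f)

def set_data_root(node, new_root):
    """Recursively replace every data_root entry in the loaded config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "data_root":
                node[key] = new_root
            else:
                set_data_root(value, new_root)
    elif isinstance(node, list):
        for item in node:
            set_data_root(item, new_root)

set_data_root(cfg, "/path/to/MSCOCO_Caption")
# note: safe_dump drops comments and reorders keys; edit by hand if that matters
with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f)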

Train Text2Image generation on MSCOCO dataset:

python running_command/run_train_coco.py

Train Text2Image generation on CUB200 dataset:

python running_command/run_train_cub.py

Train conditional generation on ImageNet dataset:

python running_command/run_train_imagenet.py

Train unconditional generation on FFHQ dataset:

python running_command/run_train_ffhq.py

Cite VQ-Diffusion

If you find our code helpful for your research, please consider citing:

@article{gu2021vector,
  title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
  author={Gu, Shuyang and Chen, Dong and Bao, Jianmin and Wen, Fang and Zhang, Bo and Chen, Dongdong and Yuan, Lu and Guo, Baining},
  journal={arXiv preprint arXiv:2111.14822},
  year={2021}
}

Acknowledgement

Thanks to everyone who makes their code and models available. In particular, the VQ-VAE models provided above come from Taming Transformers.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VQ-Diffusion, please submit a GitHub issue. For other communications related to VQ-Diffusion, please contact Shuyang Gu ([email protected]) or Dong Chen ([email protected]).

Comments
  • Problem of reproducing the VQ-Diffusion-S results on CUB-200

    Hi there,

    Thanks for your excellent work! I am trying to reproduce the results of VQ-Diffusion-S on CUB-200 with the provided configs, but the trained model cannot generate high-fidelity images and results in an FID score of more than 30.

    I checked the code and dataset but could not locate the problem. Can you give me some suggestions for reproducing the results, e.g., which hyper-parameters should I try to change?

    Thanks a lot!

    opened by Yikai-Wang 7
  • Can you provide the config file of "Text guided image editing by VQ-Diffusion"?

    Hi there, I am interested in your excellent work and am trying to understand your code. Can you provide the config file to run the inference shown in Figure 5 (text-guided image editing by VQ-Diffusion)? It would be very helpful. Thanks!

    opened by Yikai-Wang 6
  • Question about the q_posterior function

    Thank you for releasing the code of this excellent work!

    Regarding the function linked below, I couldn't figure out the usage of the q_pred function at L215 and L237.

    https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L206

    I understand q_pred as a function that takes an initial state and a time, and returns a distribution with noise at that time. However, the q_pred function in L215 receives log_x_t instead of log_x_start, while the comment says it returns q(xt|x0). In addition, I would be grateful if you would tell me which equation q_pred in L237 corresponds to.

    opened by mikittt 4
  • Why not directly predict x_0 at inference instead of predicting iteratively?

    https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L186 https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L240

    As shown at line 186, you predict x_0 from x_t at any timestep with the transformer model. At line 240, for each inference step x_t -> x_{t-1}, you predict x_0 with p(x_0|x_t) and then sample x_{t-1} using the q_posterior function q(x_{t-1}|x_t, x_0).

    So why not directly predict x_0 with p(x_0|x_T) at inference?

    opened by PanXiebit 2
  • Question about the 'filter_ratio' parameter

    Hi, thanks for the excellent implementation!

    Could you please tell me the purpose of the 'filter_ratio' parameter in the sampling function? I also noticed that the intermediate training results are sampled with different filter_ratio values. How should we interpret the results for different values? Thanks!

    opened by yzxing87 2
  • How to use multiple GPUs for training?

    Hi, can you share how to train the code with multiple GPUs? According to your scripts in the running_command folder, it seems to support only a single GPU.

    opened by yangdongchao 2
  • Cub200Dataset

    Hi, thanks for your code. I want to run it on the CUB-200 dataset, which I downloaded from http://www.vision.caltech.edu/visipedia/CUB-200.html. However, that download does not include the filenames.pickle files. Furthermore, its annotations are in .mat format, while your dataset code expects .txt files (this_text_path = os.path.join(data_root, 'text', 'text', name+'.txt')). Could you tell me how to obtain the CUB-200 dataset in the layout you describe?

    Looking forward to your reply.

    opened by yangdongchao 2
  • About training time

    Thanks for releasing the code of this awesome work! May I know the training cost of the VQ-Diffusion-B model? How long does the training take when using 8 V100 GPUs?

    opened by yzxing87 2
  • Hardware resource requirements

    Nice work, and thanks for sharing! I wonder what the GPU requirements are for training, and how to switch to VQ-Diffusion-S for training? Looking forward to your reply!

    opened by ENJOY-Yin-jiong 2
  • Question on Equation and Implementation

    Hi author, thanks for sharing your inspiring work!

    I found some of the code hard to follow. For example, at line 182, what is the goal of the log_add_exp function, and how does this calculation of log_probs correspond to the equations in the paper? It seems that you implemented VQ-Diffusion entirely in log space, which differs considerably from the original DDPM. Could you explain why?

    Thanks. Looking forward to your reply.

    opened by Rongjiehuang 1
  • How to use multi-machine, multi-GPU training

    Hi, I want to ask whether the code supports multi-machine, multi-GPU training. I want to use two machines with 8 GPUs each, but when I try this with your code, only 8 GPUs are used.

    opened by yangdongchao 0
  • About classifier-free guidance

    Hi, in your code you use the following line to decide whether the null text vector is used: is_empty_text = torch.logical_not(input['condition_mask'][:, 2]).unsqueeze(1).unsqueeze(2).repeat(1, 77, 512). However, I found that if all caption lengths are larger than 2, is_empty_text will always be False. So how is classifier-free guidance controlled? Should some <image, null text> pairs be added to the training dataset?

    opened by yangdongchao 0
  • About unconditional synthesis on FFHQ.

    Hi Authors,

    Thanks for sharing this nice work! I am trying to reproduce the results of unconditional synthesis on FFHQ dataset. Compared to other tasks, however, training and inference details for this experiment seem to be insufficient. Could you tell me the training details and inference code for unconditional image generation on FFHQ?

    Thanks a lot:)

    opened by Godkimchiy 0
  • Change the dimensions of the input and output images

    Hi. The current version of the code seems to work with 256×256 input/output images. I am wondering if there is any way to modify the size of the input and output images.

    Thanks,

    opened by ClinicalAI 0
  • Oxford flower dataset pretrained model release

    Hi, could you please release the pretrained model on the Oxford flower dataset? Also, did you follow the previous work that splits the flower dataset into 82 training classes and 20 testing classes (classes 1 to 20)? Thanks very much!

    opened by xouyang0079 0