Vector Quantized Diffusion Model for Text-to-Image Synthesis

Due to company policy, I have to set microsoft/VQ-Diffusion to private for now, so I provide the same code here.

Overview

This is the official repo for the paper: Vector Quantized Diffusion Model for Text-to-Image Synthesis.

VQ-Diffusion is based on a VQ-VAE whose latent space is modeled by a conditional variant of the recently developed Denoising Diffusion Probabilistic Model (DDPM). It produces significantly better text-to-image generation results than autoregressive models with a similar number of parameters. Compared with previous GAN-based methods, VQ-Diffusion can handle more complex scenes and improves the synthesized image quality by a large margin.
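
To make the two-stage pipeline concrete, the minimal sketch below walks through generation end to end: the VQ-VAE encodes images into grids of discrete token indices, the diffusion model iteratively denoises a fully noised token grid conditioned on the text, and the VQ-VAE decoder maps the final tokens back to pixels. All names in the sketch (text_encoder, diffusion_transformer, vqvae_decoder, mask_token_id, and so on) are illustrative stand-ins, not the repo's actual API.

import torch

def generate(text_encoder, diffusion_transformer, vqvae_decoder,
             prompt, num_timesteps, mask_token_id, grid_size):
    """Schematic of VQ-Diffusion sampling; every name here is an illustrative
    stand-in and does not match the repo's actual API."""
    cond = text_encoder(prompt)  # text condition, e.g. CLIP text features
    # x_T: the fully noised state; in VQ-Diffusion this is an all-[MASK] token grid.
    x_t = torch.full((1, grid_size * grid_size), mask_token_id, dtype=torch.long)
    for t in reversed(range(num_timesteps)):
        # The transformer predicts the clean tokens x_0 given x_t and the text;
        # one reverse step x_t -> x_{t-1} is then sampled from the posterior.
        step_logits = diffusion_transformer(x_t, t, cond)
        x_t = torch.distributions.Categorical(logits=step_logits).sample()
    return vqvae_decoder(x_t)  # discrete tokens -> image pixels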

Framework

Requirements

We suggest using Docker. Alternatively, you may run:

bash install_req.sh

Data Preparation

Microsoft COCO

│MSCOCO_Caption/
├──annotations/
│  ├── captions_train2014.json
│  ├── captions_val2014.json
├──train2014/
│  ├── train2014/
│  │   ├── COCO_train2014_000000000009.jpg
│  │   ├── ......
├──val2014/
│  ├── val2014/
│  │   ├── COCO_val2014_000000000042.jpg
│  │   ├── ......
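
As a quick sanity check on this layout, here is a minimal sketch (assuming only the directory names shown above) that pairs each training image with its captions using the standard json module:

import json
import os
from collections import defaultdict

data_root = "MSCOCO_Caption"  # adjust to wherever you placed the dataset

with open(os.path.join(data_root, "annotations", "captions_train2014.json")) as f:
    ann = json.load(f)

# image id -> list of captions
captions = defaultdict(list)
for a in ann["annotations"]:
    captions[a["image_id"]].append(a["caption"])

# image id -> full path, following the nested train2014/train2014/ layout above
paths = {img["id"]: os.path.join(data_root, "train2014", "train2014", img["file_name"])
         for img in ann["images"]}

sample_id = next(iter(paths))
print(paths[sample_id], captions[sample_id][:1])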

CUB-200

│CUB-200/
├──images/
│  ├── 001.Black_footed_Albatross/
│  ├── 002.Laysan_Albatross
│  ├── ......
├──text/
│  ├── text/
│  │   ├── 001.Black_footed_Albatross/
│  │   ├── 002.Laysan_Albatross
│  │   ├── ......
├──train/
│  ├── filenames.pickle
├──test/
│  ├── filenames.pickle
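
A minimal sketch for checking this layout, assuming filenames.pickle holds a list of per-image names keyed by class folder and that each image has a matching caption file under the doubled text/text/ directory (the path convention the dataset code uses):

import os
import pickle

data_root = "CUB-200"  # adjust to wherever you placed the dataset

with open(os.path.join(data_root, "train", "filenames.pickle"), "rb") as f:
    names = pickle.load(f)  # entries like "001.Black_footed_Albatross/<image_name>"

for name in names[:3]:
    image_path = os.path.join(data_root, "images", name + ".jpg")
    # one caption .txt file per image, under the doubled text/text/ directory
    text_path = os.path.join(data_root, "text", "text", name + ".txt")
    with open(text_path) as f:
        caption_lines = [line.strip() for line in f if line.strip()]
    print(image_path, caption_lines[:1])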

ImageNet

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......
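
This synset-per-folder layout is exactly what torchvision's ImageFolder expects, so a quick sketch like the following can verify that the data is in place (the transform here is only an example):

from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(256),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("imagenet/train", transform=transform)
val_set = datasets.ImageFolder("imagenet/val", transform=transform)
print(len(train_set.classes), "classes,", len(train_set), "training images,", len(val_set), "validation images")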

Pretrained Model

We release four text-to-image pretrained models, trained on the Conceptual Captions, MSCOCO, CUB-200, and LAION-human datasets. We also release the ImageNet pretrained model and provide the CLIP pretrained model for convenience. These should be put under OUTPUT/pretrained_model/. The pretrained model files may be large because they are training checkpoints, which contain gradient information, optimizer state, the EMA model, and more.
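
If you only need inference, such a checkpoint can be slimmed down by dropping the training-only parts. The sketch below is only a starting point: the key names 'model' and 'ema' are assumptions, so inspect state.keys() first and check that the slimmed file still loads with the inference code before relying on it.

import torch

# Load the full training checkpoint on CPU to avoid GPU memory pressure.
state = torch.load("OUTPUT/pretrained_model/human_pretrained.pth", map_location="cpu")
print(state.keys())  # inspect what the checkpoint actually contains

# Keep only what inference needs; the key names below are assumptions.
slim = {key: state[key] for key in ("model", "ema") if key in state}
torch.save(slim, "OUTPUT/pretrained_model/human_pretrained_slim.pth")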

In addition, we provide VQ-VAE models trained on the FFHQ, OpenImages, and ImageNet datasets. These models come from Taming Transformers; we include them here for convenience. Please put them under OUTPUT/pretrained_model/taming_dvae/.

Inference

To generate image from given text:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_condition("a beautiful smiling woman", truncation_rate=0.85, save_root="RESULT", batch_size=4)
VQ_Diffusion_model.inference_generate_sample_with_condition("a woman in yellow dress", truncation_rate=0.85, save_root="RESULT", batch_size=4, fast=2)  # for fast inference

You may change human_pretrained.pth to another pretrained model to test prompts from the other datasets.
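
To generate images for several prompts with a single model instance, a simple loop over the same call works:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_text.yaml', path='OUTPUT/pretrained_model/human_pretrained.pth')
prompts = ["a beautiful smiling woman", "a woman in yellow dress"]
for prompt in prompts:
    VQ_Diffusion_model.inference_generate_sample_with_condition(prompt, truncation_rate=0.85, save_root="RESULT", batch_size=4)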

To generate image from given ImageNet class label:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
VQ_Diffusion_model.inference_generate_sample_with_class(407, truncation_rate=0.86, save_root="RESULT", batch_size=4)
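
Assuming class indices follow the standard ImageNet-1k ordering, several classes can be sampled in one go:

from inference_VQ_Diffusion import VQ_Diffusion

VQ_Diffusion_model = VQ_Diffusion(config='OUTPUT/pretrained_model/config_imagenet.yaml', path='OUTPUT/pretrained_model/imagenet_pretrained.pth')
for class_label in [207, 407, 980]:  # indices in the standard ImageNet-1k ordering
    VQ_Diffusion_model.inference_generate_sample_with_class(class_label, truncation_rate=0.86, save_root="RESULT", batch_size=4)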

Training

First, change data_root to the correct path in configs/coco.yaml or the other configs.
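
If you prefer to patch the path programmatically, here is a hedged sketch using PyYAML; where exactly data_root sits inside the config is not assumed, so the helper simply replaces every data_root key it finds:

import yaml

config_path = "configs/coco.yaml"
with open(config_path) as f:
    cfg = yaml.safe_load(f)

def set_data_root(node, new_root):
    """Recursively replace every data_root entry in the loaded config."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "data_root":
                node[key] = new_root
            else:
                set_data_root(value, new_root)
    elif isinstance(node, list):
        for item in node:
            set_data_root(item, new_root)

set_data_root(cfg, "/path/to/MSCOCO_Caption")
# note: safe_dump drops comments and reorders keys; edit by hand if that matters
with open(config_path, "w") as f:
    yaml.safe_dump(cfg, f)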

Train Text2Image generation on MSCOCO dataset:

python running_command/run_train_coco.py

Train Text2Image generation on CUB200 dataset:

python running_command/run_train_cub.py

Train conditional generation on ImageNet dataset:

python running_command/run_train_imagenet.py

Train unconditional generation on FFHQ dataset:

python running_command/run_train_ffhq.py

Cite VQ-Diffusion

If you find our code helpful for your research, please consider citing:

@article{gu2021vector,
  title={Vector Quantized Diffusion Model for Text-to-Image Synthesis},
  author={Gu, Shuyang and Chen, Dong and Bao, Jianmin and Wen, Fang and Zhang, Bo and Chen, Dongdong and Yuan, Lu and Guo, Baining},
  journal={arXiv preprint arXiv:2111.14822},
  year={2021}
}

Acknowledgement

Thanks to everyone who makes their code and models available. In particular, the VQ-VAE models provided above come from Taming Transformers.

License

This project is licensed under the license found in the LICENSE file in the root directory of this source tree.

Microsoft Open Source Code of Conduct

Contact Information

For help or issues using VQ-Diffusion, please submit a GitHub issue. For other communications related to VQ-Diffusion, please contact Shuyang Gu ([email protected]) or Dong Chen ([email protected]).

Comments
  • Problem of reproducing the VQ-Diffusion-S results on CUB-200

    Hi there,

    Thanks for your excellent work! I am trying to reproduce the results of VQ-Diffusion-S on CUB-200 with the provided configs, but the trained model cannot generate high-fidelity images and results in an FID score of more than 30.

    I checked the code and dataset but could not locate the problem. Can you give me some suggestions for reproducing the results, e.g., which hyper-parameters should I try to change?

    Thanks a lot!

    opened by Yikai-Wang 7
  • Can you provide the config file of "Text guided image editing by VQ-Diffusion"?

    Hi there, I am interested in your excellent work and am trying to understand your code. Can you provide the config file to run the inference shown in Figure 5 (text-guided image editing by VQ-Diffusion)? It would be very helpful. Thanks!

    opened by Yikai-Wang 6
  • Question about the q_posterior function

    Thank you for releasing the code of this excellent work!

    Regarding the function linked below, I couldn't figure out the usage of the q_pred function at L215 and L237.

    https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L206

    I understand q_pred as a function that takes an initial state and a time, and returns a distribution with noise at that time. However, the q_pred function in L215 receives log_x_t instead of log_x_start, while the comment says it returns q(xt|x0). In addition, I would be grateful if you would tell me which equation q_pred in L237 corresponds to.

    opened by mikittt 4
  • Why not directly predict x_0 at inference instead of predicting iteratively?

    https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L186 https://github.com/cientgu/VQ-Diffusion/blob/37bbcccdd4aef1794dac645128d864a9f69ed985/image_synthesis/modeling/transformers/diffusion_transformer.py#L240

    As shown at line 186, you predict x_0 from x_t at any timestep with the transformer model. At line 240, for each inference step x_t -> x_{t-1}, you predict x_0 with p(x_0|x_t) and then sample x_{t-1} using the q_posterior function q(x_{t-1}|x_t, x_0).

    So why not directly predict x_0 with p(x_0|x_T) at inference?

    opened by PanXiebit 2
  • Question about the 'filter_ratio' parameter

    Hi, thanks for the excellent implementation!

    Could you please tell me the purpose of the 'filter_ratio' parameter in the sampling function? I also noticed that the intermediate training results are sampled with different filter_ratio values. How should we interpret the results for different values? Thanks!

    opened by yzxing87 2
  • How to use multiple GPUs for training?

    Hi, can you share how to train the code with multiple GPUs? According to your scripts in the running_command folder, it seems to support only a single GPU.

    opened by yangdongchao 2
  • Cub200Dataset

    Hi, thanks for your code. I want to run it on the CUB-200 dataset, which I downloaded from http://www.vision.caltech.edu/visipedia/CUB-200.html. However, that download does not include the filenames.pickle files. Furthermore, its annotations are in .mat format, while your dataset code expects .txt files (this_text_path = os.path.join(data_root, 'text', 'text', name+'.txt')). Could you tell me how to obtain the CUB-200 dataset in the layout you describe?

    Looking forward to your reply.

    opened by yangdongchao 2
  • About training time

    Thanks for releasing the code of this awesome work! May I know the training cost of the VQ-Diffusion-B model? How long does the training take when using 8 V100 GPUs?

    opened by yzxing87 2
  • Hardware resource requirements

    Nice work, and thanks for sharing! I wonder what the GPU requirements are for training, and how to switch to VQ-Diffusion-S for training? Looking forward to your reply!

    opened by ENJOY-Yin-jiong 2
  • Question on Equation and Implementation

    Hi author, thanks for sharing your inspiring work!

    I found some of the code hard to follow. For example, at line 182, what is the goal of the log_add_exp function, and how does this calculation of log_probs correspond to the equations in the paper? It seems that you implemented VQ-Diffusion entirely in log space, which differs considerably from the original DDPM. Could you explain why?

    Thanks. Looking forward to your reply.

    opened by Rongjiehuang 1
  • How to use multi-machine, multi-GPU training

    Hi, I want to ask whether the code supports multi-machine, multi-GPU training. I want to use two machines with 8 GPUs each, but when I try this with your code, only 8 GPUs are used.

    opened by yangdongchao 0
  • About classifier-free guidance

    Hi, in your code you use the following line to decide whether the null text vector is used: is_empty_text = torch.logical_not(input['condition_mask'][:, 2]).unsqueeze(1).unsqueeze(2).repeat(1, 77, 512). However, I found that if all caption lengths are larger than 2, is_empty_text will always be False. So how is classifier-free guidance controlled? Should some <image, null text> pairs be added to the training dataset?

    opened by yangdongchao 0
  • About unconditional synthesis on FFHQ.

    Hi Authors,

    Thanks for sharing this nice work! I am trying to reproduce the results of unconditional synthesis on FFHQ dataset. Compared to other tasks, however, training and inference details for this experiment seem to be insufficient. Could you tell me the training details and inference code for unconditional image generation on FFHQ?

    Thanks a lot:)

    opened by Godkimchiy 0
  • Change the dimensions of the input and output images

    Hi. The current version of the code seems to work with 256×256 input/output images. I am wondering if there is any way to modify the size of the input and output images.

    Thanks,

    opened by ClinicalAI 0
  • Oxford flower dataset pretrained model release

    Hi, could you please release the pretrained model on the Oxford flower dataset? Also, did you follow the previous work that splits the flower dataset into 82 training classes and 20 testing classes (classes 1 to 20)? Thanks very much!

    opened by xouyang0079 0