ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis

CompVis Heidelberg

Last update: Jan 1, 2023

Overview

ImageBART

NeurIPS 2021

Patrick Esser*, Robin Rombach*, Andreas Blattmann*, Björn Ommer
* equal contribution

arXiv | BibTeX | Poster

Requirements

A suitable conda environment named imagebart can be created and activated with:

conda env create -f environment.yaml
conda activate imagebart

Get the Models

We provide pretrained weights and hyperparameters for models trained on the following datasets:

FFHQ:
- 4 scales, geometric noise schedule: wget -c https://ommer-lab.com/files/ffhq_4_scales_geometric.zip
- 2 scales, custom noise schedule: wget -c https://ommer-lab.com/files/ffhq_2_scales_custom.zip
LSUN, 3 scales, custom noise schedules:
- Churches: wget -c https://ommer-lab.com/files/churches_3_scales.zip
- Bedrooms: wget -c https://ommer-lab.com/files/bedrooms_3_scales.zip
- Cats: wget -c https://ommer-lab.com/files/cats_3_scales.zip
Class-conditional ImageNet:
- 5 scales, custom noise schedule: wget -c https://ommer-lab.com/files/cin_5_scales_custom.zip
- 4 scales, geometric noise schedule: wget -c https://ommer-lab.com/files/cin_4_scales_geometric.zip

Download the respective files and extract their contents to a directory ./models/.

Moreover, we provide all the required VQGANs as a .zip at https://ommer-lab.com/files/vqgan.zip, whose contents have to be extracted to ./vqgan/.
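As an alternative to running wget and unzipping by hand, the download-and-extract step can be sketched in Python. This is a minimal sketch using only the standard library; the helper names (`archive_url`, `extract_archive`, `fetch_and_extract`) and the `downloads` cache directory are illustrative assumptions and not part of the repository. Unlike `wget -c`, it does not resume partial downloads, and it makes no assumption about the internal layout of the archives.

```python
import urllib.request
import zipfile
from pathlib import Path

BASE_URL = "https://ommer-lab.com/files"

# Archive names exactly as listed in this README.
MODEL_ARCHIVES = [
    "ffhq_4_scales_geometric.zip",
    "ffhq_2_scales_custom.zip",
    "churches_3_scales.zip",
    "bedrooms_3_scales.zip",
    "cats_3_scales.zip",
    "cin_5_scales_custom.zip",
    "cin_4_scales_geometric.zip",
]
VQGAN_ARCHIVE = "vqgan.zip"


def archive_url(name):
    """Build the download URL for one archive name."""
    return f"{BASE_URL}/{name}"


def extract_archive(zip_path, dest_dir):
    """Extract a .zip file into dest_dir, creating the directory if needed."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    return dest


def fetch_and_extract(name, dest_dir, download_dir="downloads"):
    """Download an archive (skipped if already present) and extract it.

    download_dir is a hypothetical local cache directory.
    """
    cache = Path(download_dir)
    cache.mkdir(parents=True, exist_ok=True)
    zip_path = cache / name
    if not zip_path.exists():
        urllib.request.urlretrieve(archive_url(name), str(zip_path))
    return extract_archive(zip_path, dest_dir)
```

Under these assumptions, `fetch_and_extract("churches_3_scales.zip", "models")` would place the LSUN Churches model under ./models/, and `fetch_and_extract(VQGAN_ARCHIVE, "vqgan")` would place the VQGANs under ./vqgan/.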

Get the Data

Running the training configs or the inpainting script requires a dataset available locally. For ImageNet and FFHQ, see this repo's parent directory taming-transformers. The LSUN datasets can be conveniently downloaded via the script available here. We performed a custom split into training and validation images, and provide the corresponding filenames at https://ommer-lab.com/files/lsun.zip. After downloading, extract them to ./data/lsun. The bedrooms/cats/churches subsets should also be placed or symlinked at ./data/lsun/bedrooms, ./data/lsun/cats, and ./data/lsun/churches, respectively.
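Before launching training or inpainting, it can help to confirm that the LSUN layout described above is in place. The following is a small sketch; the helper name `missing_lsun_subsets` is an assumption for illustration and does not exist in the repository. It only checks the three subset directories, not the split files.

```python
from pathlib import Path

# Subset directory names taken from the paths given above.
LSUN_SUBSETS = ("bedrooms", "cats", "churches")


def missing_lsun_subsets(root="data/lsun"):
    """Return the subset names whose directory (or symlink) is absent under root."""
    root = Path(root)
    # is_dir() follows symlinks, so symlinked subsets count as present.
    return [name for name in LSUN_SUBSETS if not (root / name).is_dir()]
```

An empty list from `missing_lsun_subsets()` means all three directories were found relative to the current working directory.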

Inference

Unconditional Sampling

We provide a script for sampling from unconditional models trained on the LSUN-{churches,bedrooms,cats} and FFHQ datasets.

FFHQ

On the FFHQ dataset, we provide two distinct pretrained models: one with a chain of length 4 and a geometric noise schedule as proposed by Sohl-Dickstein et al. [1], and another with a chain of length 2 and a custom schedule. These models can be started with

CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/ffhq/<config>

LSUN

For the models trained on the LSUN-datasets, use

CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/lsun/<config>

Class Conditional Sampling on ImageNet

To sample from class-conditional ImageNet models, use

CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/imagenet/<config>

Image Editing with Unconditional Models

We also provide a script for image editing with our unconditional models. For our FFHQ model with the geometric schedule, this can be started with

CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/inpaint_imagebart.py configs/sampling/ffhq/ffhq_4scales_geometric.yaml

resulting in samples similar to the following.

Training

In general, there are two options for training the autoregressive transition probabilities of the reverse Markov chain: (i) train them jointly, taking into account a weighting of the individual scale contributions, or (ii) train them independently, which means that each training process optimizes a single transition and the scales must be stacked after training. We conduct most of our experiments using the latter option, but provide configurations for both cases.
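To make the two options more concrete, the forward corruption that the reverse transitions learn to undo can be sketched as follows. This follows the multinomial diffusion step of Hoogeboom et al. (listed under Shout-Outs): each discrete token is kept with probability 1 - β and otherwise replaced by a uniform draw from the codebook. The function names, the codebook size, and the β values are illustrative assumptions and are not taken from the repository; the actual noise schedules are defined by the provided configs.

```python
import random


def corrupt_tokens(tokens, beta, codebook_size, rng):
    """One forward step: resample each token uniformly with probability beta."""
    return [
        rng.randrange(codebook_size) if rng.random() < beta else tok
        for tok in tokens
    ]


def forward_chain(x0, betas, codebook_size, seed=0):
    """Return [x_0, x_1, ..., x_T], applying one corruption step per beta."""
    rng = random.Random(seed)
    chain = [list(x0)]
    for beta in betas:
        chain.append(corrupt_tokens(chain[-1], beta, codebook_size, rng))
    return chain


def transition_pairs(chain):
    """(input, target) pairs: the transition at scale t predicts x_{t-1} from x_t."""
    return [(chain[t], chain[t - 1]) for t in range(1, len(chain))]


# Hypothetical example: 6 tokens, codebook of 16 entries, 3 scales.
chain = forward_chain([3, 7, 7, 1, 0, 12], betas=[0.1, 0.3, 0.6], codebook_size=16)
pairs = transition_pairs(chain)
```

In this sketch, independent training would hand each entry of `pairs` to its own model, while joint training would sum a weighted loss over all entries in a single run.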

Training Scales Independently

For training scales independently, each transition requires a separate optimization process, which can be started via

CUDA_VISIBLE_DEVICES=<gpu_id> python main.py --base configs/<dataset>/<config>.yaml -t --gpus 0,

We provide training configs for four-scale training on FFHQ with a geometric schedule, four-scale geometric training on ImageNet, and various three-scale experiments on LSUN. See also the overview of our pretrained models.
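Because each scale is a separate run, a small launcher can queue the per-scale configs one after another. The sketch below only assembles the command shown above for each config; the helper names and the example config paths are hypothetical, and `dry_run=True` returns the commands without starting anything.

```python
import os
import subprocess


def build_train_command(config_path):
    """Assemble the training command from this README for one config file."""
    # The trailing comma in "0," is kept exactly as in the command above.
    return ["python", "main.py", "--base", config_path, "-t", "--gpus", "0,"]


def train_scales(config_paths, gpu_id, dry_run=True):
    """Run one training process per scale config, sequentially.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_id))
    commands = [build_train_command(path) for path in config_paths]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, env=env, check=True)
    return commands
```

For example, `train_scales(["configs/<dataset>/<config_scale_1>.yaml", "configs/<dataset>/<config_scale_2>.yaml"], gpu_id=0)` would return two commands, with the placeholders replaced by actual config files from the repository.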

Training Scales Jointly

For completeness, we also provide a config to run a joint training with 4 scales on FFHQ. Training can be started by running

CUDA_VISIBLE_DEVICES=<gpu_id> python main.py --base configs/ffhq/ffhq_4_scales_joint-training.yaml -t --gpus 0,

Shout-Outs

Many thanks to all who make their work and implementations publicly available. For this work, these were in particular:

- The extremely clear and extensible encoder-decoder transformer implementations by lucidrains: https://github.com/lucidrains/x-transformers
- Emiel Hoogeboom et al.'s paper on multinomial diffusion and argmax flows: https://arxiv.org/abs/2102.05379

References

[1] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. Proceedings of the 32nd International Conference on Machine Learning.

Bibtex

@article{DBLP:journals/corr/abs-2108-08827,
  author  = {Patrick Esser and Robin Rombach and Andreas Blattmann and Bj{\"{o}}rn Ommer},
  title   = {ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis},
  journal = {CoRR},
  volume  = {abs/2108.08827},
  year    = {2021}
}

Comments

Missing key pretty

Hi!

When I tried to generate a sample with an LSUN checkpoint, I got an error:

  File "sample_imagebart.py", line 295, in <module>
    print(paths.pretty())
...
omegaconf.errors.ConfigAttributeError: Missing key pretty
    full_key: pretty
    object_type=dict

opened by andorxornot 2

Great work! Some questions

Thanks for the awesome work! Really amazing generation results! It seems some related works like [1,2], which employ bidirectional transformers for generation, are not mentioned or compared in the paper. I am a little curious about how the performance differs.

[1] High-Fidelity Pluralistic Image Completion with Transformers, ICCV 2021
[2] M6-UFC: Unifying Multi-Modal Controls for Conditional Image Synthesis, arXiv 2021

opened by fxyang9 0
