A 1.3B text-to-image generation model trained on 14 million image-text pairs

Kakao Brain

Last update: Dec 14, 2022

Overview

minDALL-E on Conceptual Captions

minDALL-E, named after minGPT [8], is a 1.3B text-to-image generation model trained on 14 million image-text pairs for non-commercial purposes.

Environment Setup

Basic setup

PyTorch == 1.8.0
CUDA >= 10.1

Other packages

pip install -r requirements.txt

Model Checkpoint

Model structure (two-stage autoregressive model)
- Stage1: Unlike the original DALL-E [1], we replace the discrete VAE with VQGAN [2] to generate high-quality samples effectively. We slightly fine-tune vqgan_imagenet_f16_16384, provided by the official VQGAN repository, on FFHQ [3] as well as ImageNet.
- Stage2: We train our 1.3B transformer from scratch on 14 million image-text pairs from CC3M [4] and CC12M [5]. For the detailed model specification, see configs/dalle-1.3B.yaml.
You can download the pretrained models, including the tokenizer, from this link. The download requires about 5 GB of disk space.
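The two-stage layout can be made concrete with a little arithmetic. The sketch below is only an illustration under assumptions: a 256x256 input resolution, a downsampling factor of 16 and a 16384-entry codebook (both read off the checkpoint name vqgan_imagenet_f16_16384), and a text length of 64 tokens. The authoritative values are in configs/dalle-1.3B.yaml.

```python
# Back-of-the-envelope token budget for a two-stage text-to-image model.
# All four constants are assumptions for illustration, not values read
# from the repository's config file.
IMAGE_SIZE = 256       # assumed input resolution (pixels per side)
DOWNSAMPLE = 16        # assumed from the "f16" in the checkpoint name
CODEBOOK_SIZE = 16384  # assumed from the "16384" in the checkpoint name
TEXT_TOKENS = 64       # assumed maximum number of text tokens


def image_token_grid(image_size: int, downsample: int) -> int:
    """Side length of the grid of discrete codes produced by stage 1."""
    if image_size % downsample != 0:
        raise ValueError("image_size must be divisible by downsample")
    return image_size // downsample


grid = image_token_grid(IMAGE_SIZE, DOWNSAMPLE)
image_tokens = grid * grid                    # codes per image
sequence_length = TEXT_TOKENS + image_tokens  # tokens seen by stage 2

print(f"stage 1: {grid}x{grid} grid, {image_tokens} codes, "
      f"each one of {CODEBOOK_SIZE} codebook entries")
print(f"stage 2: sequence of {sequence_length} tokens per image-text pair")
```

Under these assumptions, stage 2 models a sequence of text tokens followed by the image codes that stage 1 later decodes back into pixels.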

Sampling

Given a text prompt, the code snippet below generates candidate images and re-ranks them using OpenAI's CLIP [6].
This has been tested on a single V100 GPU with 32 GB of memory. On GPUs with less memory, lower num_candidates to avoid out-of-memory (OOM) errors.

from matplotlib import pyplot as plt
import numpy as np  # needed for np.transpose below
import clip
from dalle.models import Dalle
from dalle.utils.utils import set_seed, clip_score

device = 'cuda:0'
set_seed(0)

prompt = "A painting of a monkey with sunglasses in the frame"
model = Dalle.from_pretrained('minDALL-E/1.3B')  # This will automatically download the pretrained model.
model.to(device=device)

# Sampling
images = model.sampling(prompt=prompt,
                        top_k=256, # It is recommended that top_k is set lower than 256.
                        top_p=None,
                        softmax_temperature=1.0,
                        num_candidates=96,
                        device=device).cpu().numpy()
images = np.transpose(images, (0, 2, 3, 1))

# CLIP Re-ranking
model_clip, preprocess_clip = clip.load("ViT-B/32", device=device)
model_clip.to(device=device)
rank = clip_score(prompt=prompt,
                  images=images,
                  model_clip=model_clip,
                  preprocess_clip=preprocess_clip,
                  device=device)

# Plot images
images = images[rank]
plt.imshow(images[0])
plt.show()
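
If even a small num_candidates does not fit in memory in one call, one possible workaround is to draw the candidates in several smaller batches and pool them before re-ranking. The helper below is a hypothetical sketch, not part of the repository: sample_in_chunks and fake_sampler are names invented here, and the stub stands in for a call such as model.sampling(..., num_candidates=n, ...).

```python
from typing import Callable, List, Sequence


def sample_in_chunks(sample_fn: Callable[[int], Sequence],
                     num_candidates: int,
                     chunk_size: int) -> List:
    """Collect num_candidates samples by calling sample_fn on small batches.

    sample_fn(n) must return a sequence of n samples. Peak memory is then
    bounded by chunk_size rather than by num_candidates.
    """
    if num_candidates <= 0 or chunk_size <= 0:
        raise ValueError("num_candidates and chunk_size must be positive")
    results: List = []
    remaining = num_candidates
    while remaining > 0:
        n = min(chunk_size, remaining)  # the last chunk may be smaller
        results.extend(sample_fn(n))
        remaining -= n
    return results


def fake_sampler(n: int) -> List[str]:
    """Stand-in for the real sampler; returns n placeholder 'images'."""
    return [f"image_{i}" for i in range(n)]


images = sample_in_chunks(fake_sampler, num_candidates=96, chunk_size=16)
print(len(images))
```

With the real model, sample_fn would wrap the sampling call and move each batch to the CPU before the next one is generated.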

For a complete Python sampling script, see examples/sampling_ex.py.
For an interactive demo, see examples/sampling_interactive_demo.ipynb. You may need to install ipywidgets before running it.
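
To illustrate what the top_k and softmax_temperature arguments control, here is a minimal, dependency-free sketch of top-k filtering followed by a temperature-scaled softmax. It is not the repository's implementation; top_k_probs and sample_token are hypothetical names, and the logits are toy values.

```python
import math
import random
from typing import List


def top_k_probs(logits: List[float], top_k: int,
                temperature: float = 1.0) -> List[float]:
    """Softmax over the top_k largest logits; all other entries get 0."""
    if not 1 <= top_k <= len(logits):
        raise ValueError("top_k must be between 1 and len(logits)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    # The k-th largest value is the cutoff (ties at the cutoff are kept).
    threshold = sorted(scaled, reverse=True)[top_k - 1]
    kept = [x if x >= threshold else -math.inf for x in scaled]
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in kept]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(probs: List[float], rng: random.Random) -> int:
    """Draw one token index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]
probs = top_k_probs(toy_logits, top_k=2, temperature=1.0)
token = sample_token(probs, random.Random(0))
print(probs, token)
```

A smaller top_k or a lower temperature concentrates probability on the most likely tokens, which usually trades diversity for fidelity.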

Samples (Top-K=256, Temperature=1.0)

"a painting of a {cat, dog} with sunglasses in the frame"

"a large {pink, black} elephant walking on the beach"

"Eiffel tower on a {desert, mountain}"

Quantitative Results

We have validated minDALL-E on the CC3M validation set (in-distribution evaluation) and MS-COCO (zero-shot evaluation).
For CC3M, we measure the cosine similarity between image and text representations from the pretrained CLIP model (ViT-B/32), referred to as CLIP-score.
For MS-COCO, we compute FID between 30K generated and real samples from MS-COCO 2017, where we randomly choose 30K captions from COCO as in DALL-E. We select the best out of 32 candidates by CLIP re-ranking.
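
As a rough illustration of CLIP re-ranking, the sketch below scores candidates by cosine similarity against a text embedding and sorts them from best to worst. The two-dimensional vectors are toy stand-ins for CLIP embeddings, and cosine_similarity and rerank are hypothetical helpers rather than the repository's clip_score.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def rerank(text_embedding: Sequence[float],
           image_embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Indices of the candidates, highest similarity first."""
    scores = [cosine_similarity(text_embedding, e) for e in image_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


text = [1.0, 0.0]                                  # toy text embedding
candidates = [[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # toy image embeddings
order = rerank(text, candidates)
print(order)  # [2, 1, 0]
```

Selecting the best of 32 candidates then amounts to keeping the candidate at order[0].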

Model	CC3M:CLIP-score (higher is better)	MS-COCO:FID-30K (lower is better)
VQGAN [2]	0.20	-
ImageBART [7]	0.23	-
DALL-E [1]	-	27.5
minDALL-E	0.26	14.7

Transfer Learning Examples

minDALL-E, which is pre-trained with noisy text supervision, can be transferred to class-conditional and unconditional generation tasks. To validate this, we fine-tune it on ImageNet for 8 epochs for both class-conditional and unconditional generation.
The commands below fine-tune the pretrained minDALL-E. Fine-tuning takes about 36 hours on 8 V100 GPUs.

# unconditional image generation for ImageNet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-uncond-gen.yaml \
                                        -u=[MODEL_CKPT] \
                                        -r=[RESULT_PATH] \
                                        --n-gpus=[NUM_GPUS]

# class-conditional image generation for ImageNet (256x256)
python examples/transfer_learning_ex.py -d=configs/transfer-imagenet-clscond-gen.yaml \
                                        -u=[MODEL_CKPT] \
                                        -r=[RESULT_PATH] \
                                        --n-gpus=[NUM_GPUS]

We compute FID-50K between 50K generated samples and all ImageNet training samples, using top-k=256 and softmax temperature=1.0 for generation. All results are obtained without rejection sampling. Interestingly, our model is competitive with the baselines even though minDALL-E is fine-tuned for only a few epochs.

Model	Params	FID-50K (class-cond.)	FID-50K (uncond.)
VQGAN	1.4B	15.78	-
ImageBART	3.5B	21.19	-
minDALL-E	1.3B	15.55	37.58
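
For reference, the Fréchet Inception Distance reported in these tables is commonly defined as follows, where $\mu_r, \Sigma_r$ and $\mu_g, \Sigma_g$ are the mean and covariance of Inception features for the real and generated samples. This is the standard definition; the exact feature extractor and implementation used for the numbers here are not stated in this document.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean that the two feature distributions are closer.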

BibTeX

If you find this repository useful in your research, please cite:

@misc{kakaobrain2021minDALL-E,
  title         = {minDALL-E on Conceptual Captions},
  author        = {Saehoon Kim and Sanghun Cho and Chiheon Kim and Doyup Lee and Woonhyuk Baek},
  year          = {2021},
  howpublished  = {\url{https://github.com/kakaobrain/minDALL-E}},
}

References

[1] Ramesh et al. Zero-Shot Text-to-Image Generation, ICML 2021.
[2] Esser et al. Taming Transformers for High-Resolution Image Synthesis, CVPR 2021.
[3] Karras et al. A Style-Based Generator Architecture for Generative Adversarial Networks, CVPR 2019.
[4] Sharma et al. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning, ACL 2018.
[5] Changpinyo et al. Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts, CVPR 2021.
[6] Radford et al. Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
[7] Esser et al. ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis, NeurIPS 2021.
[8] https://github.com/karpathy/minGPT

Licenses

The source code is licensed under the Apache 2.0 License.
The stage2 pretrained weights are licensed under the CC-BY-NC-SA 4.0 License.

Contact

We hope that minDALL-E helps various projects in research-oriented institutes and startups. If you would like to collaborate with us or share feedback, please e-mail us at [email protected]

Limitations

Although minDALL-E is trained on a small dataset (14M image-text pairs), it might still be vulnerable to malicious prompt engineering that makes it generate socially unacceptable images. If you observe such images, please report the prompt and the generated images to us.

Comments

Does zero-shot work in minDALL-E?

Thanks for your amazing work!

I'm attempting zero-shot image-to-image translation, as described in the original paper, by inserting only half of the image. The outcomes are as follows. Will this problem be solved if I increase the size of the model?

opened by SeungyounShin 3

text token index slice to N-1

Hi, thanks for sharing the code.

In the forward function of Transformer1d, text index is sliced with 0 ~ N-2 and image index is sliced with N-1 ~ N-1 + (T-1).

B, T = images.shape
_, N = texts.shape
...
x = torch.cat([texts, images], axis=1).contiguous()
...
texts = x[:, :N-1].contiguous()
images = x[:, N-1:-1].contiguous()

Could you please clarify why you didn't slice like below? Thanks!

texts = x[:, :N]
images = x[:, N:]

opened by j-min 2

CUDA out-of-memory

Hi, the "Transfer Learning Examples" section mentions that you fine-tuned the pre-trained DALL-E on 8 V100 GPUs. I tried running your transfer_learning_ex.py script on V100 GPUs (16 GB of memory per GPU). It throws a CUDA OOM error. Can you please share the exact specs of the hardware you used for this?

opened by smittal10 1

Comparison against GLIDE

Recently, OpenAI posted GLIDE, a diffusion model made for generating images from text, much like DALL-E.

Would it be possible to compare minDALL-E to GLIDE and put the results on GitHub?

Thank you in advance!

Also I have to say this is amazing!

opened by MyUsernamee 1

Amazing work; models CDN?

Hi there! Just want to quickly congratulate all the effort done in this project!

Will the models / tokenizers also be stored as GitHub release binaries? They could serve as a backup / alternative.

opened by johnpaulbin 1

sampling in GPU with 12 GB memory

I found that the sampling code in examples/sampling_ex.py fails to save the image if num_candidates is smaller than 16.

This is because the value 16 is hardcoded in line 61:

for i in range(16):

The modification below works for lower num_candidates values:

for i in range(min(16, args.num_candidates)):

opened by tackgeun 0

Project dependencies may have API risk issues

Hi, In minDALL-E, inappropriate dependency versioning constraints can cause risks.

Below are the dependencies and version constraints that the project is using

torch==1.8.0
torchvision>=0.8.2
tokenizers>=0.10.2
pyflakes>=2.2.0
tqdm>=4.46.0
pytorch-lightning>=1.5
einops
omegaconf
git+https://github.com/openai/CLIP.git
matplotlib

The == version constraint introduces a risk of dependency conflicts because the allowed range is too strict. Constraints with no upper bound (or *) introduce a risk of missing-API errors because the latest version of a dependency may remove some APIs.

After further analysis, the version constraint of the tqdm dependency in this project can be changed to >=4.36.0,<=4.64.0.

These suggested changes reduce dependency conflicts as much as possible and allow the latest versions that do not cause call errors in the project.

The project invokes all of the following methods.

Methods called from tqdm

tqdm.tqdm.set_description
tqdm.tqdm

All methods called in the project

self.resid_drop
torch.cuda.manual_seed_all
PIL.Image.fromarray
PIL.Image.fromarray.save
ExpConfig
self.key
hashlib.md5
module.weight.data.normal_
self.head
pytorch_lightning.loggers.TensorBoardLogger
self.lr_schedulers.get_last_lr
text_features.image_features.F.cosine_similarity.squeeze
W.B.device.H.torch.arange.repeat.transpose
numpy.transpose
min
argparse.ArgumentParser.add_argument
self.quantize.get_codebook_entry
self.v
sorted_idx_remove_cond.scatter
self.quant_conv
RuntimeError
self.apply
ImageNetDataModule
self.sos.repeat
pytorch_lightning.Trainer.fit
torchvision.transforms.Compose
self.stage2.sos
AttnBlock
model.stage1.from_ckpt
from_file
reversed
get_positional_encoding
datetime.datetime.now
tokens.to.unsqueeze
torch.nn.functional.cosine_similarity
probs.torch.multinomial.clone
self.encode
pl_module.stage1
self.down.append
Normalize
self.mid.block_1
download
self.conv1
Downsample
z_q.permute.contiguous
self.conv
OptConfig
torch.nn.functional.pad
Stage1Hparams
self.embedding
super
w_.permute.permute
i.images.astype
source.info.get
from_file.enable_truncation
self.norm2
random.seed
numpy.random.seed
os.path.expanduser
x.self.query.view
codes.device.T.torch.arange.repeat
layers.Block
device.args.num_candidates.args.softmax_temperature.args.top_p.args.top_k.args.prompt.model.sampling.cpu
self.conv_in
device.H.torch.arange.repeat
self.mlp.transpose
cutoff_topp_probs.masked_fill
self.norm1
k.reshape.reshape
torch.cuda.amp.autocast
x.contiguous.contiguous
loop.update
argparse.ArgumentParser.parse_args
prompt.clip.tokenize.to
self.tok_emb_txt
device.args.num_candidates.args.softmax_temperature.args.top_p.args.top_k.args.prompt.model.sampling.cpu.numpy
Stage2Hparams
os.path.dirname
torch.tril
self.ln1
pytorch_lightning.callbacks.ModelCheckpoint
cnt.code_.unsqueeze
model_clip.encode_text
y.transpose.contiguous.view
ImageNetDataModule.setup
tuple
enumerate
torch.nn.Linear
self.resid_drop.transpose
tokenizer.build_tokenizer
i_block.i_level.self.down.attn
self.register_buffer
self.dropout
torchvision.utils.make_grid
self.mid.attn_1
x.self.value.view
torch.randn
output.write
self.pos_emb_img
self.n_heads.C.self.n_heads.B.T.x.self.key.view.transpose
self.ln2
self.nin_shortcut
self.stage2.eval
self.lr_schedulers.step
self.blocks
os.path.abspath
model.stage2.from_ckpt
torch.multinomial
self.encoder
quant.permute.permute
min_encoding_indices.self.embedding.view
torch.nn.functional.interpolate
labels.self.sos.unsqueeze
print
torchvision.transforms.Normalize
sys.path.append
self.decoder
torch.einsum
self.norm_out
torch.optim.AdamW
images.self.stage1.get_codes.detach.view
MultiHeadSelfAttention
einops.rearrange
urllib.parse.urlparse
stage2.transformer.Transformer1d
self.stage1.get_codes
DataConfig
self.drop
omegaconf.OmegaConf.structured
dalle.models.Dalle.from_pretrained.sampling
preprocess_clip
images.torch.stack.to
tqdm.tqdm.set_description
utils.config.get_base_config
tqdm.tqdm
x.self.key.view
self.n_heads.C.self.n_heads.B.T.x.self.query.view.transpose
torch.cat.clone
self.decode
self.stage2
self.query
i_level.self.up.upsample
urllib.request.urlopen
torch.nn.ModuleList.append
self.conv2
source.info
self.n_heads.C.self.n_heads.B.T.x.self.value.view.transpose
self.lr_schedulers
layers.Encoder
tarfile.open
images.self.stage1.get_codes.detach
model_clip.encode_image
cutoff_topk_logits
utils.sampling.sampling
torch.nn.Sequential
torch.nn.ModuleList
setup_callbacks
self.value
tokens.to.to
self.log
math.sqrt
isinstance
omegaconf.OmegaConf.merge
open
torch.cat
torch.ones
torch.topk
self.proj_out.reshape
torch.argmin
self.q
self.stage1.parameters
os.path.join
os.path.exists
torch.utils.data.DataLoader
self.embedding.weight.data.uniform_
scores.torch.argsort.cpu
torch.nn.Module
cutoff_topk_logits.to
dalle.utils.utils.clip_score
int
cutoff_topk_logits.clone
N.x.contiguous
f.extract
torch.stack
torch.sort
self.attn_drop.masked_fill
torchvision.datasets.ImageNet
torchvision.transforms.CenterCrop
optimizer.step
download_target.open.read
cnt.pos_enc_code_.unsqueeze
args.config_downstream.os.path.basename.split
self
torch.optim.lr_scheduler.CosineAnnealingLR
stage1.vqgan.VQGAN
ValueError
torch.argsort
Stage1Config
range
torch.nn.functional.avg_pool2d
omegaconf.OmegaConf.load
self.sos
x.transpose.contiguous
torch.manual_seed
os.path.isfile
image.astype
present.torch.stack.clone
pl_module.logger.experiment.add_image
os.path.basename
ImageLogger
self.stage1.eval
pytorch_lightning.seed_everything
torch.cat.size
v.reshape.reshape
sos.self.stage2.sos.unsqueeze
torchvision.transforms.Resize
url.split
clip.tokenize
datetime.datetime.now.strftime
device.W.torch.arange.repeat
torch.nn.Conv2d
torch.nn.LayerNorm
dalle.utils.utils.set_seed
cls_idx.torch.LongTensor.to
torch.nn.functional.softmax
i_block.i_level.self.up.attn
ResnetBlock
torch.nn.functional.cross_entropy
probs.torch.multinomial.clone.detach
float
images.texts.torch.cat.contiguous
f.getmembers
z_q.permute.contiguous.view
dalle.models.Dalle.from_pretrained
source.read
VectorQuantizer
pytorch_lightning.Trainer
torch.sigmoid
self.tok_emb_img
i_block.i_level.self.down.block
torch.clamp
self.tokenizer.encode
h.self.quantize.view
self.conv_out
nonlinearity
model_clip.to
self.ln_f
q.permute.reshape
torch.arange
self.load_state_dict
q.permute.permute
self.k
functools.partial
torch.sum
self.stage2.sos.repeat
self.norm
self.mid.block_2
self.head_txt
cls
utils.realpath_url_or_path
torch.load
torch.no_grad
format
past.append
torchvision.transforms.ToTensor
device.N.torch.arange.repeat
presents.append
self.stage1.decode_code
self.quantize
from_file.token_to_id
os.makedirs
self.pos_emb_txt
torch.nn.Embedding
utils.sampling.sampling_igpt
code.clone.detach
dalle.models.ImageGPT.from_pretrained
z_q.permute.contiguous.permute
torchvision.transforms.RandomCrop
self.attn
Upsample
stage2.transformer.iGPT
self.post_quant_conv
torch.cumsum
super.__init__
download_target.open.read.hashlib.md5.hexdigest
self.proj_out
i_level.self.down.downsample
h.sos.torch.cat.contiguous
ImageNetDataModule.train_dataloader
self.stage2.view
self.head_img
self.proj
ImageNetDataModule.valid_dataloader
self.parameters
len
z.rearrange.contiguous
torch.clip
torch.nn.GroupNorm
torch.nn.Parameter
model.sampling
argparse.ArgumentParser
torch.nn.Dropout
sorted_idx_remove_cond.clone
block.sample
torch.LongTensor
self.log_img
from_file.enable_padding
torch.bmm
self.mlp
self.conv_shortcut
y.transpose.contiguous
recons.cpu.cpu
module.bias.data.zero_
GELU
self.up.insert
dataclasses.field
module.weight.data.fill_
clip.load
torch.nn.functional.gelu
i_block.i_level.self.up.block
present.torch.stack.clone.detach
from_file.add_special_tokens
Stage2Config
torch.repeat_interleave
dalle.models.Dalle.from_pretrained.to
layers.Decoder
scores.torch.argsort.cpu.numpy
cutoff_topp_probs
self.mask.torch.tril.view
sos.self.stage2.sos.unsqueeze.repeat
torch.cat.transpose
images.cpu.cpu
self.attn_drop
quant.rearrange.contiguous
z.rearrange.contiguous.view

@developer Could you please help me check this issue? May I open a pull request to fix it? Thank you very much.

opened by PyDeps 0

How to do inference from half image

Hi, I want to know whether the code can run inference when the input is the text and half of the image, as in iGPT and Taming Transformers. If so, would you mind pointing me to the relevant code?

opened by thuangb 0

Increasing positional embeddings text

I am finetuning the minDALL-E model on a self-made dataset but my tokenized text prompts are sometimes longer than 64. What would be the best technique to increase the length of the positional encodings to e.g. 128? I was thinking of keeping the original 64 embeddings and appending 64 more, which have to be trained from scratch. However, I think it might mess with the finetuning, since the embeddings are in the very first layer.

Are there better options/techniques to accomplish this?

opened by ChristiaensBert 0

How much VRAM is needed for this?

I was trying to run the sampling_ex.py, but no matter how low I set the num_candidates value (even if it's set to one or two), it always tells me that it has run out of memory. I am using an NVIDIA Quadro M5000 with 8 GB of VRAM.

opened by mjohanning99 2

Script for VQGAN Finetuning

This is an incredible project! For reproducibility, and for some of my own work, would you mind sharing/pointing me to code for fine-tuning VQGAN models (e.g., vqgan_imagenet_f16_16384) on custom datasets? This would be different than code for training VQGAN from scratch on different datasets.

Additionally, how long does fine-tuning take?

opened by siddk 0
