KoDALLE
Utilizing a pretrained language model's token embedding layer and position embedding layer as DALLE's text encoder.
Background
- Training a DALLE model from scratch demands a large paired dataset of images and captions. For example, OpenAI's DALLE was trained on more than 250 million text-image pairs.
- If the dataset isn't large enough or is limited to a specific domain, the vocabulary of the trained DALLE model is insufficient. For instance, the 1 million text captions of the K-Fashion dataset consist of only around 300 tokens.
- Therefore, inference from such DALLE models can be problematic when the given sentence query is unrelated to the captions the model was originally trained on.
KoDALLE's Results on a Small Fashion Dataset
| | OpenAI's DALLE | KoDALLE of HappyFace |
| --- | --- | --- |
| Train Dataset Size | 250 Million Pairs | 0.8 Million Pairs |
| #Params | 12 Billion | 428 Million |
| #Layers | 64 Layers | 16 Layers |
| Computing Resource | 1024 x V100 16GB | 1 x V100 32GB |
| Text Encoder | 16384 Vocab x 512 Dim BPE | 32000 Vocab x 1024 Dim klue/roberta-large |
| Image Encoder | VQVAE | VQGAN |
| Optimizer | AdamW | AdamW |
| Learning Rate | 4.5e-5 | 3.0e-5 |
| Weight Decay | 4.5e-3 | 3.0e-3 |
| LR Scheduler | ReduceLROnPlateau | - |
The team constructed a text-to-fashion-design DALLE model in Korean with fewer than 100k sampled text-image pairs.
| Caption | 아우터는 색상이 카키 소재가 우븐 핏이 루즈인 코트이다. 하의는 색상이 네이비 소재가 데님 핏이 스키니인 청바지이다. (The outer is a khaki, woven, loose-fit coat. The bottoms are navy, denim, skinny-fit jeans.) |
| --- | --- |
| Generated Image | (generated fashion image) |
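For reference, below is a minimal sketch of how a caption like the one above is turned into the fixed-length token sequence the model consumes, assuming the klue/roberta-large tokenizer and the TEXT_SEQ_LEN of 128 described in the Methodology section; variable names are illustrative.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("klue/roberta-large")

caption = "아우터는 색상이 카키 소재가 우븐 핏이 루즈인 코트이다."
tokens = tokenizer(
    caption,
    padding="max_length",  # pad up to the DALLE text sequence length
    truncation=True,
    max_length=128,        # TEXT_SEQ_LEN used in this repository
    return_tensors="pt",
)["input_ids"]             # LongTensor of shape (1, 128)

# A trained KoDALLE (see Methodology) consumes `tokens` through DALLE-pytorch's
# generate_images(tokens) to sample an image such as the one shown above.
```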
Methodology
Experiments were conducted with the embedding layers of the following Korean Transformer models. The team selected klue/roberta-large as the baseline in this repository, considering the size of the model; a sketch of extracting its embedding layers follows the list.
- klue/roberta-large: Vocab Size of 32000, Embedding Dimension of 1024.
- KoGPT Trinity of SKT: Vocab Size of 51200, Embedding Dimension of 1920.
- KoGPT of Kakao Brain: Vocab Size of 64512, Embedding Dimension of 4096.
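A minimal sketch of how the pretrained token embedding (wte) and position embedding (wpe) layers can be pulled out of klue/roberta-large with Hugging Face Transformers; the variable names are illustrative:

```python
from transformers import AutoModel

# Load the pretrained Korean RoBERTa and take its embedding layers.
roberta = AutoModel.from_pretrained("klue/roberta-large")

wte = roberta.embeddings.word_embeddings      # nn.Embedding(32000, 1024): token embeddings
wpe = roberta.embeddings.position_embeddings  # nn.Embedding(514, 1024): position embeddings

print(wte.weight.shape, wpe.weight.shape)
```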
KoDALLE uses klue/roberta-large's wpe and wte and is trainable in a 16GB-GPU Google Colab environment. The hyperparameters related to the DALLE model size are the following:
- BATCH_SIZE: 32
- DEPTH: 2
- TEXT_SEQ_LEN: 128
- VOCAB_SIZE: 32000
- MODEL_DIM: 1024
- ATTN_TYPES: full
- DIM_HEAD: 64
- HEADS: 8
- The DALLE model is built on lucidrains' DALLE-pytorch.
- The image encoder is constructed based on VQGAN (Taming Transformers); a construction sketch follows below.
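A minimal sketch of how such a model could be assembled with DALLE-pytorch and the klue/roberta-large embeddings extracted above, using the hyperparameters listed in this section. This is an illustration under stated assumptions rather than the exact training script of this repository; keyword arguments may differ slightly across DALLE-pytorch versions.

```python
import torch
from dalle_pytorch import DALLE, VQGanVAE

# VQGAN (Taming Transformers) image encoder/decoder; with no checkpoint paths given,
# DALLE-pytorch downloads a default pretrained VQGAN (requires the taming-transformers package).
vae = VQGanVAE()

dalle = DALLE(
    dim=1024,               # MODEL_DIM
    vae=vae,
    num_text_tokens=32000,  # VOCAB_SIZE of klue/roberta-large
    text_seq_len=128,       # TEXT_SEQ_LEN
    depth=2,                # DEPTH
    heads=8,                # HEADS
    dim_head=64,            # DIM_HEAD
    attn_types=("full",),   # ATTN_TYPES
    rotary_emb=False,       # keep a learned text position embedding so wpe can be copied in
)

# Initialize DALLE's text encoder from the pretrained wte / wpe (see the Methodology
# sketch above). DALLE-pytorch reserves extra per-position padding tokens at the end
# of its text embedding, so only the shared rows are overwritten.
with torch.no_grad():
    dalle.text_emb.weight[: wte.num_embeddings] = wte.weight
    n_pos = dalle.text_pos_emb.weight.shape[0]
    dalle.text_pos_emb.weight[:] = wpe.weight[:n_pos]
```

After training, images can be sampled from tokenized captions with DALLE-pytorch's `dalle.generate_images(tokens)`, as referenced in the tokenization sketch above.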
Significance
- Offers promising results for training from scratch on specific domains with a small dataset.
- Introduces a solution for making domain-specific DALLE & CLIP models robust to input sentences.
- Recommends an adequate text-to-image model size for a given computing resource.
- Suggests an effortless method of creating DALLE & CLIP models for one's own language when a pretrained language model is available.
WIP
- Add an image-caption reranker (EfficientNet + klue/roberta-large).
- Train a model with 500k text-image pairs.
- Modularize the Python code.
- Update the inference code.
- Update FID and IS metrics on the test and validation datasets.