Text-to-Image generation

Overview

Generate vivid Images for Any (Chinese) text

CogView is a pretrained (4B-parameter) transformer for general-domain text-to-image generation.

@article{ding2021cogview,
  title={CogView: Mastering Text-to-Image Generation via Transformers},
  author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and Tang, Jie},
  journal={arXiv preprint arXiv:2105.13290},
  year={2021}
}

Getting Started

Setup

  • Hardware: Linux servers with Nvidia V100s or A100s are recommended, but you can also run the pretrained models with a smaller --max-inference-batch-size, or train smaller models, on less powerful GPUs.

  • Environment (Option 1): Please first install PyTorch (>=1.7.0) and apex, and then install other dependencies via pip install -r requirements.txt.

  • Environment (Option 2): We provide a Docker image in case you have trouble setting up the environment. Pull the image, create a (background) container, and enter it via:

    docker pull cogview/cuda111_torch181_deepspeed040
    ./env/start_docker.sh && docker exec -it bg-cogview bash
    
    cd /root/cogview # in the container
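
For Option 1, it can help to confirm the two stated prerequisites (PyTorch >= 1.7.0 and apex) before launching any script. The following is a minimal sketch, not part of the repository; the helper names version_at_least and check_environment are hypothetical.

```python
import importlib.util
import re


def version_at_least(found, required):
    """Compare the leading numeric parts of two version strings."""
    def parse(version):
        # Keep only the leading dotted integers: "1.8.1+cu111" -> (1, 8, 1).
        match = re.match(r"\d+(?:\.\d+)*", version)
        return tuple(int(p) for p in match.group(0).split(".")) if match else ()

    a, b = parse(found), parse(required)
    width = max(len(a), len(b))
    # Pad with zeros so that "1.7" and "1.7.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment(min_torch="1.7.0"):
    """Return a list of problems; an empty list means both checks passed."""
    problems = []
    if importlib.util.find_spec("torch") is None:
        problems.append("PyTorch is not installed")
    else:
        import torch
        if not version_at_least(torch.__version__, min_torch):
            problems.append(f"PyTorch {torch.__version__} is older than {min_torch}")
    if importlib.util.find_spec("apex") is None:
        problems.append("apex is not installed")
    return problems
```

Calling check_environment() inside the target environment returns the list of missing prerequisites, which can be printed before running the scripts.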
    

Download

  1. Download the image tokenizer vqvae_hard_biggerset_011.pt from the BAAI website or Tsinghua Cloud. Place the file under pretrained/vqvae.

    wget https://cloud.tsinghua.edu.cn/f/71607a5dca69417baa8c/?dl=1 -O pretrained/vqvae/vqvae_hard_biggerset_011.pt

  2. Download models from Project Wudao-Wenhui.

    FileName             Description
    cogview-base.tar     The pretrained text-to-image model.
    cogview-caption.tar  Finetuned image-to-text model, also used for reranking.
    cogview-sr.tar       Finetuned super-resolution model. (Warning: it runs slowly.)

    Uncompress them into pretrained/cogview/. The command below is a template: replace {base, sr, caption} with the name of the model you downloaded.

    tar -xvf cogview-{base, sr, caption}.tar -C pretrained/cogview/

  3. (Only needed for the training tutorial; skip it for inference.) Download the Alibaba item-title image tokens dataset from our link at Tianchi (TODO). Place the lmdb folder under ./data.
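
The download steps above place files in fixed locations, so a quick check can catch a misplaced file before inference starts. This is a minimal sketch under the assumption that the paths are exactly those named in the steps; missing_paths is a hypothetical helper, not a repository function.

```python
from pathlib import Path

# Locations named in the download steps, relative to the repository root.
INFERENCE_PATHS = [
    "pretrained/vqvae/vqvae_hard_biggerset_011.pt",
    "pretrained/cogview",
]
# Only needed for the training tutorial.
TRAINING_PATHS = ["data"]


def missing_paths(repo_root, training=False):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    expected = INFERENCE_PATHS + (TRAINING_PATHS if training else [])
    return [p for p in expected if not (root / p).exists()]
```

For example, missing_paths(".") run from the repository root returns an empty list once the tokenizer and at least the pretrained/cogview/ folder are in place.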

Run CogView! (Model Inference)

We encapsulate the generation functions into scripts. See generate_samples.py and arguments.py for details.

Text-to-Image Generation

Write text queries (one per line) into input.txt and run:

./scripts/text2image.sh --debug

The results will be in a new folder, samples_text2image/.

Arguments useful in inference are mainly:

  • --input-source [path or "interactive"]. The path of the input file; "interactive" launches a CLI instead.
  • --output-path [path]. The folder containing the results.
  • --batch-size [int]. The number of samples generated per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward pass. Reduce it if you run out of memory (OOM).
  • --debug. Only save concatenated images for all generated samples, and name them by input text and date.
  • --with-id. When set, you must specify an "id" before each input, e.g. 001\t一个漂亮的女孩 ("a pretty girl"), where \t denotes a TAB (NOT a space). It generates batch-size separate images in a folder named "id" for each input. Conflicts with --debug.
  • --device [int]. The index of the GPU to run on.
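
The input file format described above (one query per line, with an optional id and TAB prefix for --with-id) can be produced programmatically. The sketch below is an illustration only: format_queries is a hypothetical helper, and the zero-padded sequential ids are an assumption modeled on the 001 example.

```python
def format_queries(queries, with_id=False):
    """Return the text for input.txt: one query per line, optionally id-prefixed."""
    lines = []
    for number, text in enumerate(queries, start=1):
        text = text.strip()
        if not text or "\t" in text or "\n" in text:
            raise ValueError(f"query {number} must be one non-empty line without TABs")
        # With --with-id, each line is "<id><TAB><text>"; ids here are assumed
        # to be zero-padded sequence numbers such as 001, 002, ...
        lines.append(f"{number:03d}\t{text}" if with_id else text)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    queries = ["一个漂亮的女孩", "一只可爱的小猫。"]
    with open("input.txt", "w", encoding="utf-8") as handle:
        handle.write(format_queries(queries, with_id=True))
```

Writing the file with UTF-8 encoding keeps the Chinese queries intact; pass with_id=False when running with --debug, since the two options conflict.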

Super-resolution

Run the following script and enter lines of the form text\t{image_path}, where {image_path} is the path of a previously generated image.

./scripts/super_resolution.sh

Note: It is only effective for images generated with our image tokenizer (due to the token distribution).
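
Because the super-resolution input must contain a real TAB between the text and the image path, a line typed with spaces cannot be split into the two expected fields. The following sketch builds and validates such lines before they are passed to the script; make_sr_line and parse_sr_line are hypothetical helpers, not part of the repository.

```python
from pathlib import Path


def make_sr_line(text, image_path):
    """Build one "text<TAB>image_path" input line for super-resolution."""
    if "\t" in text or "\t" in str(image_path):
        raise ValueError("fields must not contain TAB characters")
    return f"{text}\t{image_path}"


def parse_sr_line(line, check_exists=False):
    """Split an input line into (text, image_path), rejecting malformed lines."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 2 or not all(fields):
        raise ValueError(f"expected 'text<TAB>image_path', got {len(fields)} field(s)")
    text, image_path = fields
    # Optionally confirm that the previously generated image is really there.
    if check_exists and not Path(image_path).is_file():
        raise FileNotFoundError(image_path)
    return text, image_path
```

Validating each line with parse_sr_line(line, check_exists=True) before starting the script surfaces a missing TAB or a wrong path early.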

Image-to-Text

The input is one image path per line; the results are printed to stdout.

./scripts/image2text.sh

Note: The model is not optimized for this task, so it might not be very competitive (but it is okay). We will consider releasing a version finetuned for a longer period on this task in the future. (TODO)

Post-selection

This application only takes file inputs, where each line is {text}\t{image_path1}\t{image_path2}\t{image_path3}.... The output is {output_path}/scores.txt, in which each line is a list of scores for the corresponding input line.

./scripts/post_selection.sh
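
As an illustration of the input and output formats described for post-selection, the sketch below builds one input line and picks an image from a line of scores. Several things here are assumptions: the helper names are hypothetical, the exact layout of a scores.txt line is not specified in this document (so the parser simply extracts every number it finds), and whether a higher or lower score is better is not stated, so the caller must choose explicitly.

```python
import re

# Matches plain and scientific-notation numbers such as 0.5, -2, or 1e-3.
_NUMBER = re.compile(r"[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?")


def build_selection_line(text, image_paths):
    """Build one "{text}<TAB>{image_path1}<TAB>{image_path2}..." input line."""
    fields = [text, *image_paths]
    if not image_paths:
        raise ValueError("at least one image path is required")
    if any("\t" in f or "\n" in f for f in fields):
        raise ValueError("fields must not contain TABs or newlines")
    return "\t".join(fields)


def parse_scores(line):
    """Extract every number from one line of scores.txt (layout is assumed)."""
    return [float(token) for token in _NUMBER.findall(line)]


def best_image(image_paths, scores, *, higher_is_better):
    """Return the path whose score is best under the caller's chosen direction."""
    if len(image_paths) != len(scores):
        raise ValueError("need exactly one score per image")
    pick = max if higher_is_better else min
    index = pick(range(len(scores)), key=scores.__getitem__)
    return image_paths[index]
```

Check the direction of the scores against a few known-good and known-bad images before relying on best_image.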

Note: In the released code, for simplicity, we did not expose the raw API, which supports some advanced generation modes, e.g. text and part of an image.

Training

Here we use a subset of our dataset from Alibaba item-title for the tutorial.

Single Node

After downloading the dataset, directly run

./scripts/pretrain_single_node.sh

Multiple Nodes

If you want to train the models on multiple servers interconnected by InfiniBand without a shared file system (you may need pdsh to accelerate this process):

  1. On each server, use git clone to download this repo, and make sure the data (LMDB format) are moved into the data subfolder.
  2. On each server, run echo "ip1 ip2 <other IPs>" > ./docker/ip_list.txt, and then start the Docker container with ./env/start_docker.sh.
  3. Enter the container on the first node via docker exec -it bg-cogview bash.
  4. Change into /root/cogview and run ./scripts/pretrain_multiple_nodes.sh. You may need to change the config (especially OPTIONS_NCCL) in the shell script.
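
The ip_list.txt file from step 2 is a single line of space-separated addresses. When the node list comes from another tool, the line can be generated and validated as sketched below; format_ip_list is a hypothetical helper, and rejecting anything that is not a literal IP address is an assumption of this sketch.

```python
import ipaddress


def format_ip_list(addresses):
    """Return the single space-separated line expected in ip_list.txt."""
    if not addresses:
        raise ValueError("at least one address is required")
    # ip_address() raises ValueError for anything that is not a valid
    # IPv4 or IPv6 address.
    checked = [str(ipaddress.ip_address(a)) for a in addresses]
    if len(set(checked)) != len(checked):
        raise ValueError("duplicate address in list")
    return " ".join(checked) + "\n"


if __name__ == "__main__":
    # Placeholder addresses; replace them with the real node IPs.
    print(format_ip_list(["10.0.0.1", "10.0.0.2"]), end="")
```

The output can be redirected to ./docker/ip_list.txt on each server, matching the echo command in step 2.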

See arguments.py for advanced training options. (TODO)

Gallery

[Image: more samples]

Comments
  • Got error ''IndexError: tuple index out of range'' running super-res on colab with a tesla v100

    /content/CogView
    Generate Samples
    WARNING: No training data specified
    using world size: 1 and model-parallel size: 1
    using dynamic loss scaling
    initializing model parallel with size 1
    initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    padded vocab (size: 58219) with 21 dummy tokens (new size: 58240)
    prepare tokenizer done
    building CogView2 model ...
    number of parameters on model parallel rank 0: 3928849920
    current device: 0
    tcmalloc: large alloc 7881007104 bytes == 0x5637e3fb2000 @ 0x7f61e428db6b 0x7f61e42ad379 0x7f6171f1e25e 0x7f6171f1f9d2 0x7f61aff48e7d 0x7f61c0b43120 0x7f61c0781bd9 0x5637152088a8 0x56371527bfd5 0x5637152767ad 0x5637152093ea 0x5637152773b5 0x5637152767ad 0x563715209003 0x563715208b09 0x56371535028d 0x5637152bf1db 0x563715207bb1 0x5637152f8fed 0x56371527b988 0x5637152767ad 0x563715148e2c 0x563715278bb5 0x5637152764ae 0x5637152093ea 0x56371527832a 0x56371520930a 0x5637152773b5 0x56371520930a 0x5637152773b5 0x5637152764ae
    Load model file pretrained/cogview/cogview-sr/20000/mp_rank_00_model_states.pt
    Working on No. 0 on 0...
    Traceback (most recent call last):
      File "generate_samples.py", line 326, in <module>
        main()
      File "generate_samples.py", line 323, in main
        generate_images_continually(model, args)
      File "generate_samples.py", line 215, in generate_images_continually
        for raw_text, seq, output_path in get_context(args, query_template):
      File "generate_samples.py", line 132, in get_context
        seq = _parse_and_to_tensor(raw_text, img_size=img_size, query_template=query_template)
      File "generate_samples.py", line 70, in _parse_and_to_tensor
        text = query_template.format(*text.split('\t'))
    IndexError: tuple index out of range
    /content

    opened by johngore123 17
  • Can't apply to download model.

    Good afternoon,

    I tried to register on wudaoai and put in my information so I could download the model, but Wudaoai did not accept my phone number, probably because I am from Brazil. I had to look for a Chinese phone number online to apply, but I doubt they will allow me to download the model because of this. What can I do?

    good first issue 
    opened by PapayasTehSkeletor 15
  • about evaluation

    Hi, how did you get an FID of "26.0" on MS COCO using DM-GAN? The official result reported in https://github.com/MinfengZhu/DM-GAN is 26.55. I ran DM-GAN myself and got a similar result (26.54), not "26.0".

    opened by FrankCast1e 11
  • Colab error

    I found a Colab file reference in a closed issue. The last cell (inference) shows this error. Does anyone know of a solution or a more recent Colab notebook? The web-based version of CogView works but takes a while to process queued requests. I used the "insert code" icon when editing this and for some reason, the output is all connected together making it unreadable.

    /content/CogView
    Traceback (most recent call last):
      File "generate_samples.py", line 28, in <module>
        from utils import Timers
      File "/content/CogView/utils.py", line 25, in <module>
        from fp16 import FP16_Optimizer
      File "/content/CogView/fp16/__init__.py", line 15, in <module>
        from .fp16util import (
      File "/content/CogView/fp16/fp16util.py", line 21, in <module>
        import mpu
      File "/content/CogView/mpu/__init__.py", line 35, in <module>
        from .layers import ColumnParallelLinear
      File "/content/CogView/mpu/layers.py", line 28, in <module>
        from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
    ModuleNotFoundError: No module named 'apex'

    opened by metaphorz 8
  • Out of memory when using Text 2 Image

    [Wed Jun 16 19:21:01 2021] Memory cgroup out of memory: Killed process 15052 (python3) total-vm:19700460kB, anon-rss:11917232kB, file-rss:89696kB, shmem-rss:12288kB, UID:0 pgtables:25896kB oom_score_adj:0

    Is what I can get after the process has been killed. Is there a way to optimize this to run on a GPU with lower ram? I'm using a Tesla T4 on Google Colab.

    Thanks.

    opened by johnpaulbin 8
  • Filter unsafe inputs (better)

    Guys, can you do something to prohibit inputs like this one image

    Even worse, someone’s already trying to generate underage porn: https://user-images.githubusercontent.com/188197/122180141-d4754600-ce90-11eb-96f6-f8cbacee8da1.mp4

    Maybe add some Google-authed signup with the possibility of banning to at least make it harder for creeps to use the model for such things?

    enhancement 
    opened by vzakharov 6
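  • The math in this paper

    I would like to ask about the math in this paper:

    1. Is there a more detailed derivation of the ELBO after text is added? Is it simply the image-only ELBO inequality with an NLL loss for text added to both sides? As far as I can tell, the text plays no role in VQVAE training, and this ELBO is for the VQVAE, so I don't understand why there is a text loss term.
    2. I don't quite understand how Eq. (2) becomes Eq. (3).

    Thank you.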

    opened by brianw0924 5
  • Hello! CUDA out of memory when loading the pretrained model of cogview-caption

    Hello! Our team plans to load the pre-trained cogview-caption model and finetune it on V100s, consistent with the pre-training on V100s described in the paper. However, we get "CUDA out of memory", and training cannot be launched unless model-parallel-size is set to 4. How can we load the pre-trained model and finetune it on a V100? @neozhangthe1 @Sleepychord @lykeven @cenyk1230 @Somefive

    opened by starmemda 5
  • How to finetune the CogView to perform image captioning?

    Hello, I wonder how to finetune the CogView model to perform image captioning? Here is my question: what is the format of the input text? I notice that the format of input text in your code is [ROI1], text, [BASE], [BOI1], image, [EOI1]. Therefore, what should I change for finetuning to image captioning? Just change the format into [BASE], [BOI1], image, [EOI1], [ROI1], text, or how?

    Looking forward to your reply, thanks!

    opened by ReneTCAd 3
  • docker pull error, "You have reached your pull rate limit"

    Is there any possibility to share the Dockerfile?

    I was trying to use the docker since it's difficult to build the apex library. Whereas when I used docker pull like the below, "docker pull cogview/cuda111_torch181_deepspeed040"

    I got the following error message: "Using default tag: latest Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"

    Therefore, an original Dockerfile may be better.

    opened by XuanxuanGao 3
  • script to finetune Cogview-base

    Hi, I'm trying to finetune Cogview pretrained model. However, when I try to load model weights, I get following error:

    RuntimeError: Error(s) in loading state_dict for GPT2Model:
    size mismatch for word_embeddings.weight: copying a param with shape torch.Size([14560, 2560]) from checkpoint, the shape in current model is torch.Size([14592, 2560]).

    Here is my script:

    NUM_WORKERS=1
    NUM_GPUS_PER_WORKER=4
    MP_SIZE=1

    script_path=$(realpath $0)
    echo $script_path
    script_dir=$(dirname $script_path)
    main_dir=$(dirname $script_dir)

    OPTIONS_NCCL="NCCL_DEBUG=info"
    HOST_FILE_PATH="hostfile_single"

    config_json="$script_dir/ds_config_zero.json"
    gpt_options="
    --experiment-name cogview-test_finetune
    --img-tokenizer-num-tokens 8192
    --dataset-type TokenizedDataset
    --model-parallel-size ${MP_SIZE}
    --batch-size 4
    --num-layers 48
    --hidden-size 2560
    --num-attention-heads 40
    --save ./
    --train-iters 2000
    --save-interval 800
    --resume-dataloader
    --train-data /path/to/my/data
    --split 90,5,5
    --distributed-backend nccl
    --lr-decay-style cosine
    --warmup .1
    --checkpoint-activations
    --deepspeed-activation-checkpointing
    --max-position-embeddings 1089
    --max-memory-length 0
    --fp16
    --txt-loss-scale 5
    --load /path/to/cogview
    --no-load-rng
    --model-parallel-size 2
    --num-workers 16
    --is-sparse 0
    --finetune
    --shuffle "

    gpt_options="${gpt_options} --deepspeed
    --deepspeed_config ${config_json}
    "

    run_cmd="${OPTIONS_NCCL} deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --hostfile ${HOST_FILE_PATH} pretrain_gpt2.py $@ ${gpt_options}"

    It will be great if you can provide some details for finetuning. Thanks!

    opened by luyang-huang96 3
  • Copyright problems

    Who owns the copyright of the images generated by the engine? Or is the engine generated image assigned to be published under a certain protocol (like CC0)?

    opened by Saintafox 1
  • After training, there are only 2G .pt files left. What's wrong with the missing files?

    After training on my own dataset, the .pt file I got was not 7 GB in size. Running ./scripts/text2image.sh --debug with it fails with "RuntimeError: Error(s) in loading state_dict for GPT2Model:". Can anyone help me? Thank you.

    opened by tt-s-t 1
  • Do you have a model checkpoint smaller than 7GB?

    Hi, I found your model checkpoint at https://resource.wudaoai.cn/home?ind=2&name=WuDao%20WenHui&id=1399364355975327744 too large for me to run on my PC, as the hidden size of the checkpoint is 2560. Do you have a checkpoint with a smaller hidden size, such as 1024? or can I easily shift the size of the checkpoint you released smaller? Thank you!

    opened by ZihanWangRuc 0
  • Is the CogView2 demo available here the same as the one in the released CogView2 paper?

    opened by apolinario 1
  • RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

    The environment is the Docker image provided by the authors, and the GPU is a Tesla P100-PCIE 16GB. Running ./scripts/text2image.sh --debug fails with the following error:

    Generate Samples
    WARNING: No training data specified
    using world size: 1 and model-parallel size: 1
    using dynamic loss scaling
    initializing model parallel with size 1
    initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    padded vocab (size: 58219) with 21 dummy tokens (new size: 58240)
    prepare tokenizer done
    building CogView2 model ...
    number of parameters on model parallel rank 0: 3928849920
    current device: 1
    Load model file pretrained/cogview/cogview-base/142000/mp_rank_00_model_states.pt
    Working on No. 0 on 0...
    show raw text: 一只可爱的小猫。
    Traceback (most recent call last):
      File "generate_samples.py", line 329, in <module>
        main()
      File "generate_samples.py", line 326, in main
        generate_images_continually(model, args)
      File "generate_samples.py", line 221, in generate_images_continually
        generate_images_once(model, args, raw_text, seq, num=args.batch_size, output_path=output_path)
      File "generate_samples.py", line 166, in generate_images_once
        output_tokens_list.append(filling_sequence(model, seq.clone(), args))
      File "/root/cogview/generation/sampling.py", line 128, in filling_sequence
        logits, *mems = model(tokens, position_ids, attention_mask, txt_indices_bool, img_indices_bool, is_sparse=args.is_sparse, *mems)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/fp16/fp16.py", line 65, in forward
        return fp16_to_fp32(self.module(*(fp32_to_fp16(inputs)), **kwargs))
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/model/gpt2_modeling.py", line 112, in forward
        transformer_output = self.transformer(embeddings, position_ids, attention_mask, txt_indices_bool, img_indices_bool, is_sparse, *mems)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/mpu/sparse_transformer.py", line 604, in forward
        hidden_states = layer(*args, mem=mem_i)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/mpu/sparse_transformer.py", line 322, in forward
        attention_output = self.attention(layernorm_output1, ltor_mask, pivot_idx, is_sparse, mem)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/mpu/sparse_transformer.py", line 166, in forward
        output = self.dense(context_layer)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "/root/cogview/mpu/layers.py", line 319, in forward
        output_parallel = F.linear(input_parallel, self.weight)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/functional.py", line 1753, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

    I hope someone can help me with this problem. Thank you.

    opened by acerhp 0
Owner
THUDM
Data Mining Research Group at Tsinghua University