Text-to-Image generation

Overview

Generate vivid images for any (Chinese) text

(teaser image)

CogView is a pretrained (4B-parameter) transformer for general-domain text-to-image generation.

  • Read our paper CogView: Mastering Text-to-Image Generation via Transformers on arXiv for a formal introduction. The PB-relax and Sandwich-LN techniques introduced there can also help you train large and deep transformers stably (e.g. by eliminating NaN losses).
  • Visit our demo at the GitHub Page or Wudao! (The demo runs without post-selection or super-resolution and currently only supports simplified Chinese input, but you can translate text from other languages into Chinese before inputting it. Note: Wudao provides faster access for users from mainland China.)
  • Download our pretrained models from Project Wudao-Wenhui (悟道-文汇).
  • Cite our paper if you find our work helpful~
@article{ding2021cogview,
  title={CogView: Mastering Text-to-Image Generation via Transformers},
  author={Ding, Ming and Yang, Zhuoyi and Hong, Wenyi and Zheng, Wendi and Zhou, Chang and Yin, Da and Lin, Junyang and Zou, Xu and Shao, Zhou and Yang, Hongxia and Tang, Jie},
  journal={arXiv preprint arXiv:2105.13290},
  year={2021}
}
  • Google Colab: two contributors successfully set up CogView on Colab. Links to Colab!

Getting Started

Setup

  • Hardware: Linux servers with NVIDIA V100s or A100s are recommended, but you can also run the pretrained models with a smaller --max-inference-batch-size, or train smaller models, on less powerful GPUs.

  • Environment (Option 1): First install PyTorch (>=1.7.0) and apex, then install the other dependencies via pip install -r requirements.txt. A sketch is shown below.
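
    A minimal sketch of Option 1 (the exact PyTorch/CUDA versions here are assumptions; pick the build matching your driver):

    # PyTorch >= 1.7.0 with a matching CUDA build
    pip install torch==1.8.1+cu111 -f https://download.pytorch.org/whl/torch_stable.html
    # apex is built from source to get the fused CUDA kernels
    git clone https://github.com/NVIDIA/apex && cd apex
    pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
    cd .. && pip install -r requirements.txt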

  • Environment (Option 2): We provide a Docker image in case you have trouble setting up the environment yourself. Pull the image, create a (background) container, and get into it via:

    docker pull cogview/cuda111_torch181_deepspeed040
    ./env/start_docker.sh && docker exec -it bg-cogview bash
    
    cd /root/cogview # in the container
    

Download

  1. Download the image tokenizer vqvae_hard_biggerset_011.pt from the BAAI website or Tsinghua Cloud, and place the file under pretrained/vqvae/.
wget https://cloud.tsinghua.edu.cn/f/71607a5dca69417baa8c/?dl=1 -O pretrained/vqvae/vqvae_hard_biggerset_011.pt
  2. Download models from Project Wudao-Wenhui.

    Filename             Description
    cogview-base.tar     The pretrained text-to-image model.
    cogview-caption.tar  The finetuned image-to-text model, also used for reranking.
    cogview-sr.tar       The finetuned super-resolution model. (Warning: it runs slowly.)

    Uncompress them into pretrained/cogview/. Modify the model name in the following command accordingly:

    tar -xvf cogview-base.tar -C pretrained/cogview/  # likewise for cogview-caption.tar and cogview-sr.tar
    
  3. (Only for the training tutorial; skip this for inference.) Download a small "bird-and-animal" example dataset from our link at Tsinghua Cloud.

wget https://cloud.tsinghua.edu.cn/f/1e4963ec8ac84941ba68/?dl=1 -O data/bird_animal.bin

Run CogView! (Model Inference)

We encapsulate the generation functions into scripts. See generate_samples.py and arguments.py for details.

Text-to-Image Generation

Write text queries (one per line) into input.txt and run:

./scripts/text2image.sh --debug

The results will appear in a new folder samples_text2image/.
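
For example, a minimal run might look like this (the two queries are only illustrative):

    # two queries, one per line (simplified Chinese input):
    # "a cute kitten" and "the Great Wall at sunset"
    printf '一只可爱的小猫\n夕阳下的长城\n' > input.txt
    ./scripts/text2image.sh --debug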

The main arguments useful for inference are:

  • --input-source [path or "interactive"]. The path of the input file; it can also be "interactive", which launches a CLI.
  • --output-path [path]. The folder containing the results.
  • --batch-size [int]. The number of samples generated per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward pass; reduce it if you hit OOM.
  • --debug. Only save concatenated images of all generated samples, named by the input text and date.
  • --with-id. When toggled, you must specify an "id" before each input, e.g. 001\t一个漂亮的女孩 ("a beautiful girl"), where \t denotes TAB (NOT spaces). For each input it will generate batch-size separate images in a folder named by the "id". Conflicts with --debug.
  • --device [int]. Which GPU to run on.
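
As a sketch of how these flags combine (assuming the wrapper script forwards them to generate_samples.py; the values are illustrative):

    # interactive CLI, 8 samples per query, split into forward passes of 2 to avoid OOM, on GPU 0
    ./scripts/text2image.sh --input-source interactive --output-path my_samples --batch-size 8 --max-inference-batch-size 2 --device 0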

Super-resolution

Run the following script and input text\t{image_path}, where {image_path} is the path of a previously generated image.

./scripts/super_resolution.sh

Note: it is only effective for images generated by our image tokenizer (due to the token distribution).
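
A hypothetical interactive session (the image path is made up; use one of your own generated samples, and <TAB> stands for a literal TAB character):

    ./scripts/super_resolution.sh
    # at the prompt, enter the text, a TAB, then the image path, e.g.
    # 一个漂亮的女孩<TAB>samples_text2image/sample.jpg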

Image-to-Text

The input is one image path per line; the results are printed to stdout.

./scripts/image2text.sh

Note: the model is not optimized for this task, so it may not be very competitive (but okay). We will consider releasing a version finetuned for a longer period on this task in the future. (TODO)
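
A minimal sketch (assuming the same input.txt / --input-source mechanism as text-to-image; the paths are hypothetical):

    # one image path per line; captions are printed to stdout
    printf 'samples_text2image/cat.jpg\nsamples_text2image/wall.jpg\n' > input.txt
    ./scripts/image2text.sh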

Post-selection

This application only takes file input, where each line is {text}\t{image_path1}\t{image_path2}\t{image_path3}.... The output is {output_path}/scores.txt; each of its lines is a list of scores corresponding to the same line of the input.

./scripts/post_selection.sh
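
For instance, one input line ranking three candidate images for a text, and the matching output line (the paths and scores are hypothetical; <TAB> is a literal TAB):

    # a line in the input file:
    #   一个漂亮的女孩<TAB>img_0.jpg<TAB>img_1.jpg<TAB>img_2.jpg
    # the corresponding line in {output_path}/scores.txt, one score per image:
    #   [2.31, 1.87, 2.95]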

Note: for simplicity, the released code does not expose the raw API, which supports some advanced generation modes, e.g. conditioning on text plus part of an image.

Training

Here we use a small bird-and-animal subset of our dataset for the tutorial. The binary dataset is generated by our cogdata toolkit. Please wait for a formal release of cogdata with tutorials (although it is already available now).

Single Node

After downloading the dataset, directly run

./scripts/pretrain_single_node.sh

Multiple Nodes

If you want to train the models on multiple servers interconnected by InfiniBand without a shared file system (you may need pdsh to accelerate this process):

  1. On each server, use git clone to download this repo, and make sure the data (LMDB format) is placed in the data subfolder.
  2. On each server, run echo "ip1 ip2 <other IPs>" > ./docker/ip_list.txt, and then start the Docker container via ./env/start_docker.sh.
  3. Get into the Docker container on the first node via docker exec -it bg-cogview bash.
  4. Get into /root/cogview and run ./scripts/pretrain_multiple_nodes.sh. You may need to change the config (especially OPTIONS_NCCL) in the shell script; a sketch follows this list.
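
A sketch of what the OPTIONS_NCCL line in the script might be changed to for an InfiniBand cluster (the variable values are assumptions to adapt, not the script's defaults):

    # in scripts/pretrain_multiple_nodes.sh
    OPTIONS_NCCL="NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_SOCKET_IFNAME=eth0 NCCL_NET_GDR_LEVEL=2"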

See arguments.py for advanced training options. TODO

Gallery

(more sample images)

Comments
  • Got error "IndexError: tuple index out of range" running super-res on Colab with a Tesla V100

    /content/CogView
    Generate Samples
    WARNING: No training data specified
    using world size: 1 and model-parallel size: 1
    using dynamic loss scaling
    initializing model parallel with size 1
    initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    padded vocab (size: 58219) with 21 dummy tokens (new size: 58240)
    prepare tokenizer done
    building CogView2 model ...
    number of parameters on model parallel rank 0: 3928849920
    current device: 0
    tcmalloc: large alloc 7881007104 bytes == 0x5637e3fb2000 @ 0x7f61e428db6b 0x7f61e42ad379 0x7f6171f1e25e 0x7f6171f1f9d2 0x7f61aff48e7d 0x7f61c0b43120 0x7f61c0781bd9 0x5637152088a8 0x56371527bfd5 0x5637152767ad 0x5637152093ea 0x5637152773b5 0x5637152767ad 0x563715209003 0x563715208b09 0x56371535028d 0x5637152bf1db 0x563715207bb1 0x5637152f8fed 0x56371527b988 0x5637152767ad 0x563715148e2c 0x563715278bb5 0x5637152764ae 0x5637152093ea 0x56371527832a 0x56371520930a 0x5637152773b5 0x56371520930a 0x5637152773b5 0x5637152764ae
    Load model file pretrained/cogview/cogview-sr/20000/mp_rank_00_model_states.pt
    Working on No. 0 on 0...
    Traceback (most recent call last):
      File "generate_samples.py", line 326, in <module>
        main()
      File "generate_samples.py", line 323, in main
        generate_images_continually(model, args)
      File "generate_samples.py", line 215, in generate_images_continually
        for raw_text, seq, output_path in get_context(args, query_template):
      File "generate_samples.py", line 132, in get_context
        seq = _parse_and_to_tensor(raw_text, img_size=img_size, query_template=query_template)
      File "generate_samples.py", line 70, in _parse_and_to_tensor
        text = query_template.format(*text.split('\t'))
    IndexError: tuple index out of range
    /content

    opened by johngore123 17
  • Can't apply to download model.

    Good afternoon,

    I tried to register on wudaoai and put in my information so I could download the model, but Wudaoai did not accept my phone number, probably because I am from Brazil. I had to look for a Chinese phone number online to apply, but I doubt they will allow me to download the model because of this. What can I do?

    good first issue 
    opened by PapayasTehSkeletor 15
  • about evaluation

    Hi, how do you get the 26.0 FID on MS COCO using DM-GAN? The official result reported in https://github.com/MinfengZhu/DM-GAN is 26.55. I ran DM-GAN myself and got a similar result (26.54), not 26.0.

    opened by FrankCast1e 11
  • Colab error

    I found a Colab file referenced in a closed issue. The last cell (inference) shows this error. Does anyone know of a solution or a more recent Colab notebook? The web-based version of CogView works but takes a while to process queued requests. I used the "insert code" icon when editing this, and for some reason the output is all run together, making it unreadable.

    /content/CogView
    Traceback (most recent call last):
      File "generate_samples.py", line 28, in <module>
        from utils import Timers
      File "/content/CogView/utils.py", line 25, in <module>
        from fp16 import FP16_Optimizer
      File "/content/CogView/fp16/__init__.py", line 15, in <module>
        from .fp16util import (
      File "/content/CogView/fp16/fp16util.py", line 21, in <module>
        import mpu
      File "/content/CogView/mpu/__init__.py", line 35, in <module>
        from .layers import ColumnParallelLinear
      File "/content/CogView/mpu/layers.py", line 28, in <module>
        from apex.normalization.fused_layer_norm import FusedLayerNorm as LayerNorm
    ModuleNotFoundError: No module named 'apex'

    opened by metaphorz 8
  • Out of memory when using Text 2 Image

    [Wed Jun 16 19:21:01 2021] Memory cgroup out of memory: Killed process 15052 (python3) total-vm:19700460kB, anon-rss:11917232kB, file-rss:89696kB, shmem-rss:12288kB, UID:0 pgtables:25896kB oom_score_adj:0

    This is what I get after the process has been killed. Is there a way to optimize this to run on a GPU with less RAM? I'm using a Tesla T4 on Google Colab.

    Thanks.

    opened by johnpaulbin 8
  • Filter unsafe inputs (better)

    Guys, can you do something to prohibit inputs like this one? (attached image)

    Even worse, someone’s already trying to generate underage porn: https://user-images.githubusercontent.com/188197/122180141-d4754600-ce90-11eb-96f6-f8cbacee8da1.mp4

    Maybe add some Google-authed signup with the possibility of banning to at least make it harder for creeps to use the model for such things?

    enhancement 
    opened by vzakharov 6
  • The math in this paper

    I'd like to ask about the math part of this paper:

    1. Is there a more detailed derivation of the ELBO after text is added? Is it simply adding an NLL loss for the text to both sides of the image-only ELBO inequality? As far as I can tell, the text plays no role in the VQVAE training process, and this ELBO is for the VQVAE, so I don't understand why there is a loss term for text.
    2. I don't understand how Eq. (2) turns into Eq. (3).

    Thanks

    opened by brianw0924 5
  • Hello! CUDA out of memory when loading the pretrained cogview-caption model

    Hello! Our team plans to load the pretrained cogview-caption model and finetune it on V100s, which is consistent with what the paper says about pretraining on V100s. But we get "CUDA out of memory", and training can't be launched until model-parallel-size is set to 4. So, how can we load the pretrained model and finetune it on V100s? @neozhangthe1 @Sleepychord @lykeven @cenyk1230 @Somefive

    opened by starmemda 5
  • How to finetune the CogView to perform image captioning?

    Hello, I wonder how to finetune the CogView model to perform image captioning. Here is my question: what is the format of the input text? I notice that the input format in your code is [ROI1], text, [BASE], [BOI1], image, [EOI1]. What should I change to finetune for image captioning? Just change the format to [BASE], [BOI1], image, [EOI1], [ROI1], text, or something else?

    Looking forward to your reply, thanks!

    opened by ReneTCAd 3
  • docker pull error, "You have reached your pull rate limit"

    Is there any possibility to share the Dockerfile?

    I was trying to use the Docker image, since it's difficult to build the apex library. But when I ran docker pull as below: "docker pull cogview/cuda111_torch181_deepspeed040"

    I got the following error message: "Using default tag: latest Error response from daemon: toomanyrequests: You have reached your pull rate limit. You may increase the limit by authenticating and upgrading: https://www.docker.com/increase-rate-limit"

    Therefore, sharing the original Dockerfile may be better.

    opened by XuanxuanGao 3
  • script to finetune Cogview-base

    Hi, I'm trying to finetune the CogView pretrained model. However, when I try to load the model weights, I get the following error:

    RuntimeError: Error(s) in loading state_dict for GPT2Model:
        size mismatch for word_embeddings.weight: copying a param with shape torch.Size([14560, 2560]) from checkpoint, the shape in current model is torch.Size([14592, 2560]).

    Here is my script:

    NUM_WORKERS=1
    NUM_GPUS_PER_WORKER=4
    MP_SIZE=1

    script_path=$(realpath $0)
    echo $script_path
    script_dir=$(dirname $script_path)
    main_dir=$(dirname $script_dir)

    OPTIONS_NCCL="NCCL_DEBUG=info"
    HOST_FILE_PATH="hostfile_single"

    config_json="$script_dir/ds_config_zero.json"
    gpt_options=" \
        --experiment-name cogview-test_finetune \
        --img-tokenizer-num-tokens 8192 \
        --dataset-type TokenizedDataset \
        --model-parallel-size ${MP_SIZE} \
        --batch-size 4 \
        --num-layers 48 \
        --hidden-size 2560 \
        --num-attention-heads 40 \
        --save ./ \
        --train-iters 2000 \
        --save-interval 800 \
        --resume-dataloader \
        --train-data /path/to/my/data \
        --split 90,5,5 \
        --distributed-backend nccl \
        --lr-decay-style cosine \
        --warmup .1 \
        --checkpoint-activations \
        --deepspeed-activation-checkpointing \
        --max-position-embeddings 1089 \
        --max-memory-length 0 \
        --fp16 \
        --txt-loss-scale 5 \
        --load /path/to/cogview \
        --no-load-rng \
        --model-parallel-size 2 \
        --num-workers 16 \
        --is-sparse 0 \
        --finetune \
        --shuffle"

    gpt_options="${gpt_options} --deepspeed \
        --deepspeed_config ${config_json}"

    run_cmd="${OPTIONS_NCCL} deepspeed --num_nodes ${NUM_WORKERS} --num_gpus ${NUM_GPUS_PER_WORKER} --hostfile ${HOST_FILE_PATH} pretrain_gpt2.py $@ ${gpt_options}"

    It would be great if you could provide some details for finetuning. Thanks!

    opened by luyang-huang96 3
  • Copyright problems

    Who owns the copyright of the images generated by the engine? Or are the generated images published under a certain license (like CC0)?

    opened by Saintafox 1
  • After training, only a 2 GB .pt file is left. What's wrong with the missing files?

    After I tried to train with my own dataset, the .pt file I got was not 7 GB in size. Running ./scripts/text2image.sh --debug with it raises "RuntimeError: Error(s) in loading state_dict for GPT2Model:". Can anyone help me? Thank you.

    opened by tt-s-t 1
  • Do you have a model checkpoint smaller than 7GB?

    Hi, I found your model checkpoint at https://resource.wudaoai.cn/home?ind=2&name=WuDao%20WenHui&id=1399364355975327744 too large to run on my PC, as the hidden size of the checkpoint is 2560. Do you have a checkpoint with a smaller hidden size, such as 1024? Or can I easily shrink the released checkpoint? Thank you!

    opened by ZihanWangRuc 0
  • Is the CogView2 demo available here the same as the one from the CogView2 paper?

    opened by apolinario 1
  • Wudao down? Or the host was changed?

    None of these links work:

    https://agc.platform.baai.ac.cn/CogView/index.html
    https://thudm.github.io/CogView/index.html
    https://wudao.aminer.cn/CogView/index.html
    https://lab.aminer.cn/cogview/index.html

    opened by aleksusklim 3
Owner: THUDM (Data Mining Research Group at Tsinghua University)