Aphantasia is the inability to visualize mental images, the deprivation of visual dreams.
The image in the header is generated by the tool from this word.
- generating massive detailed textures, à la DeepDream
- fast convergence!
- fullHD/4K resolutions and above
- complex queries:
- text and/or image as main prompts
- additional text prompts for fine details and to subtract (avoid) topics
- criteria inversion (show "the opposite")
- continuous mode to process phrase lists (e.g. illustrating lyrics)
- saving/loading parameters to resume processing
- selectable CLIP model
Setup CLIP et cetera:
```
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/Po-Hsun-Su/pytorch-ssim
```
- Generate an image from the text prompt (set the size as you wish):
```
python clip_fft.py -t "the text" --size 1280-720
```
- Reproduce an image:
```
python clip_fft.py -i theimage.jpg --sync 0.01
```
The --sync X argument (X from 0 to 1) enables an SSIM loss to preserve the composition and details of the original image.
You can combine both text and image prompts.
Add the --translate option to process prompts in non-English languages.
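The --sync weight can be thought of as blending two objectives. A minimal sketch, assuming a simple linear mix (the function name and exact weighting are hypothetical, not clip_fft.py's actual code):

```python
# Hypothetical sketch: blend the CLIP prompt loss with an SSIM-based image
# loss using the --sync weight X. The real implementation may differ.
def total_loss(clip_loss, ssim_loss, sync):
    """Blend CLIP loss with SSIM loss; sync is in [0, 1]."""
    assert 0.0 <= sync <= 1.0
    return (1.0 - sync) * clip_loss + sync * ssim_loss
```

With sync near 0 (as in the example above), the text prompt dominates and the image only gently anchors the composition; with sync near 1, the result stays close to the original image.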
- Set a more specific query like this:
```
python clip_fft.py -t "macro figures" -t2 "micro details" -t0 "avoid this" --size 1280-720
```
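One way to picture how the three prompts interact — a hypothetical sketch, with assumed names and weights: the main (-t) and detail (-t2) similarities are rewarded, while similarity to the "avoid" (-t0) prompt is subtracted.

```python
# Hypothetical sketch of combining per-prompt CLIP similarities into one score;
# the weights and function name are assumptions, not the tool's actual code.
def multi_prompt_score(sim_main, sim_details, sim_avoid,
                       w_details=0.5, w_avoid=0.5):
    return sim_main + w_details * sim_details - w_avoid * sim_avoid
```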
- Other options:
--model M selects one of the released CLIP models.
--overscan mode processes a double-padded image to produce more uniform (and probably seamlessly tileable) textures. Omit it if you need a more centered composition.
--steps N sets the iteration count. 50-100 is enough for a start; 500-1000 would elaborate the image more thoroughly.
--samples N sets the number of image cuts (samples) processed at one step. With more samples you can use fewer iterations for a similar result (and vice versa). 200/200 is a good guess. NB: GPU memory consumption depends mostly on this count, not on the resolution!
--fstep N saves every Nth frame (useful with high iteration counts; default is 1).
--contrast X may be needed for the newer ResNet models (they tend to burn the colors).
--noise X adds some noise to the parameters, possibly making the composition less cluttered (to a degree).
--lrate controls the learning rate. The range is quite wide (tested from 0.01 to 10; try less or more).
--invert negates the whole criterion, if you fancy checking the "total opposite".
--save_pt myfile.pt will save the FFT parameters, so the query can be resumed later.
--verbose ('on' by default) enables some printouts and a realtime image preview.
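Several options above refer to the FFT parameters that the tool optimizes. A minimal sketch of the idea, assuming a NumPy-style inverse-FFT decoder (the repo's actual code differs in details): the image is stored as a frequency spectrum, which is why a small .pt file of parameters fully captures a resumable image state.

```python
import numpy as np

# Illustrative sketch (assumed, not the repo's exact code): decode an image
# from a complex frequency spectrum via an inverse real FFT.
def decode_spectrum(spectrum):
    """spectrum: complex array of shape (H, W//2 + 1) -> image in [0, 1] of shape (H, W)."""
    pixels = np.fft.irfft2(spectrum)
    return 1.0 / (1.0 + np.exp(-pixels))  # sigmoid squash to a displayable range

rng = np.random.default_rng(0)
spectrum = rng.normal(size=(64, 33)) + 1j * rng.normal(size=(64, 33))
image = decode_spectrum(spectrum)  # real-valued image of shape (64, 64)
```

Optimizing in frequency space rather than pixel space is what enables the large detailed textures and fast convergence listed among the features.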
- Make video from a text file, processing it line by line in one shot:
```
python illustra.py -i mysong.txt --size 1280-720 --length 155
```
This will first generate and save images for every text line (with sequences and training videos, as in the single-image mode above), then render the final video from those images (mixing them in FFT space), with the total duration in seconds set by --length.
By default, every frame is produced independently (randomly initialized). Instead,
--keep all starts each generation from the average of the previous runs; in practice that means similar compositions and smoother transitions.
--keep last amplifies that smoothness by starting each generation close to the last run, but the imagery can get stuck. This behaviour depends heavily on the input, so test with your prompts and see what works better in your case.
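The --keep behaviours can be sketched roughly as follows (function and variable names are assumptions, not illustra.py's actual code): "all" starts a line from the average of all previous runs, "last" starts from the last run, and the default is a fresh random init.

```python
# Hypothetical sketch of the --keep initialization strategies; parameters are
# represented here as plain lists of floats for illustration only.
def init_params(mode, history, fresh):
    if mode == "all" and history:
        # element-wise average of all previous runs' parameters
        return [sum(vals) / len(vals) for vals in zip(*history)]
    if mode == "last" and history:
        return list(history[-1])  # continue close to the previous run
    return fresh  # independent, randomly initialized frame
```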
- Make video from a directory with saved *.pt snapshots (just interpolate them):
```
python interpol.py -i mydir --length 155
```
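The interpolation idea can be illustrated like this — a sketch, not interpol.py itself: blending saved parameter snapshots linearly and decoding each blend yields a smooth morph between the corresponding images.

```python
import numpy as np

# Illustrative sketch: linear interpolation between two parameter snapshots
# (represented as arrays), yielding one blended snapshot per output frame.
def interpolate(params_a, params_b, n_frames):
    for t in np.linspace(0.0, 1.0, n_frames):
        yield (1.0 - t) * params_a + t * params_b

frames = list(interpolate(np.zeros(4), np.ones(4), 3))
```

Because the blending happens in parameter (FFT) space rather than pixel space, intermediate frames stay image-like instead of cross-fading.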