Aphantasia is the inability to visualize mental images, the deprivation of visual dreams.
The image in the header is generated by the tool from this word.
- generating massive detailed textures, à la DeepDream
- fast convergence!
- fullHD/4K resolutions and above
- complex queries:
- text and/or image as main prompts
- additional text prompts for fine details and to subtract (avoid) topics
- criteria inversion (show "the opposite")
- continuous mode to process phrase lists (e.g. illustrating lyrics)
- saving/loading parameters to resume processing
- selectable CLIP model
Setup CLIP et cetera:
```
pip install -r requirements.txt
pip install git+https://github.com/openai/CLIP.git
pip install git+https://github.com/Po-Hsun-Su/pytorch-ssim
```
- Generate an image from the text prompt (set the size as you wish):
```
python clip_fft.py -t "the text" --size 1280-720
```
- Reproduce an image:
```
python clip_fft.py -i theimage.jpg --sync 0.01
```
The --sync X argument (X from 0 to 1) enables an SSIM loss to preserve the composition and details of the original image.
You can combine both text and image prompts.
Add the --translate option to process prompts in non-English languages.
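The --sync weight can be thought of as blending two objectives. A minimal sketch, assuming a simple linear mix (the function name and exact weighting are hypothetical, not clip_fft.py's actual code):

```python
# Hypothetical sketch: blend the CLIP prompt loss with an SSIM-based image
# loss using the --sync weight X. The real implementation may differ.
def total_loss(clip_loss, ssim_loss, sync):
    """Blend CLIP loss with SSIM loss; sync is in [0, 1]."""
    assert 0.0 <= sync <= 1.0
    return (1.0 - sync) * clip_loss + sync * ssim_loss
```

With sync near 0 (as in the example above), the text prompt dominates and the image only gently anchors the composition; with sync near 1, the result stays close to the original image.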
- Set a more specific query like this:
```
python clip_fft.py -t "macro figures" -t2 "micro details" -t0 "avoid this" --size 1280-720
```
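One way to picture how the three prompts interact — a hypothetical sketch, with assumed names and weights: the main (-t) and detail (-t2) similarities are rewarded, while similarity to the "avoid" (-t0) prompt is subtracted.

```python
# Hypothetical sketch of combining per-prompt CLIP similarities into one score;
# the weights and function name are assumptions, not the tool's actual code.
def multi_prompt_score(sim_main, sim_details, sim_avoid,
                       w_details=0.5, w_avoid=0.5):
    return sim_main + w_details * sim_details - w_avoid * sim_avoid
```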
- Other options:
--model M selects one of the released CLIP models.
--overscan mode processes a double-padded image to produce more uniform (and probably seamlessly tileable) textures. Omit it if you need a more centered composition.
--steps N sets the iteration count. 50-100 is enough for a start; 500-1000 would elaborate the image more thoroughly.
--samples N sets the number of image cuts (samples) processed at one step. With more samples you can use fewer iterations for a similar result (and vice versa). 200/200 is a good guess. NB: GPU memory consumption depends mostly on this count, not on the resolution!
--fstep N saves every Nth frame (useful with high iteration counts; default is 1).
--contrast X may be needed for the newer ResNet models (they tend to burn the colors).
--noise X adds some noise to the parameters, possibly making the composition less cluttered (to a degree).
--lrate controls the learning rate. The range is quite wide (tested from 0.01 to 10; try less or more).
--invert negates the whole criterion, if you fancy checking the "total opposite".
--save_pt myfile.pt will save the FFT parameters, so the query can be resumed later.
--verbose ('on' by default) enables some printouts and a realtime image preview.
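Several options above refer to the FFT parameters that the tool optimizes. A minimal sketch of the idea, assuming a NumPy-style inverse-FFT decoder (the repo's actual code differs in details): the image is stored as a frequency spectrum, which is why a small .pt file of parameters fully captures a resumable image state.

```python
import numpy as np

# Illustrative sketch (assumed, not the repo's exact code): decode an image
# from a complex frequency spectrum via an inverse real FFT.
def decode_spectrum(spectrum):
    """spectrum: complex array of shape (H, W//2 + 1) -> image in [0, 1] of shape (H, W)."""
    pixels = np.fft.irfft2(spectrum)
    return 1.0 / (1.0 + np.exp(-pixels))  # sigmoid squash to a displayable range

rng = np.random.default_rng(0)
spectrum = rng.normal(size=(64, 33)) + 1j * rng.normal(size=(64, 33))
image = decode_spectrum(spectrum)  # real-valued image of shape (64, 64)
```

Optimizing in frequency space rather than pixel space is what enables the large detailed textures and fast convergence listed among the features.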
- Make video from a text file, processing it line by line in one shot:
```
python illustra.py -i mysong.txt --size 1280-720 --length 155
```
This will first generate and save images for every text line (with sequences and training videos, as in the single-image mode above), then render the final video from those images (mixing them in FFT space), with the total duration in seconds set by --length.
By default, every frame is produced independently (randomly initialized). Instead,
--keep all starts each generation from the average of the previous runs; in practice that means similar compositions and smoother transitions.
--keep last amplifies that smoothness by starting each generation close to the last run, but the imagery can get stuck. This behaviour depends heavily on the input, so test with your prompts and see what works better in your case.
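The --keep behaviours can be sketched roughly as follows (function and variable names are assumptions, not illustra.py's actual code): "all" starts a line from the average of all previous runs, "last" starts from the last run, and the default is a fresh random init.

```python
# Hypothetical sketch of the --keep initialization strategies; parameters are
# represented here as plain lists of floats for illustration only.
def init_params(mode, history, fresh):
    if mode == "all" and history:
        # element-wise average of all previous runs' parameters
        return [sum(vals) / len(vals) for vals in zip(*history)]
    if mode == "last" and history:
        return list(history[-1])  # continue close to the previous run
    return fresh  # independent, randomly initialized frame
```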
- Make video from a directory with saved *.pt snapshots (just interpolate them):
```
python interpol.py -i mydir --length 155
```
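The interpolation idea can be illustrated like this — a sketch, not interpol.py itself: blending saved parameter snapshots linearly and decoding each blend yields a smooth morph between the corresponding images.

```python
import numpy as np

# Illustrative sketch: linear interpolation between two parameter snapshots
# (represented as arrays), yielding one blended snapshot per output frame.
def interpolate(params_a, params_b, n_frames):
    for t in np.linspace(0.0, 1.0, n_frames):
        yield (1.0 - t) * params_a + t * params_b

frames = list(interpolate(np.zeros(4), np.ones(4), 3))
```

Because the blending happens in parameter (FFT) space rather than pixel space, intermediate frames stay image-like instead of cross-fading.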