Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

Federico Galatolo

Last update: Dec 22, 2022

Overview

CLIP-GLaSS

Repository for the paper Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search

An in-browser demo is available here

Installation

Clone this repository

git clone https://github.com/galatolofederico/clip-glass && cd clip-glass

Create a virtual environment and install the requirements

virtualenv --python=python3.6 env && . ./env/bin/activate
pip install -r requirements.txt

Run CLIP-GLaSS

You can run CLIP-GLaSS with:

python run.py --config <config> --target <target>

Specify <config> and <target> according to the following table:

Config	Meaning	Target Type
GPT2	Use GPT2 to solve the Image-to-Text task	Image
DeepMindBigGAN512	Use DeepMind's BigGAN 512x512 to solve the Text-to-Image task	Text
DeepMindBigGAN256	Use DeepMind's BigGAN 256x256 to solve the Text-to-Image task	Text
StyleGAN2_ffhq_d	Use StyleGAN2-ffhq to solve the Text-to-Image task	Text
StyleGAN2_ffhq_nod	Use StyleGAN2-ffhq without Discriminator to solve the Text-to-Image task	Text
StyleGAN2_church_d	Use StyleGAN2-church to solve the Text-to-Image task	Text
StyleGAN2_church_nod	Use StyleGAN2-church without Discriminator to solve the Text-to-Image task	Text
StyleGAN2_car_d	Use StyleGAN2-car to solve the Text-to-Image task	Text
StyleGAN2_car_nod	Use StyleGAN2-car without Discriminator to solve the Text-to-Image task	Text
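
The table above can be read as a mapping from config name to the kind of target it expects. A minimal sketch of that mapping in Python follows; the names CONFIG_TARGET_TYPES and build_command are hypothetical helpers for illustration and are not part of the repository. Only the config names, target types, and the run.py flags come from this README.

```python
# Config names and target types copied from the table above.
CONFIG_TARGET_TYPES = {
    "GPT2": "Image",
    "DeepMindBigGAN512": "Text",
    "DeepMindBigGAN256": "Text",
    "StyleGAN2_ffhq_d": "Text",
    "StyleGAN2_ffhq_nod": "Text",
    "StyleGAN2_church_d": "Text",
    "StyleGAN2_church_nod": "Text",
    "StyleGAN2_car_d": "Text",
    "StyleGAN2_car_nod": "Text",
}


def build_command(config, target):
    """Return the argument list for run.py (hypothetical helper).

    Raises ValueError if the config name is not listed in the table.
    """
    if config not in CONFIG_TARGET_TYPES:
        raise ValueError(f"unknown config: {config!r}")
    return ["python", "run.py", "--config", config, "--target", target]


if __name__ == "__main__":
    # An Image target is a path to a file, a Text target is a caption.
    print(CONFIG_TARGET_TYPES["GPT2"])
    print(build_command("GPT2", "gpt2_images/dog.jpeg"))
```

The returned list could be passed to subprocess.run from the repository root; the sketch itself does not launch anything.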

If you have not downloaded the model weights, you will be prompted to run ./download-weights.sh. The results are written to the folder ./tmp; a different output folder can be specified with --tmp-folder.

Examples

python run.py --config StyleGAN2_ffhq_d --target "the face of a man with brown eyes and stubble beard"
python run.py --config GPT2 --target gpt2_images/dog.jpeg

Acknowledgments and licensing

This work relies heavily on the following amazing repositories and would not have been possible without them:

CLIP from openai (included in the folder clip)
pytorch-pretrained-BigGAN from huggingface
stylegan2-pytorch from Adrian Sahlman (included in the folder stylegan2)
gpt-2-pytorch from Tae-Hwan Jung (included in the folder gpt2)

All their work can be shared under the terms of the respective original licenses.

All my original work (everything except the content of the folders clip, stylegan2 and gpt2) is released under the terms of the GNU/GPLv3 license. Copying, adapting, and republishing it is not only permitted but also encouraged.

Citing

If you want to cite us, you can use this BibTeX entry:

@article{galatolo_glass
,	author	= {Galatolo, Federico A and Cimino, Mario GCA and Vaglini, Gigliola}
,	title	= {Generating images from caption and vice versa via CLIP-Guided Generative Latent Space Search}
,	year	= {2021}
}

Contacts

For any further questions, feel free to reach me at [email protected] or on Telegram @galatolo

Comments

Support "This Anime Does Not Exist" StyleGAN2 model by aydao/gwern for anime image generation

Website: https://thisanimedoesnotexist.ai/

The model can be downloaded here: https://www.gwern.net/Faces#tadne-download

Considering GLaSS already supports BigGAN and different SG2 models, I hope it wouldn't be too hard to add this great model too.

opened by n00mkrad 4

Using Multi GPUs

When I try the text-to-image task, I always run out of CUDA memory on a single GPU. I tried setting the device to 'cuda:0,1', but it doesn't work. I get an error like this:

File "run.py", line 54, in <module>
    problem = GenerationProblem(config)
  File "/home/ubuntu/Documents/clip-glass/problem.py", line 9, in __init__
    self.generator = Generator(config)
  File "/home/ubuntu/Documents/clip-glass/generator.py", line 16, in __init__
    self.CLIP, clip_preprocess = clip.load("ViT-B/32", device=self.config.device, jit=False)
  File "/home/ubuntu/anaconda3/envs/lib/python3.7/site-packages/clip/clip.py", line 137, in load
    model = build_model(state_dict or model.state_dict()).to(device)
  File "/home/ubuntu/anaconda3/envs/lib/python3.7/site-packages/torch/nn/modules/module.py", line 600, in to
    device, dtype, non_blocking, convert_to_format = torch._C._nn._parse_to(*args, **kwargs)
RuntimeError: Invalid device string: 'cuda:0,1,'

I wonder how to set it up on multiple GPUs properly.
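
The traceback shows the configured device string being passed to a single .to(device) call, which accepts one device such as 'cuda:0' and rejects a comma-separated list. A minimal sketch of how such a list could be split into valid single-device strings follows; split_device_string is a hypothetical helper, not part of the repository or of PyTorch. Spreading the models over those devices would still require changes to the repository code (for example with torch.nn.DataParallel), which this sketch does not attempt.

```python
def split_device_string(spec):
    """Split a spec such as 'cuda:0,1' into ['cuda:0', 'cuda:1'].

    Hypothetical helper: a spec without indices (e.g. 'cpu') is returned
    unchanged as a one-element list, and empty entries are ignored.
    """
    kind, _, indices = spec.partition(":")
    if not indices:
        return [kind]
    parts = [part.strip() for part in indices.split(",") if part.strip()]
    for part in parts:
        if not part.isdigit():
            raise ValueError(f"invalid device index: {part!r}")
    return [f"{kind}:{part}" for part in parts]


if __name__ == "__main__":
    print(split_device_string("cuda:0,1"))  # ['cuda:0', 'cuda:1']
```

Each returned string names exactly one device, which is the form the failing call expects.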

opened by KevinGoodman 3

Demo Colab Notebook doesn't support new pytorch versions

In the initialization command, the expression that builds the PyTorch version string does not work for versions that are not in the suffix mapping dictionary. Before: pytorch_version = "1.7.1" + pytorch_suffix[version] if version in pytorch_suffix else "+cu110"

Fixed with parentheses: pytorch_version = "1.7.1" + (pytorch_suffix[version] if version in pytorch_suffix else "+cu110")
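
The difference comes from operator precedence: a conditional expression binds more loosely than +, so without parentheses the whole concatenation becomes the "if" branch and the bare suffix becomes the "else" branch. A minimal sketch that reproduces both behaviours follows; the contents of pytorch_suffix and the version keys are made up for illustration, since the notebook's real mapping is not shown here.

```python
# Hypothetical mapping; the notebook's actual dictionary is not shown here.
pytorch_suffix = {"10.1": "+cu101"}


def version_before(version):
    # Parsed as: ("1.7.1" + pytorch_suffix[version]) if ... else "+cu110"
    return "1.7.1" + pytorch_suffix[version] if version in pytorch_suffix else "+cu110"


def version_after(version):
    # The parentheses restrict the conditional to the suffix only.
    return "1.7.1" + (pytorch_suffix[version] if version in pytorch_suffix else "+cu110")


if __name__ == "__main__":
    print(version_before("10.1"))  # 1.7.1+cu101
    print(version_before("11.2"))  # +cu110 (version number lost)
    print(version_after("11.2"))   # 1.7.1+cu110
```

For a version that is in the mapping both forms agree, which is why the problem only appears with newer versions.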

The notebook is incredible and a great resource to go along with the research, great work!

opened by exofusion 3

Captioning results not consistent with the paper

Hi,

I tried your model on image captioning using the demo dog image but got totally different results from your paper. I ran your script 5 times with the default settings and got the following results:

['the picture of the dog's body.\n\n"The dog's body is']
['the picture of a dog with a bloated, bloated, bloa']
["the picture of the puppy's body, with the body's b"]
["the picture of the dog's body, with a large, round"]
["the picture of the dog's body. The dog's body is c"]
['the picture of a dog with a large belly.\n\nâ¼\n\nâ¼\n\nâ¼\n']

The captioning result shown in your paper is as follows.

Is there any setting I need to change for image captioning? Thank you.

opened by zhuang93 2

GPT-2 output console length?

Hi, first of all thanks for your work :) I don't know if this is an issue. When I select the config "GPT-2", the output text of the prediction seems to be incomplete (example: "the picture of a man who is a man, a man who is a"), as if something is missing. Is this a bug? If not, is there a way to increase the output length?

many thanks in advance

opened by smithee77 2

Why does BigGAN not use the discriminator?

Thanks for your work. I found that the discriminator helps to improve the generation quality in the StyleGAN setting. Why not use the BigGAN discriminator in the BigGAN setting?

Thanks.

opened by liuzhengzhe 1

Issue when running: virtualenv --python=python3.6 env && . ./env/bin/activate

Hello, When I run the following,

virtualenv --python=python3.6 env && . ./env/bin/activate

I get this output:

RuntimeError: failed to find interpreter for Builtin discover of python_spec='python3.6'

Thoughts?
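
The error indicates that virtualenv could not find an interpreter named python3.6 on this machine. A minimal sketch, using only the Python standard library, that checks which interpreter names are on PATH follows; the helper name first_available and the candidate list are illustrative assumptions, not part of the repository.

```python
import shutil


def first_available(candidates):
    """Return (name, path) for the first interpreter found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return name, path
    return None


if __name__ == "__main__":
    # Illustrative candidates; the README itself only mentions python3.6.
    print(first_available(["python3.6", "python3", "python"]))
```

If a different interpreter is found, its name could be passed to --python in place of python3.6, although the pinned requirements may not all install on a newer Python version.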

opened by alexp-12 1

RuntimeError: Method 'forward' is not defined.

your demo notebook worked for me yesterday but today it's giving me this: RuntimeError: Method 'forward' is not defined.

I really like your implementation! I don't think I changed anything in what I'm doing. any ideas?
i'm pretty much a noob, trying to learn this stuff. thanks in advance

opened by socalledsound 1

how to complete image with text

how to complete image with text

Example: I give an image that is unfilled from the middle down, then I write "same image but below, a sketch of the image", and it generates an image whose upper half is the original and whose lower half is a sketch.

opened by molo32 1

Support for GPT-3

Hi! Love the project.

I'm in the OpenAI GPT-3 beta, and I was wondering if it's possible for clip-glass to support GPT-3 for the image-to-text task.

If it's possible, I'd love to help set that integration up but I'm not sure where to start.

opened by indiv0 1

Owner

Federico Galatolo

PhD Student @ University of Pisa
