Feed forward VQGAN-CLIP model, where the goal is to eliminate the need for optimizing the latent space of VQGAN for each input prompt

Overview

Feed forward VQGAN-CLIP model, where the goal is to eliminate the need to optimize the VQGAN latent space for each input prompt. This is done by training a model that takes a text prompt as input and returns the VQGAN latent code as output, which is then decoded into an RGB image. The model is trained on a dataset of text prompts and can be used on unseen text prompts. The loss function minimizes the distance between the CLIP features of the generated image and the CLIP features of the input text. Additionally, a diversity loss can be used to increase the diversity of the images generated for the same prompt.
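
To make the objective concrete, below is a minimal sketch of one training step. The names net, vqgan, and clip_model are placeholders, and details such as CLIP's image normalization, augmentations/cutouts, and the exact form of the diversity term differ in the actual implementation in main.py.

import torch
import torch.nn.functional as F

def training_step(net, vqgan, clip_model, text_tokens, div_weight=0.0):
    # net maps CLIP text features to VQGAN latents; vqgan.decode renders them as RGB.
    with torch.no_grad():
        text_feats = clip_model.encode_text(text_tokens).float()        # (B, 512)
    z = net(text_feats)                          # predicted VQGAN latents
    images = vqgan.decode(z)                     # (B, 3, H, W)
    # Resize to CLIP's input resolution before encoding (normalization omitted here).
    images_224 = F.interpolate(images, size=224, mode="bilinear", align_corners=False)
    image_feats = clip_model.encode_image(images_224).float()
    # CLIP loss: minimize the distance between generated-image and input-text features.
    clip_loss = 1 - F.cosine_similarity(image_feats, text_feats).mean()
    # Optional diversity term: encourage different latents for the same prompt.
    div_loss = -torch.cdist(z.flatten(1), z.flatten(1)).mean()
    return clip_loss + div_weight * div_loss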

Open In Colab

How to install?

Download the 16384 Dimension Imagenet VQGAN (f=16)

Links:

Install dependencies.

conda

conda create -n ff_vqgan_clip_env python=3.8
conda activate ff_vqgan_clip_env
# Install pytorch/torchvision - See https://pytorch.org/get-started/locally/ for more info.
(ff_vqgan_clip_env) conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia
(ff_vqgan_clip_env) pip install -r requirements.txt

pip/venv

conda deactivate # Make sure to use your global python3
# venv is part of the Python 3 standard library; no pip install needed
python3 -m venv ./ff_vqgan_clip_venv
source ./ff_vqgan_clip_venv/bin/activate
$ (ff_vqgan_clip_venv) python -m pip install -r requirements.txt

How to use?

(Optional) Pre-tokenize Text

$ (ff_vqgan_clip_venv) python main.py tokenize data/list_of_captions.txt cembeds 128

Train

Modify configs/example.yaml as needed.

$ (ff_vqgan_clip_venv) python main.py train configs/example.yaml

Tensorboard:

Training loss is logged to TensorBoard.

# in a new terminal/session
(ff_vqgan_clip_venv) pip install tensorboard
(ff_vqgan_clip_venv) tensorboard --logdir results

Pre-trained models

Name            Type      Size     Dataset                  Link      Author
cc12m_8x128     VitGAN    12.1MB   Conceptual Captions 12M  Download  @mehdidc
cc12m_16x256    VitGAN    60.1MB   Conceptual Captions 12M  Download  @mehdidc
cc12m_32x512    VitGAN    408.4MB  Conceptual Captions 12M  Download  @mehdidc
cc12m_32x1024   VitGAN    1.55GB   Conceptual Captions 12M  Download  @mehdidc
cc12m_64x1024   VitGAN    3.05GB   Conceptual Captions 12M  Download  @mehdidc
bcaptmod_8x128  VitGAN    11.2MB   Modified blog captions   Download  @afiaka87
bcapt_16x128    MLPMixer  168.8MB  Blog captions            Download  @mehdidc

You can also access them from here

NB: cc12m_AxB means a model trained on Conceptual Captions 12M with depth A and hidden dimension B (e.g., cc12m_32x1024 has depth 32 and hidden dimension 1024).

After downloading a model or finishing training your own model, you can test it with new prompts, e.g.,

python -u main.py test pretrained_models/cc12m_32x1024/model.th "an armchair in the shape of an avocado"

You can also try it in the Colab Notebook. Using the notebook, you can generate images from pre-trained models and interpolate between text prompts to create videos; see for instance video 1, video 2, or video 3.

Acknowledgements

Comments
  • Models are broken in the new `torch` version

    Models are broken in the new `torch` version

    PyTorch introduced approximate GELU. This breaks the MLP-Mixer models. The fix is to save pre-trained models as weight dicts and not complete pickle objects.
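
    A minimal sketch of the suggested fix (build_model and the file names are hypothetical; the real model classes live in this repo):

    import torch

    # Save only the weights rather than the pickled module object, so newer
    # torch versions with a different GELU signature can still load them.
    torch.save(model.state_dict(), "model_state_dict.th")

    # To load: rebuild the architecture from its config, then load the weights.
    model = build_model(config)  # hypothetical constructor
    model.load_state_dict(torch.load("model_state_dict.th", map_location="cpu"))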

    opened by neverix 12
  • Allow different models in replicate.ai interface

    Allow different models in replicate.ai interface

    @CJWBW Thanks again for providing an interface to the model in replicate.ai. I would like now to allow the user to select between different models. I modified predict.py and download-weights.sh accordingly.

    I would like to update the image on https://replicate.ai/mehdidc/feed_forward_vqgan_clip/ . Is `cog push r8.im/mehdidc/feed_forward_vqgan_clip` the correct way to do it, or should it be done on your side? I tried the command but got "docker: Error response from daemon: could not select device driver "" with capabilities: [[gpu]]." since I don't have an NVIDIA GPU on my local machine (assuming that's the reason it failed).

    opened by mehdidc 12
  • Goal?

    Goal?

    Hey!

    Is the idea here to use CLIP embeds through a transformer similar to alstroemeria's CLIP Decision Transformer?

    edit: https://github.com/crowsonkb/cond_transformer_2

    opened by afiaka87 10
  • Error in Load Model

    Error in Load Model

    Two issues found:

    (1) A Restart Runtime occurs on !pip install requirements.txt . This, in turn, resets the current directory to /current. But even after manually updating the current directory....

    (2) Under Load Model: ImportError: /usr/local/lib/python3.7/dist-packages/torchtext/_torchtext.so: undefined symbol: _ZN3c106ivalue6Future15extractDataPtrsERKNS_6IValueE

    opened by metaphorz 9
  • Unavailable and broken links

    Unavailable and broken links

    When I run the notebook, some links seem unavailable. I don't know why this happens, because it seems that I can manually download the files in my web browser.

    Unavailable links

    Moreover, the links in the README are broken.

    Broken links

    opened by woctezuma 7
  • Observations training with different modifying words/phrases

    Observations training with different modifying words/phrases

    Searching for more photo-realistic output, I've found that training on certain words is likely to bias the output heavily.

    "illustration"/"cartoon" biases heavily towards a complete lack of photorealism in favor of very abstract interpretations that are often too simple in fact.

    Here is an example from training on the blog post captions with the word "minimalist" prepended to each caption (and all mannequin captions, which make up about 1/16 of the captions, removed):

    [progress image: progress_0000019700]

    In the EleutherAI Discord, a user @kingdomakrillic posted a very useful link https://imgur.com/a/SnSIQRu showing the effect a starting caption/modifier has on various other words when generating an image with the VQGAN+CLIP method.

    With those captions in mind, I decided to randomly prepend the modifying words/phrases that produced (subjectively) photo-realistic output to the blog post captions (see the sketch after the list):

            "8k resolution",
            "Flickr",
            "Ambient occlusion",
            "filmic",
            "global illumination",
            "Photo taken with Nikon D750",
            "DSLR",
            "20 megapixels",
            "photo taken with Ektachrome",
            "photo taken with Fugifilm Superia",
            "photo taken with Provia",
            "criterion collection",
            "National Geographic photo ",
            "Associated Press photo",
            "detailed",
            "shot on 70mm",
            "3840x2160",
            "ISO 200",
            "Tri-X 400 TX",
            "Ilford HPS",
            "matte photo",
            "Kodak Gold 200",
            "Kodak Ektar",
            "Kodak Portra",
            "geometric",
    

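    A minimal sketch of that preprocessing, assuming a plain-text captions file (the file name and the truncated modifier list are placeholders):

    import random

    # Randomly prepend one of the photo-realism modifiers above to each caption,
    # and drop captions that mention mannequins.
    MODIFIERS = ["8k resolution", "Flickr", "DSLR", "photo taken with Nikon D750"]  # etc.

    with open("blog_captions.txt") as f:
        captions = [line.strip() for line in f if "mannequin" not in line.lower()]

    augmented = [f"{random.choice(MODIFIERS)} {caption}" for caption in captions]
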
    With this in place, outputs tend to be much more photorealistic (similar caption to above, less than 1 epoch trained): <|startoftext|> 2 0 megapixels photo of richmond district , san francisco , from a tall vantage point in the morning <|endoftext|> [progress image: progress_0000005100]

    None of this is very principled however and my next attempts were indeed going to be either "add noise to the captions" or "train on image-text pairs as well" - both of which seem to be in the codebase already! So I'm going to have a try with that.

    In the meantime, here is a checkpoint from the first round of captions (prepend "minimalist" to every blog caption, removing all captions containing "mannequin"). I trained it using the VitGAN for 8 epochs, 128 dim, 8 depth, ViT-B/16, 32 cutn. The loss was perhaps still going down at this point, but with very diminished returns.

    model.th.zip

    opened by afiaka87 6
  • Support new CLIP models (back to old install)

    Support new CLIP models (back to old install)

    Wasn't expecting an update from OpenAI so soon, but I think we have to do this (unfortunately) again until rom1504's branch for the clip-anytorch package is even with main.

    opened by afiaka87 4
  • VQGAN - blended models

    VQGAN - blended models

    I want to take a film (say The Shining),

    • caption it using Amazon AI label detection (maybe 1 in every 100 frames)
    • throw these image + text pairs into training
    • then have the trained model spit out something in the style of the movie....

    Is it possible? In the nerdyrodent/VQGAN-CLIP repo there's a style transfer,

    • but what I'm asking is how to merge the model layers so that the content is skewed toward a certain style/aesthetic.

    @norod + @justinpinkney were successful in blending models together (FFHQ + cartoon designs); could the same be achieved in this VQGAN domain? They essentially perform some neural surgery, hacking the layers to force the results: https://github.com/justinpinkney/toonify

    Does the VQGAN give us some access to hack these layers?

    UPDATE: @JCBrouwer seems to have combined style transfer with video here: https://github.com/JCBrouwer/maua-style

    fyi @nerdyrodent

    opened by johndpope 3
  • How to condition the model output z so that it looks like it came from a standard normal distribution?

    How to condition the model output z so that it looks like it came from a standard normal distribution?

    Hi, this is a nice repo, and I'm trying to reimplement something similar for StyleGAN2. Using a list of texts, I'm trying to map CLIP text embeddings to StyleGAN2 latent vectors, which are fed into the StyleGAN2 generator to produce images, and then optimize this MLP mapper model using a CLIP loss. However, I'm quickly getting blown-out images for entire batches. I suspect this is because the output of the MLP is not conditioned to look like it came from a standard normal distribution. I wonder if you could point me in the right direction on how to do this.
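
    One possible direction, as a hedged sketch rather than this repo's method: add a moment-matching penalty on the predicted latents so their batch statistics stay close to N(0, I) (normal_prior_penalty and prior_weight are hypothetical names):

    import torch

    def normal_prior_penalty(z):
        # Push the per-dimension batch mean toward 0 and the std toward 1.
        mean_term = z.mean(dim=0).pow(2).mean()
        std_term = (z.std(dim=0) - 1).pow(2).mean()
        return mean_term + std_term

    # total_loss = clip_loss + prior_weight * normal_prior_penalty(z)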

    opened by xiankgx 2
  • Add Docker environment & web demo

    Add Docker environment & web demo

    Hey @mehdidc! 👋

    We find your model really cool: it generates images from prompts ultra fast!

    This pull request makes it possible to run your model inside a Docker environment, which makes it easier for other people to run it. We're using an open source tool called Cog to make this process easier.

    This also means we can make a web page where other people can try out your model! View it here: https://replicate.ai/mehdidc/feed_forward_vqgan_clip

    Claim your page here so you can edit it, and we'll feature it on our website and tweet about it too.

    In case you're wondering who I am, I'm from Replicate, where we're trying to make machine learning reproducible. We got frustrated that we couldn't run all the really interesting ML work being done. So, we're going round implementing models we like. 😊

    opened by chenxwh 1
  • How to get more variation in the null image

    How to get more variation in the null image

    I've been generating images using this model, which is delightfully fast, but I've noticed that it produces images that are all alike. I tried generating the "null" image by doing:

    H = perceptor.encode_text(toks.to(device)).float()
    z = net(0 * H)
    

    This resulted in:

    [image: base (null) image]

    And indeed, everything I generated kind of matched that: you can see the fleshly protrusion on the left in "gold coin":

    [image: "gold coin"]

    The object and matching mini-object in "tent":

    [image: "tent"]

    And it always seems to try to caption the image with nonsense lettering ("lion"):

    [image: "lion"]

    So I'm wondering if there's a way to "prime" the model and suggest it use a different zero image for each run. Is there a variable I can set, or is this deeply ingrained in training data?

    Any advice would be appreciated, thank you!

    (Apologies if this is the same as #8, but it sounded like #8 was solved by using priors which doesn't seem to help with this.)
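
    One untested workaround, reusing the variables from the snippet above: perturb the text embedding with a small amount of Gaussian noise before mapping it, so each run starts from a slightly different latent (noise_scale is a made-up knob to tune):

    import torch

    H = perceptor.encode_text(toks.to(device)).float()
    noise_scale = 0.1  # larger values give more variation, at some cost to prompt fidelity
    z = net(H + noise_scale * torch.randn_like(H))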

    opened by kchodorow 0
  • training GPU configuration

    training GPU configuration

    Thanks for your excellent repo.

    When training cc12m_32x1024 as a VitGAN or MLP Mixer, what kind of GPU environment do you use? A Tesla V100 with 32GB of memory, or something else?

    Thanks

    opened by CrossLee1 1
  • Slow Training Speed

    Slow Training Speed

    Hi, first of all, great work! I really love it. To replicate, I tried training on the Conceptual 12M dataset with the same depth and dims as the pretrained models, but training was too slow: even after 4 days it was still going through the first (0th) epoch. I'm training on an NVIDIA Quadro RTX A6000, which I don't think is that slow. Any suggestions to improve training speed? I have multi-GPU access, but it seems that isn't supported right now. Thanks!

    opened by s13kman 3
  • clarifying differences between available models

    clarifying differences between available models

    Hi @mehdidc 👋🏼 I'm a new team member at @replicate.

    I was trying out your model on replicate.ai and noticed that the names of the models are a bit cryptic, so it's hard to know what differences to expect when using each:

    [screenshot of the model options on replicate.ai]

    Here's where those are declared:

    https://github.com/mehdidc/feed_forward_vqgan_clip/blob/dd640c0ee5f023ddf83379e6b3906529511ce025/predict.py#L10-L14

    Looking at the source for cog's Input class it looks like options can be a list of anything:

    options: Optional[List[Any]] = None
    

    I'm not sure if this is right, but maybe this means that each model could be declared as a tuple with an accompanying label:

    MODELS = [
        ("cc12m_32x1024_vitgan_v0.1.th", "This model does x"),
        ("cc12m_32x1024_vitgan_v0.2.th" "This model does y"),,
        ("cc12m_32x1024_mlp_mixer_v0.2.th", "This model does z"),
    ]
    

    We could then display those labels on the model form on replicate.ai to make the available options more clear to users.

    Curious to hear your thoughts!

    cc @CJWBW @bfirsh @andreasjansson

    opened by zeke 2
  • How to improve so we could get results closer to the "regular" VQGAN+CLIP?

    How to improve so we could get results closer to the "regular" VQGAN+CLIP?

    Hi! I really love this idea and think that this concept solves the main bottleneck of the current VQGAN+CLIP approach, which is the per-prompt optimisation. I love how instantaneous this approach is at generating new images. However, results with the different CC12M or blog-caption models fall short in comparison to the most recent VQGAN+CLIP optimisation approaches.

    I am wondering where it could potentially be improved. One thing could be trying to incorporate the most recent MSE-regularised and z+quantize VQGAN+CLIP approaches. The other is whether a bigger training dataset would improve quality: would it make sense to train on ImageNet captions, or maybe an even bigger 100M+ caption dataset (maybe C@H)?

    As you can see, I can't actually contribute much (but I could help with a bigger dataset training effort) but I'm cheering for this project to not die!

    opened by apolinario 2
  • Finetuning CLIP to improve domain-specific performance

    Finetuning CLIP to improve domain-specific performance

    It's quite easy to finetune one of the OpenAI CLIP checkpoints with this codebase:

    https://github.com/Zasder3/train-CLIP-FT

    Uses PyTorch Lightning. May be worth pursuing.

    opened by afiaka87 1