Implements VQGAN+CLIP for image and video generation, and style transfers, based on text and image prompts. Emphasis on ease-of-use, documentation, and smooth video creation.

Ryan Hamilton

Last update: Dec 30, 2022

VQGAN-CLIP-GENERATOR Overview

This is a package (with available notebook) for running VQGAN+CLIP locally, with a focus on ease of use, good documentation, and generating smooth style transfer videos. There are three main user-facing functions: generate.image(), generate.video_frames(), and generate.style_transfer().

This package started as a complete refactor of the code provided by NerdyRodent, which started out as a Katherine Crowson VQGAN+CLIP-derived Google Colab notebook.

In addition to refactoring NerdyRodent's code into a more Pythonic package to improve usability, this project includes the following notable elements:

Significant improvements to the quality of style transfer videos
Video smoothing/deflicker by applying EWMA to latent vector series
A wrapper for Real-ESRGAN
Improvements to generated image quality derived from the use of NerdyRodent's cut method code
Example code for video includes optical flow interpolation using RIFE
Unit tests
A Google Colab notebook

Environment:

Tested on Windows 10 build 19043
- GPU: Nvidia RTX 3080 10GB
- CPU: AMD 5900X
Also tested in Google Colab (free and pro tiers) using this notebook.
Typical VRAM requirements:
- 24 GB for a 900x900 image (1200x675 in 16:9 format)
- 16 GB for a 700x700 image (933x525 in 16:9 format)
- 10 GB for a 512x512 image (684x384 in 16:9 format)
- 8 GB for a 380x380 image (507x285 in 16:9 format)
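
The 16:9 sizes in the list above have roughly the same pixel count as the square sizes next to them. As a rough illustration (the helper name equal_area_size is hypothetical and not part of this package), an equal-area size for another aspect ratio could be estimated like this:

```python
import math

def equal_area_size(square_side, aspect_w=16, aspect_h=9):
    """Return (width, height) with the requested aspect ratio and roughly
    the same pixel count as a square_side x square_side image."""
    area = square_side * square_side
    width = round(math.sqrt(area * aspect_w / aspect_h))
    height = round(math.sqrt(area * aspect_h / aspect_w))
    return width, height

for side in (900, 700, 380):
    print(side, equal_area_size(side))
```

The result may differ from the listed sizes by a pixel or two, and the package itself adjusts the output size slightly depending on the CLIP model used.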

Setup

Virtual environment

This example uses Anaconda to manage virtual Python environments. Create a new virtual Python environment for VQGAN-CLIP-GENERATOR. Then, install the dependencies and this VQGAN-CLIP-GENERATOR package using pip. If you are completely new to Python and just want to make some art, I have a quick start guide.

conda create --name vqgan python=3.9 pip numpy pytest tqdm git pytorch==1.9.0 torchvision=0.10.0 torchaudio=0.9.0 cudatoolkit=11.1 -c pytorch -c conda-forge
conda activate vqgan
conda install -c conda-forge ffmpeg
conda install ipykernel
pip install git+https://github.com/openai/CLIP.git taming-transformers==0.0.1 ftfy==6.0.3 regex pytorch-lightning==1.4.9 kornia==0.5.11 imageio==2.9.0 omegaconf==2.1.1 torch-optimizer==0.1.0 piexif==1.1.3
pip install git+https://github.com/rkhamilton/vqgan-clip-generator.git

To upgrade to the latest version of this package, use the standard pip upgrade command.

pip install git+https://github.com/rkhamilton/vqgan-clip-generator.git --upgrade

If you want to get into the guts of the code and run a local development copy so you can tinker with the algorithm (be my guest!), do not use pip to install it. Instead, clone the repository and set it up in develop mode.

git clone https://github.com/rkhamilton/vqgan-clip-generator.git
cd .\vqgan-clip-generator\
python setup.py develop

Quick example to confirm that it works

import vqgan_clip.generate
from vqgan_clip.engine import VQGAN_CLIP_Config
import os

config = VQGAN_CLIP_Config()
config.output_image_size = [128,128]
vqgan_clip.generate.image(eng_config = config,
        text_prompts = 'A pastoral landscape painting by Rembrandt',
        iterations = 100,
        output_filename = 'output.png')

Optionally, install Real-ESRGAN for image upscaling

Real-ESRGAN is a package that uses machine learning for image restoration, including upscaling and cleaning up noisy images. Given that VQGAN+CLIP output sizes are significantly limited by available VRAM, using a sophisticated upscaler can be useful.

To install Real-ESRGAN for use in this package, run the commands below. See Real-ESRGAN.md for additional instructions, including use of custom upscaling/restoration models.

conda activate vqgan
pip install opencv-python scipy
pip install basicsr
pip install facexlib
pip install gfpgan
pip install git+https://github.com/xinntao/Real-ESRGAN

Optionally, download arXiv2020-RIFE

The project arXiv2020-RIFE is an optical flow interpolation implementation for increasing the framerate of existing video. Optical flow creates extra frames of video that smoothly move visual elements from their positions in the first frame to their positions in the second frame.

I've provided examples of how you can combine Real-ESRGAN and RIFE to upscale and interpolate generated content.

In addition to the setup commands above, run the following commands to set up RIFE. Please note that RIFE does not offer an installable python package, unlike the packages above. You will have to clone their repository to the working directory you plan to use for your VQGAN+CLIP projects. Then, download the RIFE trained model v3.8 to the ./arXiv2020-RIFE/train_log/ folder.

conda activate vqgan
pip install sk-video
pip install opencv-python
pip install moviepy
git clone git@github.com:hzwer/arXiv2020-RIFE.git

If using an AMD graphics card

The instructions above assume an NVIDIA GPU with support for CUDA 11.1. Instructions for an AMD GPU below are courtesy of NerdyRodent. Note: I have not tested this advice.

ROCm can be used for AMD graphics cards instead of CUDA. You can check if your card is supported here: https://github.com/RadeonOpenCompute/ROCm#supported-gpus

Install ROCm according to the instructions and don't forget to add the user to the video group: https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html

The usage and setup instructions above are the same, except for the line where you install PyTorch. Instead of pip install torch==1.9.0+cu111 ..., use the one or two lines which are displayed here (select Pip -> Python-> ROCm): https://pytorch.org/get-started/locally/

If using the CPU

If no graphics card can be found, the CPU is automatically used and a warning is displayed. In my testing, an RTX 3080 GPU is >25x faster than a 5900X CPU. Using a CPU may be impractically slow.

This works with the CUDA version of PyTorch, even without CUDA drivers installed, but doesn't seem to work with ROCm as of now.

Uninstalling

Remove the Python environment:

conda deactivate
conda remove --name vqgan --all

Remove any cached model files at ~/.cache/torch/hub/models.

Generating images and video

Functions

Generating images and video is done through functions in the vqgan_clip.generate module. For the functions that generate folders of images, you may optionally convert them to video using the included video_tools.encode_video() method, which is a wrapper for ffmpeg.

Function	Purpose
generate.image()	Generate a single image.
generate.video_frames()	Generate a sequence of images by running the VQGAN training while periodically saving the generated images to unique files. The resulting images can "zoom in" or translate around if you use optional arguments to transform each generated frame of video. The result is a folder of images that can be combined using (e.g.) ffmpeg.
generate.style_transfer()	Apply VQGAN_CLIP to each frame of an existing video. This is an enhancement of the standard style transfer algorithm that has improvements to the fluidity of the resulting video. The result is a folder of images that can be combined using (e.g.) ffmpeg.

Prompts

Prompts are objects that can be analyzed by CLIP to identify their contents. The resulting images will be those that are similar to the prompts. Prompts can be any combination of text phrases, example images, or random number generator seeds. Each of these types of prompts is in a separate string, discussed below.

Multiple prompts can be combined, both in parallel and in series. Prompts that should be used in parallel are separated by a pipe symbol, like so:

'first parallel prompt | second parallel prompt'

Prompts that should be processed in series should be separated by a caret (^). Serial prompts, separated by ^, will be cycled through after a user-specified number of video frames. If more prompt changes are requested than there are serial prompts available, the last prompt will be used. This feature is not applicable to generating single images.

'first serial prompt ^ second serial prompt'
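
To make the cycling rule concrete, here is a minimal sketch of how a serial prompt might be selected for a given frame. The function prompt_for_frame is hypothetical and is not the package's implementation; it assumes the change frames are given as a list of frame numbers and that a change takes effect on the listed frame.

```python
def prompt_for_frame(serial_prompts, change_prompts_on_frame, frame_number):
    """Pick the serial prompt that is active at frame_number.

    Assumption: each entry in change_prompts_on_frame advances to the next
    serial prompt, and the last prompt is reused once the list runs out.
    """
    changes_so_far = sum(1 for frame in change_prompts_on_frame if frame <= frame_number)
    index = min(changes_so_far, len(serial_prompts) - 1)
    return serial_prompts[index]

serial_prompts = [p.strip() for p in 'first serial prompt ^ second serial prompt'.split('^')]
for frame in (0, 30, 60):
    print(frame, prompt_for_frame(serial_prompts, [30, 60], frame))
```

With two serial prompts and two requested changes, the second change has no new prompt to move to, so the last prompt stays in use.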

Prompts may be given different weights by following them with ':float'. A weight of 1.0 is assumed if no value is provided.

'prompt 10x more weighted:1.0 | prompt with less weight:0.1'

These methods may be used in any combination.

'prompt 1:1.0 | prompt 2:0.1 | prompt 3:0.5 ^ prompt 4 | prompt 5 | prompt 6:2.0'
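
As an illustration of how such a combined string breaks down, the sketch below splits it into serial groups, parallel prompts, and weights. The parser parse_prompts is hypothetical and written only for this explanation; the package's own parsing may differ in its details.

```python
def parse_prompts(prompt_string):
    """Split a prompt string into serial groups of (text, weight) pairs.

    '^' separates serial groups, '|' separates parallel prompts, and a
    trailing ':float' sets the weight (1.0 when no weight is given).
    """
    groups = []
    for group in prompt_string.split('^'):
        parallel = []
        for part in group.split('|'):
            part = part.strip()
            text, sep, weight = part.rpartition(':')
            try:
                # Only treat the suffix as a weight if it parses as a number.
                parallel.append((text.strip(), float(weight)) if sep else (part, 1.0))
            except ValueError:
                parallel.append((part, 1.0))
        groups.append(parallel)
    return groups

example = 'prompt 1:1.0 | prompt 2:0.1 | prompt 3:0.5 ^ prompt 4 | prompt 5 | prompt 6:2.0'
for number, group in enumerate(parse_prompts(example)):
    print(number, group)
```

The example string yields two serial groups of three parallel prompts each, with unweighted prompts defaulting to 1.0.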

Parameters

There are a lot of degrees of freedom you can change when creating generative art. I describe the parameters of this package below, and try to highlight the most important considerations in the text.

The parameters used for image generation are either passed to a method of generate.py, or stored in a VQGAN_CLIP_Config instance. These two groups of configuration parameters are discussed below.

Parameters common to vqgan_clip.generate.*

These parameters are common to all of the functions of vqgan_clip.generate: image(), video_frames(), style_transfer().

Function Argument	Default	Meaning
text_prompts	'A painting of flowers in the renaissance style:0.5|rembrandt:0.5^fish:0.2|love:1'	Text prompt for image generation
image_prompts	[]	Path to image(s) that will be turned into a prompt via CLIP. The resulting image will have similar content to the prompt image(s), as evaluated by CLIP.
noise_prompts	[]	Random number seeds can be used as prompts using the same format as a text prompt. E.g. '123:0.1|234:0.2|345:0.3' Stories (^) are supported.
init_image	None	A seed image that can be used to start the training. Without an initial image, random noise will be used.
save_every	50	An interim image will be saved to the output location every save_every iterations. If you are generating a video, a frame of video will be created every save_every iterations.
output_filename	'output.jpg'	Location to save the output image file when a single file is being created. All filetypes supported by Pillow should work. Only PNG and jpg files will have metadata embedded that describes generation parameters.
verbose	False	Determines whether training diagnostics should be displayed every time a file is saved.

Parameters specific to generate.image()

Function Argument	Default	Meaning
iterations	100	Number of iterations of train() to perform before stopping and outputting the image. The resulting still image will eventually converge to an image that doesn't perceptually change much in content.

Parameters specific to generate.video_frames()

Function Argument	Default	Meaning
num_video_frames		Number of frames of video to generate.
iterations_per_frame	30	Number of iterations of train() to perform on each generated video frame
iterations_for_first_frame	100	Number of extra iterations of train() to perform on the first frame of video so the image isn't a gray field.
change_prompts_on_frame	None	All serial prompts (separated by "^") will be cycled forward on the video frames provided here. If more changes are requested than prompts are available, the last prompt is used.
generated_video_frames_path	'./video_frames'	Location where video_frames() will save output.
zoom_scale	1.0	When using zoom_video(), this parameter sets the ratio by which each frame will be zoomed in relative to the previous.
shift_x	0	When using zoom_video(), this parameter sets how many pixels each new frame will be shifted in the x direction.
shift_y	0	When using zoom_video(), this parameter sets how many pixels each new frame will be shifted in the y direction.
z_smoother	False	When True, flicker is reduced and frame-to-frame consistency is increased at the cost of some motion blur. Recent latent vectors used for image generation are combined using a modified EWMA calculation. This averages together multiple adjacent image latent vectors, giving more weight to a central frame, and exponentially less weight to preceding and succeeding frames.
z_smoother_buffer_len	5	Sets how many latent vectors (images) are combined using an EWMA. Bigger numbers will combine more images for more smoothing, but may blur rapid changes. The center element of this buffer is given the greatest weight. Must be an odd number.
z_smoother_alpha	0.7	Sets how much the adjacent latent vectors contribute to the final average. Bigger numbers mean the keyframe image will contribute more to the final output, sharpening the result and increasing flicker from frame to frame.
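
The z_smoother parameters can be pictured with a small sketch of a centered, exponentially weighted average over a buffer of latent vectors. This is only an illustration under assumed weights of alpha * (1 - alpha) ** distance_from_center, normalized to sum to 1; the function names are hypothetical and the package's modified EWMA may use a different formula.

```python
def centered_ewma_weights(buffer_len=5, alpha=0.7):
    """Weights that peak at the center frame and decay exponentially
    toward earlier and later frames (assumed form, for illustration)."""
    if buffer_len % 2 == 0:
        raise ValueError('buffer_len must be an odd number')
    center = buffer_len // 2
    raw = [alpha * (1 - alpha) ** abs(i - center) for i in range(buffer_len)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth_center(buffer, alpha=0.7):
    """Combine a buffer of equal-length latent vectors into one smoothed vector."""
    weights = centered_ewma_weights(len(buffer), alpha)
    dim = len(buffer[0])
    return [sum(w * vec[k] for w, vec in zip(weights, buffer)) for k in range(dim)]

print([round(w, 3) for w in centered_ewma_weights(5, 0.7)])
print([round(w, 3) for w in centered_ewma_weights(5, 0.9)])
```

In this sketch a larger alpha concentrates more weight on the center frame, which matches the described behavior: sharper output and more flicker. A longer buffer spreads the average over more frames.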

Parameters specific to generate.style_transfer()

Function Argument	Default	Meaning
iterations_per_frame	30	Number of iterations of train() to perform on each generated video frame
change_prompts_on_frame	None	All serial prompts (separated by "^") will be cycled forward on the video frames provided here. If more changes are requested than prompts are available, the last prompt is used.
current_source_frame_image_weight	2.0	Higher numbers make the output video look more like the input video.
current_source_frame_prompt_weight	0.0	Higher numbers make the output video look more like the content of the input video as assessed by CLIP. It treats the source frame as an image_prompt.

VQGAN_CLIP_Config

Other configuration attributes can be seen in vqgan_clip.engine.VQGAN_CLIP_Config. These options are related to the function of the algorithm itself. For example, you can change the learning rate of the GAN, or change the optimization algorithm used, or change the GPU used. Instantiate this class and customize the attributes as needed, then pass this configuration object to a method of vqgan_clip.generate. For example:

import vqgan_clip.generate
from vqgan_clip.engine import VQGAN_CLIP_Config

config = VQGAN_CLIP_Config()
config.output_image_size = [587,330]
vqgan_clip.generate.image(eng_config = config, text_prompts = 'a horse')

VQGAN_CLIP_Config Attribute	Default	Meaning
output_image_size	[256,256]	x/y dimensions of the output image in pixels. This will be adjusted slightly based on the CLIP model used. VRAM requirements increase steeply with image size. My video card with 10GB of VRAM can handle a size of [500,500], or [684,384] in 16:9 aspect ratio. Note that a lower resolution output does not look like a scaled-down version of a higher resolution output. Lower res images have less detail for CLIP to analyze and will generate different results than a higher resolution workflow.
vqgan_model_name	f'models/vqgan_imagenet_f16_16384'	Name of the pre-trained VQGAN model to be used.
vqgan_model_yaml_url	f'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1'	URL to download the .yaml file for the selected VQGAN model. Select a valid model name.
vqgan_model_ckpt_url	f'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1'	URL to download the .ckpt file for the selected VQGAN model. Select a valid model name.
model_dir	None	If set to a folder name (e.g. 'models') then model files will be downloaded to a subfolder of the current working directory. This may be helpful if your default drive, used by PyTorch, is small.
init_noise	None	Seed an image with noise. Options None, 'pixels' or 'gradient'
init_weight	0.0	A weight can be given to the initial image used so that the result will 'hold on to' the look of the starting point.
cut_method	'kornia'	Sets the method used to generate cutouts which are fed into CLIP for evaluation. 'original' is the method from the original Katherine Crowson colab notebook. 'kornia' includes additional transformations and results in images with more small details. Defaults to 'kornia'.
seed	None	Random number generator seed used for image generation. Reusing the same seed does not ensure perfectly identical output due to some nondeterministic algorithms used in PyTorch.
optimizer	'Adam'	Different optimizers are provided for training the GAN. These all perform differently, and may give you a different result. See torch.optim documentation.
init_weight_method	'original'	Method used to compare current image to init_image. 'decay' will let the output image get further from the source by flattening the original image before letting the new image evolve from the flattened source. The 'decay' method may give a more creative output for longer iterations. 'original' is the method used in the original Katherine Crowson colab notebook, and keeps the output image closer to the original input. This argument is ignored for style transfers.

Dynamic model download and caching

The VQGAN algorithm requires use of a compatible model. These models consist of a configuration file (.yaml) and a checkpoint file (.ckpt). These files are not provided with the pip installation, and must be downloaded separately. As of version 1.1 of VQGAN_CLIP_GENERATOR, these files are downloaded the first time they are used, and cached locally in the user's ~/.cache/torch/hub/models folder. Depending on the models you've used, these can take up several gigabytes of storage, so be aware that they are cached in this location. Uninstallation of this package does not remove cached files.
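
Since the cached files can grow to several gigabytes, it may help to check what is stored. The sketch below lists files in the cache folder by size. The helper cache_usage is hypothetical and not part of this package, and it assumes the default cache location named above.

```python
from pathlib import Path

def cache_usage(cache_dir=None):
    """Return (filename, size_in_bytes) for every file under the cache
    folder, largest first. Returns an empty list if the folder is missing."""
    if cache_dir is None:
        # Assumed default location, as described in the text above.
        cache_dir = Path.home() / '.cache' / 'torch' / 'hub' / 'models'
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return []
    files = [(p.name, p.stat().st_size) for p in cache_dir.rglob('*') if p.is_file()]
    return sorted(files, key=lambda item: item[1], reverse=True)

for name, size in cache_usage():
    print(f'{size / 1e9:6.2f} GB  {name}')
```

If you set model_dir in the configuration, pass that folder to the helper instead of relying on the default.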

The pretrained models are discussed in more detail by CompVis. The default model used in this package is vqgan_imagenet_f16_16384. Other models that seem to be in frequent use with VQGAN+CLIP implementations are shown below, and are all expected to be compatible. These models will have different abilities to generate content based on their training sets.

Dataset	Model Config	Model Checkpoint
VQGAN ImageNet (f=16), 1024	vqgan_imagenet_f16_1024.yaml	vqgan_imagenet_f16_1024.ckpt
VQGAN ImageNet (f=16), 16384	vqgan_imagenet_f16_16384.yaml	vqgan_imagenet_f16_16384.ckpt
S-FLCKR (f=16)	sflckr.yaml	sflckr.ckpt
COCO-Stuff (f=16)	coco_transformer.yaml	coco_transformer.ckpt

In order to use a non-default model, configure the VQGAN_CLIP_GENERATOR engine as in the example below:

config = VQGAN_CLIP_Config()
config.vqgan_model_name = 'sflckr'
config.vqgan_model_yaml_url = f'https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/files/?p=%2Fconfigs%2F2020-11-09T13-31-51-project.yaml&dl=1'
config.vqgan_model_ckpt_url = f'https://heibox.uni-heidelberg.de/d/73487ab6e5314cb5adba/files/?p=%2Fcheckpoints%2Flast.ckpt&dl=1'
vqgan_clip.generate.image(eng_config = config,
        text_prompts='an apple')

Examples

Example Scripts

The examples folder has the following scripts available for you to download and customize. These scripts combine all of the available function arguments and additional tools that I've found useful for routine use: upscaling using Real-ESRGAN, and optical flow interpolation using RIFE.

Example	Description
single_image.py	Generate a single image from a text prompt.
image_prompt.py	Generate a single image from an image prompt. The output will have the same content as the image prompt, as assessed by CLIP.
multiple_same_prompts.py	Generate a folder of images using the same prompt, but different random number seeds. This is useful to fish for interesting images.
multiple_many_prompts.py	Generate a folder of images by combining your prompt with a number of "keyword" prompts that have a big impact on image generation.
video.py	Generate images where the prompts change over time, and use those images to create a video.
zoom_video.py	Generate images where the images zoom and shift over time, and use those images to create a video.
upscaling_video.py	Demo of how to use Real-ESRGAN to upscale an existing video.
custom_zoom_video.py	Example of how you can re-use and modify the generate.video_frames() function to create unique behaviors. In this example the zooming and shifting changes over the course of the video.
style_transfer.py	Apply a VQGAN+CLIP style to an existing video. This is the showcase feature of the package (discussed below).
style_transfer_exploration.py	Explore the output of generate.style_transfer() for combinations of parameter values. Useful to dial in the settings for your style transfer, and then use the best settings to generate your video using generate.style_transfer().

Below is a brief discussion of a few specific examples.

Generating a single image from a text prompt

In the example below, an image is generated from two text prompts: "A pastoral landscape painting by Rembrandt" and "A blue fence." These prompts are given different weights during image generation, with the first weighted ten-fold more heavily than the second. This method of applying weights may optionally be used for all three types of prompts: text, images, and noise. If a weight is not specified, a weight of 1.0 is assumed.

# Generate a single image based on a text prompt
from vqgan_clip import generate, esrgan
from vqgan_clip.engine import VQGAN_CLIP_Config
import os

config = VQGAN_CLIP_Config()
config.output_image_size = [587,330]
text_prompts = 'A pastoral landscape painting by Rembrandt:1.0 | A blue fence:0.1'

generate.image(eng_config = config,
        text_prompts = text_prompts,
        iterations = 100,
        output_filename = 'output.png')

Generating a single image from a text prompt and initial image

In this example, an initial image is added to the code above, so that the GAN is seeded with this starting point. The output image will have the same aspect ratio as the initial image.

from vqgan_clip import generate, esrgan
from vqgan_clip.engine import VQGAN_CLIP_Config
import os

config = VQGAN_CLIP_Config()
config.output_image_size = [587,330]
text_prompts = 'A pastoral landscape painting by Rembrandt:1.0 | A blue fence:0.1'

generate.image(eng_config = config,
        text_prompts = text_prompts,
        iterations = 100,
        init_image = 'starting_image.jpg',
        output_filename = 'output.png')

Style Transfer

The method generate.style_transfer() will apply VQGAN+CLIP prompts to an existing video by extracting frames of video from the original and using them as inputs to create a frame of output video. The resulting frames may be combined into a video, and the original audio is optionally copied to the new file. As an example, here is a video of my face restyled with the prompt "portrait covered in spiders charcoal drawing" with 60 iterations per frame, and current_source_frame_image_weight = 3.2 (full code to generate this video).

The innovations used in this approach are:

Each frame of generated video is initialized using the previous output frame of generated video. This ensures that the next generated frame has a starting image that partially satisfies the CLIP loss function. Doing so greatly reduces the chances that the new frame of video will train toward a very different optimum, which is responsible for the characteristic flicker in most VQGAN+CLIP style transfer videos.
The training process evolves the image from the previous generated image toward the next source frame of video, thereby tracking the original source video frame-by-frame. Increasing current_source_frame_image_weight causes the output video to follow the source image more closely.
You may elect to set the current source image frame as an image prompt. This will cause the resulting output frames to have more similarity (according to CLIP) to the source frame. This is done by increasing current_source_frame_prompt_weight.

A few tips for style transfers:

If you want the output to look more like the input, your prompts should describe the original video as well as the new style. In the example above, I started with a selfie-video, and the prompt included the word "portrait." Without the word portrait, the result shifts much more toward a charcoal drawing of spiders with fewer human elements. You may also have success using the current_source_frame_prompt_weight parameter to use the source frame as an image prompt if you want to retain elements of the original video without describing the source material in a text prompt.
The parameter current_source_frame_image_weight affects how close the final pixels of the image will be to the source material. At a weight of >8, the output will be very similar to the input. At a weight of <1.0 the output will be very different from the source material. A weight of 0.0 would not track the rest of the video after the first frame, and would be very similar to the output of generate.video_frames().
The iterations_per_frame has a strong effect on the look of the output, and on the frame-to-frame consistency. At high iterations (>100) each frame has a lot of opportunity to change significantly from the previous frame. At low iterations (<5), and low current_source_frame_image_weight values, the output may not have a chance to change toward the new source material image.
For example:
- iterations_per_frame=300, current_source_frame_image_weight=0.2, is a wild ride with a lot of change from the source, and variation from frame to frame (flicker). The flicker could be smoothed by using z_smoother_alpha=0.7 or lower.
- iterations_per_frame=15, current_source_frame_image_weight=4, would be gentle styling applied to the original video.
You may have success using a lower extraction_framerate (7.5 or 15) and then using RIFE (optical flow interpolation) to interpolate the output up to 60/120 FPS.
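
For the last tip, it can be useful to work out how much interpolation a given extraction framerate needs. The sketch below assumes an interpolator that doubles the frame rate on each pass (an assumption about how you run the interpolation, so check the RIFE documentation for the options it actually offers). The function name doubling_passes is hypothetical.

```python
import math

def doubling_passes(extraction_framerate, target_framerate):
    """Number of 2x interpolation passes needed to go from the extraction
    framerate to the target framerate, assuming each pass doubles the rate."""
    ratio = target_framerate / extraction_framerate
    if ratio < 1:
        raise ValueError('target framerate must not be lower than the extraction framerate')
    passes = math.log2(ratio)
    if not passes.is_integer():
        raise ValueError('target must be a power-of-two multiple of the extraction framerate')
    return int(passes)

for source, target in ((7.5, 60), (15, 60), (15, 120)):
    print(source, '->', target, 'fps:', doubling_passes(source, target), 'passes')
```

Fewer extracted frames mean fewer frames to style, at the cost of relying more heavily on interpolation to recover smooth motion.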

Custom scripts

The generate.py file contains the common functions that users are expected to use to create content. However, you should feel free to copy methods from this file and customize them for your own projects. The code in generate.py is still pretty high level, with the implementation details buried in engine and _functional. I've provided an example file where I just extracted the zoom_video_frames method and turned it into a script so that you can see how you might make some creative changes. A few ideas:

Change the image prompt weights over time to create smoother content transitions
Change the interval at which video frames are exported over time, to create the effect of speeding or slowing video
Create style transfer videos where each frame uses many image prompts, or many previous frames as image prompts
Create a zoom video where the shift_x and shift_y change over time to create spiraling zooms, or the look of camera movements
It's art. Go nuts!

Support functions

A few supporting tools are provided. Documentation is provided in docstrings, and the examples folder demonstrates their use.

Functions

Function	Purpose
video_tools.extract_video_frames()	Wrapper for ffmpeg to extract video frames.
video_tools.copy_video_audio()	Wrapper for ffmpeg to copy audio from one video to another.
video_tools.encode_video()	Wrapper for ffmpeg to encode a folder of images to a video.
esrgan.inference_realesrgan()	Functionalized version of the Real-ESRGAN inference_realesrgan.py script for upscaling images.

Function arguments

Function Argument	Default	Meaning
extraction_framerate	30	When extracting video frames from an existing video, this sets how many frames per second will be extracted. Interpolation will be used if the video's native framerate differs.
extracted_video_frames_path	'./extracted_video_frames'	Location where extract_video_frames will save extracted frames of video from the source file.
input_framerate	30	When combining still images to make a video, this parameter can be used to force an assumed original framerate. For example, you could assume you started with 10fps, and interpolate to 60fps using ffmpeg or RIFE.
output_framerate	None	Desired framerate of the output video from encode_video. If omitted, the input_framerate will be used. If supplied, ffmpeg will interpolate the video to the new output framerate. If you are using RIFE for optical flow interpolation, it is not recommended to first interpolate with ffmpeg.

Metadata

A record of the settings and prompts used is stored within all generated PNG, JPEG, and MP4 files created by this library. On Windows, you can access this information by right clicking a file, choosing Properties, and selecting the Details tab. Any text prompts are stored as a Title. All of the configuration parameters and values are stored in the Comments field.

Troubleshooting

RuntimeError: CUDA out of memory

For example:

RuntimeError: CUDA out of memory. Tried to allocate 150.00 MiB (GPU 0; 23.70 GiB total capacity; 21.31 GiB already allocated; 78.56 MiB free; 21.70 GiB reserved in total by PyTorch)

Your request doesn't fit into your GPU's VRAM. Reduce the image size and/or number of cuts.
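
One way to choose a smaller size is to start from the typical VRAM figures listed in the Environment section. The sketch below encodes that list as a lookup. The helper suggested_square_size is hypothetical, and the figures are rough guides rather than guarantees, since memory use also depends on the number of cuts and other settings.

```python
# (VRAM in GB, square image side in pixels), copied from the
# "Typical VRAM requirements" list in the Environment section.
TYPICAL_VRAM = [(24, 900), (16, 700), (10, 512), (8, 380)]

def suggested_square_size(vram_gb):
    """Return the largest listed square size that typically fits in
    vram_gb of VRAM, or None if even the smallest listed size may not fit."""
    for required_gb, side in TYPICAL_VRAM:
        if vram_gb >= required_gb:
            return side
    return None

side = suggested_square_size(10)
print(side)  # could then be used as config.output_image_size = [side, side]
```

If the error persists at the suggested size, step down to the next smaller entry.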

Citations

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford and Ilya Sutskever and Jong Wook Kim and Gretchen Krueger and Sandhini Agarwal},
    year   = {2021}
}
@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
@InProceedings{wang2021realesrgan,
    author    = {Xintao Wang and Liangbin Xie and Chao Dong and Ying Shan},
    title     = {Real-ESRGAN: Training Real-World Blind Super-Resolution with Pure Synthetic Data},
    booktitle = {International Conference on Computer Vision Workshops (ICCVW)},
    date      = {2021}
}
@article{huang2020rife,
  title={RIFE: Real-Time Intermediate Flow Estimation for Video Frame Interpolation},
  author={Huang, Zhewei and Zhang, Tianyuan and Heng, Wen and Shi, Boxin and Zhou, Shuchang},
  journal={arXiv preprint arXiv:2011.06294},
  year={2020}
}

Katherine Crowson - https://github.com/crowsonkb
NerdyRodent - https://github.com/nerdyrodent/

Public Domain images from Open Access Images at the Art Institute of Chicago - https://www.artic.edu/open-access/open-access-images

Comments

RuntimeError: requires_grad_ is not supported on ScriptModules

Hi, I am trying to run your colab. I believe it's a crucial node in the system because of its integration of optical flow interpolation for the AI effects. I am getting this error in the "Style transfer to an existing video" node

/usr/local/lib/python3.7/dist-packages/torch/jit/_script.py in fail(self, *args, **kwargs) 912 def _make_fail(name): 913 def fail(self, *args, **kwargs): --> 914 raise RuntimeError(name + " is not supported on ScriptModules") 915 916 return fail

RuntimeError: requires_grad_ is not supported on ScriptModules

opened by timeFliesWhenYoureHavingFun 5

using style-transfer.py to toy around and getting error

(vqgan3) C:\Apps\vqgan-clip-generator\examples>python "style_transfer - Copy.py"
Traceback (most recent call last):
  File "C:\Apps\vqgan-clip-generator\examples\style_transfer - Copy.py", line 34, in <module>
    original_video_frames = video_tools.extract_video_frames(input_video_path,
  File "C:\Users\BRi7X\anaconda3\envs\vqgan3\lib\site-packages\vqgan_clip\video_tools.py", line 29, in extract_video_frames
    raise ValueError(f'input_video_path must be a directory')
ValueError: input_video_path must be a directory

(vqgan3) C:\Apps\vqgan-clip-generator\examples>

I've even gone into the python script and changed what I thought this field should be many times, but the video tools script still doesn't like my answer. Might you have any suggestions? Thank you so much! This seems like it's going to become one of my favorite implementations. NerdyRodent's is good, but this seems to have smoothing for video frames which I'm quite into! ^_^

EDIT: So I decided to keep trying several more times, finally landing on just moving the video into the examples directory and setting just the filename as input_video_path, and that appears to have worked, we shall see how it goes.

Edit 2: So the first 200 iterations for the first video frame worked, but then the script halted with this:

(vqgan3) C:\Apps\vqgan-clip-generator\examples>python "style_transfer - Copy.py"
Traceback (most recent call last):
  File "C:\Apps\vqgan-clip-generator\examples\style_transfer - Copy.py", line 39, in <module>
    metadata_comment = generate.style_transfer(original_video_frames,
  File "C:\Users\BRi7X\anaconda3\envs\vqgan3\lib\site-packages\vqgan_clip\generate.py", line 537, in style_transfer
    output_size_X, output_size_Y = VF.filesize_matching_aspect_ratio(video_frames[0], eng_config.output_image_size[0], eng_config.output_image_size[1])
  File "C:\Users\BRi7X\anaconda3\envs\vqgan3\lib\site-packages\vqgan_clip\_functional.py", line 351, in filesize_matching_aspect_ratio
    img=Image.open(files[0])
IndexError: list index out of range

Is this due to something I either didn't modify correctly or missed entirely? I'm also getting a series of errors with restyle_video.py though I'll hold off on that so as to not overwhelm anyone responding to this.

opened by Byronius7X 5

Location of images saved changed for single images output/output?

This block of code

vqgan_clip.generate.single_image(eng_config = config,
        text_prompts = text_prompts,
        iterations = 50,
        save_every = 10,
        output_filename = 'output' + os.sep + 'man1')

Saves in a folder output/output/man1.png

this block of code

vqgan_clip.generate.single_image(eng_config = config,
        text_prompts = text_prompts,
        iterations = 50,
        save_every = 10,
        output_filename = 'man1')

saves it in the root folder.. not output folder..

opened by gateway 5

Unknown encoder 'libx265' from FFmpeg

I'm not sure if this is a Windows 10 thing, but the conda install of your original ffmpeg didn't seem to pick up libx265 for some reason. I had to install this version to get it to work:

conda install -c conda-forge ffmpeg

can we also choose h264 as well?

opened by gateway 4

ESRGAN issue... ??

Ok, I have never seen this before, but I was running an image through my art types and Real-ESRGAN got to about 85% and then barfed out with this error:

Real-ESRGAN:  85%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌                         | 28/33 [00:39<00:07,  1.40s/image]
Traceback (most recent call last):
  File "C:\Users\stiet\Desktop\Work\AIStuff\VQGAN\single_image.py", line 86, in <module>
    esrgan.inference_realesrgan(input=generated_images_path,
  File "C:\Users\stiet\anaconda3\envs\vqgan1\lib\site-packages\vqgan_clip\esrgan.py", line 99, in inference_realesrgan
    if len(img.shape) == 3 and img.shape[2] == 4:
AttributeError: 'NoneType' object has no attribute 'shape'

code

if upscale_images:
    esrgan.inference_realesrgan(input=generated_images_path,
                                output_images_path=upscaled_video_frames_path,
                                face_enhance=face_enhance,
                                purge_existing_files=True,
                                netscale=4,
                                outscale=4)

Im running the latest package...

opened by gateway 3

Init image help text may be slightly incorrect

From this line, where it says random noise will be used if no init image -- is that correct? This seems to conflict with lower down where the default for init_noise is none.

opened by halr9000 2

Warning can only create PNG files.

I noticed with the 1.2 build when I run my script it throws a warning out.

C:\Users\stiet\anaconda3\envs\vqgan1\lib\site-packages\vqgan_clip\generate.py:579: UserWarning: vqgan_clip_generator can only create and save .PNG files.
  warnings.warn('vqgan_clip_generator can only create and save .PNG files.')

block of code uses in this script.

vqgan_clip.generate.single_image(eng_config = config,
        text_prompts = text_prompts,
        iterations = 10,
        save_every = 2,
        output_filename = 'man2')

opened by gateway 2

encode_video() got an unexpected keyword argument 'metadata'

Just following one of the video examples on the front page read me.. which has the metadata listed but seems to be failing when running this..

# Use a wrapper for FFMPEG to encode the video.
video_tools.encode_video(output_file=os.path.join('output','zoom_video1.mp4'),
        metadata=text_prompts,
        output_framerate=60,
        input_framerate=30)

error

    video_tools.encode_video(output_file=os.path.join('output','zoom_video1.mp4'),
TypeError: encode_video() got an unexpected keyword argument 'metadata'

opened by gateway 1

Readme has image output as .jpg but aren't they PNGs?

I was looking at the readme today and noticed output_filename = 'output.jpg') in various parts of the examples, but aren't you outputting a proper PNG? :? :) Typo?

opened by gateway 1
Not an Issue

Btw thank you so much for taking the time to do this. I have been playing around with the prior code. I'm surprised I got this to run on windows as well (my Titan RTX is on windows).

Would you mind enabling the discussion tab in github for this?

Oh I did notice one issue on windows.. the output filename blows if you have things like ^ or | type of separators in your text prompt.

One question about video.. can you add a way to set a downscale size.. I have a bunch of 4k footage I wanted to test this on a downscale option when extracting the images from the video would be awesome.

I think the VF option in ffmpeg works.. -vf scale=192:168

Thoughts?

opened by gateway 1
2.1

This release adds support for multiple export filetypes in addition to PNG. Exports to JPEG or PNG will have metadata embedded that describes the media generation settings. PNG files have already had metadata stored in PNG data chunks. JPG files, available in 2.1, have metadata stored in the exif fields XPTitle and XPComment. Other export filetypes are supported for still images, provided they are types supported by Pillow.

I ran a lot of side-by-side comparisons of different cut_method approaches, and found the 'kornia' method produces more interesting small details in results, as compared to the 'original' and 'sg3' methods. I've changed the default cut_method to 'kornia'. You can get the old behavior back by setting config.cut_method='original' if desired.

opened by rkhamilton 0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.

Hi there I get this error in the Download dependencies cell:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. tensorflow 2.8.0 requires tf-estimator-nightly==2.8.0.dev2021122109, which is not installed. albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, which is not installed.

Any insights?

opened by soklamon 0
ImportError: cannot import name 'get_num_classes'

----> 3 from vqgan_clip import generate, esrgan, video_tools ImportError: cannot import name 'get_num_classes' from 'torchmetrics.utilities.data' (/usr/local/lib/python3.7/dist-packages/torchmetrics/utilities/data.py)

Im a newb so struggling (running on colab)

I did get it working at some point earlier.. but foolishly overwrite my changes and now cant remember what i changed to get it working..

any help appreciated thanks for great package by the way..

opened by dave20012 4

Releases(v2.3.2)

v2.3.2(Nov 11, 2021)
Bug Fixes

style_transfer did not correctly pass an init_weight to image(), resulting in non-changing video output. Fixes issue #61.

v2.3.1(Nov 11, 2021)
Bug Fixes

extract_video_frames() had a misleading error message, which is clarified.

restyle_video_naive.py had some bugs because it hadn't been kept up with API changes. It should run now, but it is still using the deprecated restyle_video function, so you should move to style_transfer.py as your example starting point.

extract_video_frames() will do a better job purging existing files from the extraction folder.

v2.3.0(Nov 7, 2021)
added init_weight as option to generate.image(). It works the same as in the other functions, but was missing from image.

Bug Fixes

esrgan.inference_realesrgan() will now raise an exception if it is unable to load an image from the passed file or folder.

generate.* functions now check for valid function inputs, and will raise exceptions appropriately.

v2.2.0(Nov 5, 2021)

This release changes style_transfer to work better at low iterations_per_frame. Previously it was resetting the gradient (training) with each new frame of video. Now it is preserved.

The cut_method is also now saved to media metadata.
v2.1.1(Nov 4, 2021)
Bug Fixes

generate.video_frames was not working for low iterations_per_frame. This is now corrected for non-zooming videos. As long as zoom_scale==1.0, and shift_x and shift_y are 0, you can freely set iterations_per_frame to 1 and get expected results.

Known Issues

There is still an issue with iterations_per_frame < ~5 when zooming or shifting, that is, when zoom_scale is not 1.0 or shift_x or shift_y is nonzero. It takes more iterations_per_frame than expected to see progress in the result. For the time being, use a higher iterations_per_frame when you are using these parameters than when you are not.
v2.1.0(Nov 3, 2021)
This release adds support for multiple export filetypes in addition to PNG. Exports to JPEG or PNG will have metadata embedded that describes the media generation settings. PNG files have already had metadata stored in PNG data chunks. JPG files, available in 2.1, have metadata stored in the exif fields XPTitle and XPComment. Other export filetypes are supported for still images, provided they are types supported by Pillow.

I ran a lot of side-by-side comparisons of different cut_method approaches, and found the 'kornia' method produces more interesting small details in results, as compared to the 'original' and 'sg3' methods. I've changed the default cut_method to 'kornia'. You can get the old behavior back by setting config.cut_method='original' if desired. I'm doing more detailed comparisons into ways to create cuts and exploring alternatives for the future.

API changes

Engine.save_current_output() argument png_info, type changed to img_metadata.

_functional.copy_PNG_metadata replaced with _functional.copy_image_metadata. This function handles jpg and png files.

Bug Fixes

Real-ESRGAN script was not handling folders of complex filenames with spaces and special characters.

Fix for extracting video from folders with long paths with spaces.

Improvements to progress bar accuracy for generate.video_frames().

Fixed regression in RIFE wrapper. Now tested and working on Google Colab.

The tqdm progressbar has been updated to work correctly in Jupyter notebooks.

video_tools.encode_video() fixed to work on linux systems (Google Colab).

v2.0.0(Oct 29, 2021)
This release introduces major improvements to style transfers, in which VQGAN style is applied to an existing video. The improvements should result in videos that are more consistent from frame to frame (less flicker). Associated with the style transfer improvements, there are major changes in the video generation API to make it easier to calculate video durations.

API changes

generate.style_transfer added with the new video generation features.

generate.zoom_video_frames and generate.video_frames have been combined to a single function: generate.video_frames. If you do not specify zoom_scale, shift_x, or shift_y, these values default to 0, and non-zooming images are generated.

generate.video_frames arguments changed. iterations and save_every are removed. New arguments are provided to make it easier to calculate video durations.

num_video_frames : Set the number of video frames (images) to be generated.

iterations_per_frame : Set the number of vqgan training iterations to perform for each frame of video. Higher numbers are more stylized.

generate.multiple_images removed. Functionally it was identical to repeatedly running generate.single_image

generate.single_image renamed to generate.image

generate.single_image argument change_prompt_every is removed. It is not relevant for generating a single image.

generate.restyle_video renamed to generate.restyle_video_legacy. It will be removed in a future version.

generate.restyle_video_naive removed. Use generate.style_transfer instead.

video_tools.RIFE_interpolation added as a wrapper to the arXiv2020-RIFE inference_video.py script.
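
The num_video_frames and iterations_per_frame arguments described above make the duration arithmetic simple. As a minimal sketch, assuming the frames are later encoded at a fixed input framerate (the helper names here are hypothetical and not part of the package):

```python
def video_duration_seconds(num_video_frames: int, input_framerate: float) -> float:
    """Length of the encoded video in seconds at the given input framerate."""
    return num_video_frames / input_framerate


def total_training_iterations(num_video_frames: int, iterations_per_frame: int) -> int:
    """Total VQGAN training iterations needed to render all frames."""
    return num_video_frames * iterations_per_frame


# 150 frames encoded at 30 fps, with 15 iterations per frame
print(video_duration_seconds(150, 30))      # 5.0
print(total_training_iterations(150, 15))   # 2250
```

Working backwards, a target duration multiplied by the framerate gives the num_video_frames to request.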

New Features

generate.zoom_video lets you specify specific video frames where prompts should be changed using the argument change_prompts_on_frame. E.g. to change prompts on frames 150 and 200, use change_prompts_on_frame = [150,200]. Examples are updated with this argument.

video_tools now sets ffmpeg to output on error only
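
One way to picture change_prompts_on_frame is as a lookup from frame number to prompt index. The sketch below is an illustration only: the helper name is hypothetical, and it assumes the new prompt takes effect on the listed frame itself and that the last prompt is held once the list is exhausted.

```python
from bisect import bisect_right
from typing import Sequence


def prompt_index_for_frame(frame: int,
                           change_prompts_on_frame: Sequence[int],
                           num_prompts: int) -> int:
    """Return which prompt is active on a given video frame."""
    # Count how many change points are at or before this frame.
    idx = bisect_right(change_prompts_on_frame, frame)
    # Hold the final prompt instead of running past the end of the list.
    return min(idx, num_prompts - 1)


prompts = ['first prompt', 'second prompt', 'third prompt']
for frame in (0, 149, 150, 199, 200):
    print(frame, prompts[prompt_index_for_frame(frame, [150, 200], len(prompts))])
```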

Bug Fixes

The upscaling video example file had a bug in the ffmpeg command.

The generate.encode_video method was not producing a file with the expected framerate.

Many problems were resolved that impacted paths that included spaces. In general, be sure to pass f-strings as paths (f'my path{os.sep}here').
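
As a small illustration of the path advice above, the f-string form and os.path.join produce the same path. The sketch also shows a general Python technique (not something the release notes describe): passing command arguments to subprocess as a list keeps a path containing spaces as a single argument. The path 'my path/here' is just the example from the note.

```python
import os
import subprocess
import sys

# f-string with os.sep, as recommended above
frames_dir = f'my path{os.sep}here'
same_dir = os.path.join('my path', 'here')
print(frames_dir == same_dir)

# A list of arguments is passed through without shell word-splitting,
# so the space in the path does not break it into two arguments.
result = subprocess.run(
    [sys.executable, '-c', 'import sys; print(sys.argv[1])', frames_dir],
    capture_output=True, text=True, check=True)
echoed = result.stdout.strip()
print(echoed)
```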

v1.3.0(Oct 24, 2021)

This release adds smoothing to the output of video_frames and restyle_video_frames. The smoothing is done by combining a user-specifiable number of latent vectors (z) and averaging them together using a modified exponentially weighted moving average (EWMA). The approach used here creates a sliding window of z frames (of z_smoothing_buffer length). The center of this window is considered the key frame, and has the greatest weight in the result. As frames move away from the center of the buffer, they have exponentially decreasing weight, by factor (1-z_smoothing_alpha)**offset_from_center.

To increase the temporal smoothing, increase the buffer size. To increase the weight of the key frame of video, increase the z_smoothing_alpha. More smoothing will combine adjacent z vectors, which will blur rapid motion from frame to frame.
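
The weighting described above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: latent vectors are stood in for by plain lists of floats, the function names are hypothetical, and normalizing the weights so they sum to 1 is an assumption.

```python
def center_weights(z_smoothing_buffer: int, z_smoothing_alpha: float) -> list:
    """Weights that decay by (1 - alpha) ** offset_from_center, normalized to sum to 1."""
    center = z_smoothing_buffer // 2
    raw = [(1 - z_smoothing_alpha) ** abs(i - center)
           for i in range(z_smoothing_buffer)]
    total = sum(raw)
    return [w / total for w in raw]


def smooth(z_buffer: list, z_smoothing_alpha: float) -> list:
    """Weighted average of a window of latent vectors; the center frame is the key frame."""
    weights = center_weights(len(z_buffer), z_smoothing_alpha)
    dim = len(z_buffer[0])
    return [sum(w * z[k] for w, z in zip(weights, z_buffer)) for k in range(dim)]


print(center_weights(3, 0.5))                 # [0.25, 0.5, 0.25]
print(smooth([[0.0], [1.0], [0.0]], 0.5))     # [0.5]
```

With a larger alpha the neighbors' weights shrink faster, so the key frame dominates; with a longer buffer more neighboring frames contribute.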
v1.2.2(Oct 23, 2021)
Test coverage increased to include all generate, esrgan, and video_tools functions.

Bug Fixes

generate.extract_video_frames was still saving jpgs. Changed to only save png.

v1.2.1(Oct 22, 2021)
New features:

Video metadata is encoded by the encode_video function in the title (text prompts) and comment (generator parameters) fields.

Bug Fixes

generate.restyle_video* functions no longer re-load the VQGAN network each frame, which results in a 300% speed-up in running this function. This means that training doesn't start over each frame, so the output will look somewhat different than in earlier versions.

generate functions no longer throw a warning when the output file argument doesn't have an extension.

v1.2.0 introduced a bug where images were saved to output/output/filename. This is fixed.

v1.2.0(Oct 22, 2021)
Important change to handling initial images

I discovered that the code that I started from had a major deviation in how it handled initial images, which I carried over in my code. The expected behavior is that passing any value for init_weight would drive the algorithm to preserve the original image in the output. The code I was using had changed this behavior completely to an (interesting) experimental approach so that the initial image feature was putting pressure on the output to drive it to an all grayscale, flat image, with a decay of this effect with iteration. If you set the init_weight very high, instead of ending up with your initial image, you would get a flat gray image.

The line of code used in all other VQGAN+CLIP repos returns the difference between the output tensor z (the current output image) and the original output tensor (original image):

F.mse_loss(self._z, self._z_orig) * self.conf.init_weight / 2

The line of code used in the upstream copy that I started from is very different, with an effect that decreases with more iterations:

F.mse_loss(self._z, torch.zeros_like(self._z_orig)) * ((1/torch.tensor(iteration_number*2 + 1))*self.conf.init_weight) / 2
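
To see how the two loss terms differ, here is a plain-Python sketch of both formulas. It is only an illustration: lists of floats stand in for the tensors, mse is a hand-written mean squared error rather than torch's F.mse_loss, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def init_loss_original(z, z_orig, init_weight):
    """Penalizes distance from the original latent; zero when z == z_orig."""
    return mse(z, z_orig) * init_weight / 2


def init_loss_decay(z, z_orig, init_weight, iteration_number):
    """Penalizes distance from an all-zero latent, fading as iterations increase."""
    zeros = [0.0] * len(z_orig)
    return mse(z, zeros) * (init_weight / (iteration_number * 2 + 1)) / 2


z_orig = [1.0, 1.0]
# Even when the output still equals the initial image, the 'decay' term is nonzero,
# because it measures distance from zero rather than from z_orig.
print(init_loss_original(z_orig, z_orig, 1.0))
print(init_loss_decay(z_orig, z_orig, 1.0, iteration_number=0))
print(init_loss_decay(z_orig, z_orig, 1.0, iteration_number=10))
```

The first value is zero, while the other two are positive and shrink with the iteration number, which matches the description of a pull toward a flat image that decays over time.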

New features:

Alternate methods for maintaining init_image are provided.

'decay' is the method used in this package from v1.0.0 through v1.1.3, and remains the default. This gives a more stylized look. Try values of 0.1-0.3.

'original' is the method from the original Katherine Crowson colab notebook, and is in common use in other notebooks. This gives a look that stays closer to the source image. Try values of 1-2.

specify the method using config.init_weight_method = 'original' if desired, or config.init_weight_method = 'decay' to specify the default.

Story prompts no longer cycle back to the first prompt when the end is reached.

encode_video syntax change. input_framerate is now required. As before, if output_framerate differs from input_framerate, interpolation will be used.

PNG outputs have data chunks added which describe the generation conditions. You can view these properties using imagemagick. "magick identify -verbose my_image.png"

v1.1.3(Oct 19, 2021)
Bug Fixes

generate.restyle_video* functions now no longer rename the source files. Original filenames are preserved. As part of this fix, the video_tools.extract_video_frames() now uses a different naming format, consistent with generate.restyle_video. All video tools now use the filename frames_%12d.png.

v1.1.2(Oct 19, 2021)
When generating videos, the pytorch random number generator was getting a new seed every frame of video, instead of keeping the same seed. This is now fixed, and video is more consistent from frame to frame.


v1.1.1(Oct 19, 2021)

By user request, it is now possible to set an Engine.conf.model_dir to store downloaded models in a subfolder of the current working directory.

esrgan.inference_realesrgan(input='.\\video_frames',
        output_images_path='upscaled_video_frames',
        face_enhance=False,
        model_dir='models')

config = VQGAN_CLIP_Config()
config.model_dir = 'models'
generate.single_image(eng_config = config,
        image_prompts = 'input_image.jpg',
        iterations = 500,
        save_every = 10,
        output_filename = output_filename)


v1.1.0(Oct 19, 2021)
This is a significant change that breaks compatibility.

New features:

Real-ESRGAN integration for upscaling images and video. This can be used on generated media or existing media.

In order to accommodate flexible upscaling, all generate.*_video() methods have been changed to only generate folders of images (and renamed generate.*_video_frames()). You will need to optionally include a call to the upscaler, followed by a call to the video encoder.

All examples have been updated to include upscaling.

Model files for VQGAN and Real-ESRGAN are dynamically downloaded and cached in your pytorch hub folder instead of your working folder ./models subfolder. You will provide a URL and filename for the model to the vqgan_clip_generator.Engine object, and if there is no local copy available it will be downloaded and used. If a local copy has already been downloaded, it will not be downloaded again. This should give you a cleaner project folder / working directory, and allow model reuse across multiple project folders.

These files will need to be manually removed when you uninstall vqgan_clip_generator. On Windows, model files are stored in ~\.cache\torch\hub\models

You can copy your existing downloaded model files to ~\.cache\torch\hub\models and they will be used and not re-downloaded.

The initial image (init_image) used to initialize VQGAN+CLIP image generation has been moved to be an argument of the generate.* methods, instead of being accessed as part of Engine configuration. It was confusing that initializing image generation required accessing Engine config. The philosophy is that Engine config should not need to be touched in most cases except to set your output image size. Internally, generate.* methods just copy the init_image to the Engine config structure, but it seemed clearer to expose this as a generate.* argument.
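
The download-once behavior of the model cache described above can be sketched as a simple existence check. This is an assumption-laden illustration: the helper names and the file name example_model.ckpt are hypothetical, and the default folder is taken from the ~\.cache\torch\hub\models path mentioned in these notes rather than from the package's code.

```python
import tempfile
from pathlib import Path
from typing import Optional


def cached_model_path(filename: str, hub_models_dir: Optional[Path] = None) -> Path:
    """Where a model file would live in the cache folder."""
    if hub_models_dir is None:
        # Default location as described in the release notes (assumption).
        hub_models_dir = Path.home() / '.cache' / 'torch' / 'hub' / 'models'
    return hub_models_dir / filename


def needs_download(filename: str, hub_models_dir: Optional[Path] = None) -> bool:
    """True when no local copy exists yet, so the file would have to be fetched."""
    return not cached_model_path(filename, hub_models_dir).is_file()


# Demonstrate with a temporary folder instead of the real cache.
with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    before = needs_download('example_model.ckpt', tmp_dir)
    (tmp_dir / 'example_model.ckpt').touch()   # simulate a copied or downloaded file
    after = needs_download('example_model.ckpt', tmp_dir)

print(before, after)  # True False
```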

Known issues:

Story prompts aren't working when restyling videos. Only the initial prompts (before the ^) are used. I need to change the prompt cycling to be based on video frame, not iteration, since the iterations reset for each frame.

Unit tests don't cover Real-ESRGAN yet.

The Colab notebook isn't fully tested for these changes yet.

v1.0.0(Oct 18, 2021)

First feature-complete release. Be aware that a significant change is planned for v1.1 that will break compatibility for video generation. It will also introduce easy (hopefully) integration with Real-ESRGAN for upscaling.