Official Implementation for "ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement"


ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement

Recently, the power of unconditional image synthesis has significantly advanced through the use of Generative Adversarial Networks (GANs). The task of inverting an image into its corresponding latent code of the trained GAN is of utmost importance as it allows for the manipulation of real images, leveraging the rich semantics learned by the network. Recognizing the limitations of current inversion approaches, in this work we present a novel inversion scheme that extends current encoder-based inversion methods by introducing an iterative refinement mechanism. Instead of directly predicting the latent code of a given image using a single pass, the encoder is tasked with predicting a residual with respect to the current estimate of the inverted latent code in a self-correcting manner. Our residual-based encoder, named ReStyle, attains improved accuracy compared to current state-of-the-art encoder-based methods with a negligible increase in inference time. We analyze the behavior of ReStyle to gain valuable insights into its iterative nature. We then evaluate the performance of our residual encoder and analyze its robustness compared to optimization-based inversion and state-of-the-art encoders.

Different from conventional encoder-based inversion techniques, our residual-based ReStyle scheme incorporates an iterative refinement mechanism to progressively converge to an accurate inversion of real images. For each domain, we show the input image on the left followed by intermediate inversion outputs.


Official Implementation of our ReStyle paper for both training and evaluation. ReStyle introduces an iterative refinement mechanism which can be applied over different StyleGAN encoders for solving the StyleGAN inversion task.
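The iterative refinement described above can be illustrated with a toy sketch. This is not the repository's training or inference code: `restyle_invert`, `toy_generator`, and `toy_encoder` are hypothetical stand-ins, and images and latents are plain lists of floats. It only shows the control flow: start from an average latent, let the encoder see both the target and the current reconstruction, and add the predicted residual to the running estimate.

```python
from typing import Callable, List

Vector = List[float]


def restyle_invert(
    x: Vector,
    encoder: Callable[[Vector, Vector], Vector],
    generator: Callable[[Vector], Vector],
    w_avg: Vector,
    n_iters: int = 5,
) -> List[Vector]:
    """Return the latent estimate after each refinement step."""
    w = list(w_avg)          # initial estimate: the average latent
    y = generator(w)         # initial reconstruction
    latents = []
    for _ in range(n_iters):
        delta = encoder(x, y)                        # residual w.r.t. current estimate
        w = [w_i + d_i for w_i, d_i in zip(w, delta)]
        y = generator(w)                             # reconstruction fed to the next step
        latents.append(list(w))
    return latents


# Toy stand-ins (assumptions for illustration only).
def toy_generator(w: Vector) -> Vector:
    return [2.0 * w_i for w_i in w]


def toy_encoder(x: Vector, y: Vector) -> Vector:
    return [0.25 * (x_i - y_i) for x_i, y_i in zip(x, y)]


target = [2.0, -4.0]
steps = restyle_invert(target, toy_encoder, toy_generator, w_avg=[0.0, 0.0])
# Largest reconstruction error after each step.
errors = [
    max(abs(t - g) for t, g in zip(target, toy_generator(w)))
    for w in steps
]
print(errors)
```

In this toy setup the reconstruction error halves at every step, which mirrors the self-correcting behavior in spirit only; a real encoder is a trained network operating on images.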

Getting Started


  • Linux or macOS
  • NVIDIA GPU + CUDA CuDNN (CPU may be possible with some modifications, but is not inherently supported)
  • Python 3


  • Dependencies:
    We recommend running this repository using Anaconda. All dependencies for defining the environment are provided in environment/restyle_env.yaml.

Pretrained Models

In this repository, we provide pretrained ReStyle encoders applied over the pSp and e4e encoders across various domains.

Please download the pretrained models from the following links.

ReStyle + pSp

| Path | Description |
|------|-------------|
| FFHQ - ReStyle + pSp | ReStyle applied over pSp trained on the FFHQ dataset. |
| Stanford Cars - ReStyle + pSp | ReStyle applied over pSp trained on the Stanford Cars dataset. |
| LSUN Church - ReStyle + pSp | ReStyle applied over pSp trained on the LSUN Church dataset. |
| AFHQ Wild - ReStyle + pSp | ReStyle applied over pSp trained on the AFHQ Wild dataset. |

ReStyle + e4e

| Path | Description |
|------|-------------|
| FFHQ - ReStyle + e4e | Coming Soon! |
| Stanford Cars - ReStyle + e4e | Coming Soon! |
| LSUN Church - ReStyle + e4e | Coming Soon! |
| AFHQ Wild - ReStyle + e4e | Coming Soon! |
| LSUN Horse - ReStyle + e4e | ReStyle applied over e4e trained on the LSUN Horse dataset. |

Auxiliary Models

In addition, we provide various auxiliary models needed for training your own ReStyle models from scratch.
This includes the StyleGAN generators and pre-trained models used for loss computation.

| Path | Description |
|------|-------------|
| FFHQ StyleGAN | StyleGAN2 model trained on FFHQ with 1024x1024 output resolution. |
| LSUN Car StyleGAN | StyleGAN2 model trained on LSUN Car with 512x384 output resolution. |
| LSUN Church StyleGAN | StyleGAN2 model trained on LSUN Church with 256x256 output resolution. |
| LSUN Horse StyleGAN | StyleGAN2 model trained on LSUN Horse with 256x256 output resolution. |
| AFHQ Wild StyleGAN | StyleGAN-ADA model trained on AFHQ Wild with 512x512 output resolution. |
| IR-SE50 Model | Pretrained IR-SE50 model taken from TreB1eN for use in our ID loss and encoder backbone on human facial domain. |
| ResNet-34 Model | ResNet-34 model trained on ImageNet taken from torchvision for initializing our encoder backbone. |
| MoCov2 Model | Pretrained ResNet-50 model trained using MOCOv2 for computing MoCo-based loss on non-facial domains. The model is taken from the official implementation. |
| CurricularFace Backbone | Pretrained CurricularFace model taken from HuangYG123 for use in ID similarity metric computation. |
| MTCNN | Weights for MTCNN model taken from TreB1eN for use in ID similarity metric computation. (Unpack the tar.gz to extract the 3 model weights.) |

Note: all StyleGAN models are converted from the official TensorFlow models to PyTorch using the conversion script from rosinality.

By default, we assume that all auxiliary models are downloaded and saved to the directory pretrained_models. However, you may use your own paths by changing the necessary values in configs/


Preparing your Data

In order to train ReStyle on your own data, you should perform the following steps:

  1. Update configs/ with the necessary data paths and model paths for training and inference.
dataset_paths = {
    'train_data': '/path/to/train/data',
    'test_data': '/path/to/test/data',
}
  2. Configure a new dataset under the DATASETS variable defined in configs/. There, you should define the source/target data paths for the train and test sets as well as the transforms to be used for training and inference.
DATASETS = {
	'my_data_encode': {
		'transforms': transforms_config.EncodeTransforms,   # can define a custom transform, if desired
		'train_source_root': dataset_paths['train_data'],
		'train_target_root': dataset_paths['train_data'],
		'test_source_root': dataset_paths['test_data'],
		'test_target_root': dataset_paths['test_data'],
	},
}
  3. To train with your newly defined dataset, simply use the flag --dataset_type my_data_encode.
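Before launching a long training run, it can help to sanity-check a new dataset entry. The following is a minimal sketch, not part of the repository: `missing_dataset_keys` and `EncodeTransformsStub` are hypothetical names, and the stub only stands in for the transforms class so that the example runs on its own. The key names are the ones shown in the steps above.

```python
from typing import Any, Dict, List

# Keys used by the dataset entry shown in the steps above.
REQUIRED_KEYS = (
    "transforms",
    "train_source_root",
    "train_target_root",
    "test_source_root",
    "test_target_root",
)


def missing_dataset_keys(entry: Dict[str, Any]) -> List[str]:
    """Return the required keys that are absent from a dataset entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


class EncodeTransformsStub:
    """Placeholder for the real transforms class (assumption for this sketch)."""


dataset_paths = {
    "train_data": "/path/to/train/data",
    "test_data": "/path/to/test/data",
}

DATASETS = {
    "my_data_encode": {
        "transforms": EncodeTransformsStub,
        # For inversion, source and target point at the same images.
        "train_source_root": dataset_paths["train_data"],
        "train_target_root": dataset_paths["train_data"],
        "test_source_root": dataset_paths["test_data"],
        "test_target_root": dataset_paths["test_data"],
    },
}

missing = missing_dataset_keys(DATASETS["my_data_encode"])
print(missing)
```

An empty list means every expected key is present; the check says nothing about whether the paths themselves exist.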

Preparing your Generator

In this work, we use rosinality's StyleGAN2 implementation. If you wish to use your own generator trained using NVIDIA's implementation, there are a few options we recommend:

  1. Using NVIDIA's StyleGAN2 / StyleGAN-ADA TensorFlow implementation.
    You can then convert the TensorFlow .pkl checkpoints to the supported format using the conversion script found in rosinality's implementation.
  2. Using NVIDIA's StyleGAN-ADA PyTorch implementation.
    You can then convert the PyTorch .pkl checkpoints to the supported format using the conversion script created by Justin Pinkney found in dvschultz's fork.

Once you have the converted .pt files, you should be ready to use them in this repository.

Training ReStyle

The main training scripts can be found in scripts/ and scripts/. Each of the two scripts runs ReStyle applied over the corresponding base inversion method.
Intermediate training results are saved to opts.exp_dir. This includes checkpoints, train outputs, and test outputs.
Additionally, if you have tensorboard installed, you can visualize tensorboard logs in opts.exp_dir/logs.

We currently support applying ReStyle on the pSp encoder from Richardson et al. [2020] and the e4e encoder from Tov et al. [2021].

Training ReStyle with the settings used in the paper can be done by running the following commands.

  • ReStyle applied over pSp:
python scripts/ \
--dataset_type=ffhq_encode \
--encoder_type=BackboneEncoder \
--exp_dir=experiment/restyle_psp_ffhq_encode \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--val_interval=5000 \
--save_interval=10000 \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--w_norm_lambda=0 \
--id_lambda=0.1 \
--input_nc=6 \
--n_iters_per_batch=5 \
--output_size=1024 \
  • ReStyle applied over e4e:
python scripts/ \
--dataset_type ffhq_encode \
--encoder_type ProgressiveBackboneEncoder \
--exp_dir=experiment/restyle_e4e_ffhq_encode \
--workers=8 \
--batch_size=8 \
--test_batch_size=8 \
--test_workers=8 \
--start_from_latent_avg \
--lpips_lambda=0.8 \
--l2_lambda=1 \
--delta_norm_lambda 0.0002 \
--id_lambda 0.1 \
--use_w_pool \
--w_discriminator_lambda 0.1 \
--progressive_start 20000 \
--progressive_step_every 2000 \
--input_nc 6 \
--n_iters_per_batch=5 \
--output_size 1024 \
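The `*_lambda` flags in the commands above act as weights on the individual loss terms. As a minimal sketch, assuming the total objective is a plain weighted sum of the terms, the pSp weights combine as follows. The helper `total_loss` and the per-term values are illustrative and are not taken from the repository.

```python
from typing import Dict


def total_loss(terms: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of loss terms; terms without a weight contribute nothing."""
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())


# Weights matching the pSp command above.
psp_weights = {"l2": 1.0, "lpips": 0.8, "id": 0.1, "w_norm": 0.0}

# Illustrative per-term values (made up for this example).
example_terms = {"l2": 0.30, "lpips": 0.50, "id": 0.90, "w_norm": 0.20}

loss = total_loss(example_terms, psp_weights)
print(round(loss, 4))
```

Setting a weight to 0, as with `--w_norm_lambda=0`, removes that term's contribution entirely.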

Additional Notes:

  • Encoder backbones:
    • For the human facial domain (ffhq_encode), we use an IRSE-50 backbone using the flags:
      • --encoder_type=BackboneEncoder for pSp
      • --encoder_type=ProgressiveBackboneEncoder for e4e
    • For all other domains, we use a ResNet34 encoder backbone using the flags:
      • --encoder_type=ResNetBackboneEncoder for pSp
      • --encoder_type=ResNetProgressiveBackboneEncoder for e4e
  • ID/similarity losses:
    • For the human facial domain we also use a specialized ID loss which is set using the flag --id_lambda=0.1.
    • For all other domains, please set --id_lambda=0 and --moco_lambda=0.5 to use the MoCo-based similarity loss from Tov et al.
      • Note, you cannot set both id_lambda and moco_lambda to be active simultaneously.
  • You should also adjust the --output_size and --stylegan_weights flags according to your StyleGAN generator.
  • See options/ and options/ for all training-specific flags.
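The backbone and similarity-loss rules in the notes above can be captured in a small helper. This is a hedged sketch rather than code from the repository: `encoder_type_for` and `check_similarity_lambdas` are hypothetical names, and only the flag values themselves come from the notes.

```python
def encoder_type_for(dataset_type: str, base: str) -> str:
    """Pick the --encoder_type value for a dataset and base method ('psp' or 'e4e')."""
    is_face = dataset_type == "ffhq_encode"   # human facial domain uses IR-SE50
    if base == "psp":
        return "BackboneEncoder" if is_face else "ResNetBackboneEncoder"
    if base == "e4e":
        return "ProgressiveBackboneEncoder" if is_face else "ResNetProgressiveBackboneEncoder"
    raise ValueError(f"unknown base method: {base!r}")


def check_similarity_lambdas(id_lambda: float, moco_lambda: float) -> None:
    """Raise if both the ID loss and the MoCo-based loss are active."""
    if id_lambda > 0 and moco_lambda > 0:
        raise ValueError("id_lambda and moco_lambda cannot both be active")


# Facial domain: ID loss only.
check_similarity_lambdas(id_lambda=0.1, moco_lambda=0.0)
face_encoder = encoder_type_for("ffhq_encode", "e4e")

# Any other domain: MoCo-based loss only.
check_similarity_lambdas(id_lambda=0.0, moco_lambda=0.5)
other_encoder = encoder_type_for("my_data_encode", "psp")

print(face_encoder, other_encoder)
```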

Inference Notebook

To help visualize the results of ReStyle we provide a Jupyter notebook found in notebooks/inference_playground.ipynb.
The notebook will download the pretrained models and run inference on the images found in notebooks/images or on images of your choosing. It is recommended to run this in Google Colab.



You can use scripts/ to apply a trained model on a set of images:

python scripts/ \
--exp_dir=/path/to/experiment \
--checkpoint_path=experiment/checkpoints/ \
--data_path=/path/to/test_data \
--test_batch_size=4 \
--test_workers=4 \

This script will save each step's outputs in a separate sub-directory (e.g., the outputs of step i will be saved in /path/to/experiment/inference_results/i).


  • By default, the images will be saved at their original output resolutions (e.g., 1024x1024 for faces, 512x384 for cars). If you wish to save outputs resized to resolutions of 256x256 (or 256x192 for cars), you can do so by adding the flag --resize_outputs.
  • This script will also save all the latents as an .npy file in a dictionary format as follows:
    {
        "0.jpg": [latent_step_1, latent_step_2, ..., latent_step_N],
        "1.jpg": [latent_step_1, latent_step_2, ..., latent_step_N],
        ...
    }

That is, the keys of the dictionary are the image file names and the values are lists of length N containing the output latent of each step where N is the number of inference steps. Each element in the list is of shape (Kx512) where K is the number of style inputs of the generator.

You can use the saved latents to perform latent space manipulations, for example.
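As a rough sketch of working with the saved latents, the example below writes a dictionary in the format described above to a temporary file, loads it back, and shifts the final-step latent along a direction. The file name `latents.npy`, the value K = 18, and the random `direction` are assumptions for illustration; a real edit would use a meaningful semantic direction and the file produced by the inference script.

```python
import os
import tempfile

import numpy as np

N_STEPS, K, DIM = 5, 18, 512   # K = 18 is an assumed number of style inputs
rng = np.random.default_rng(0)

# Stand-in for the script's output: image name -> list of per-step latents.
latents = {
    name: [rng.standard_normal((K, DIM)).astype(np.float32) for _ in range(N_STEPS)]
    for name in ("0.jpg", "1.jpg")
}
path = os.path.join(tempfile.mkdtemp(), "latents.npy")
np.save(path, latents)   # a dict is stored as a pickled 0-d object array

# Loading a pickled dict requires allow_pickle=True; .item() unwraps the 0-d array.
loaded = np.load(path, allow_pickle=True).item()

final = loaded["0.jpg"][-1]   # latent after the last refinement step
direction = rng.standard_normal((K, DIM)).astype(np.float32)   # placeholder direction
edited = final + 3.0 * direction   # shifted latent to pass to the generator

print(final.shape, edited.shape)
```

Only load `.npy` files with `allow_pickle=True` when you trust their source, since unpickling can execute arbitrary code.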

Step-by-Step Inference

Visualizing the intermediate outputs. Here, the intermediate outputs are saved from left to right with the input image shown on the right-hand side.

  Question about training restyle on a different dataset

  Error when loading converted ada-pytorch model

  The model collapse when use moco loss

  Improving toonification result

  failed to run colab notebook

  Is the training weight valid only for the training set?

  Inference on trained model does not reflect test or train output during model training

  Generate the image according to the latent code

  Trains for infinite time duration regardless of `max_steps`

    148.3s | 697 | Total Trainable Params: 235955852
    154.1s | 698 | Downloading: "" to /root/.cache/torch/hub/checkpoints/alexnet-owt-7be5be79.pth
    162.8s | 699 | 100%|████████████████████████████████████████| 233M/233M [00:08<00:00, 29.7MB/s]
    163.0s | 700 | Downloading: "" to /root/.cache/torch/hub/checkpoints/alex.pth
    163.3s | 701 | 100%|██████████████████████████████████████| 5.87k/5.87k [00:00<00:00, 4.88MB/s]
    163.3s | 702 | Loading ResNet ArcFace
    163.9s | 703 | Loading dataset for my_ffhq_encode
    311.9s | 704 | Number of training samples: 70000
    311.9s | 705 | Number of test samples: 70000
    316.0s | 706 | Changed progressive stage to:  ProgressiveStage.WTraining
    334.0s | 707 | ./training/ UserWarning: This overload of addcmul_ is deprecated:
    334.0s | 708 | addcmul_(Number value, Tensor tensor1, Tensor tensor2)
    334.0s | 709 | Consider using one of the following signatures instead:
    334.0s | 710 | addcmul_(Tensor tensor1, Tensor tensor2, *, Number value) (Triggered internally at  /usr/local/src/pytorch/torch/csrc/utils/python_arg_parser.cpp:1025.)
    334.0s | 711 | exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
    335.1s | 712 | Metrics for train, step 0
    335.1s | 713 | d_real_loss =  0.6707150936126709
    335.1s | 714 | d_fake_loss =  0.7088415622711182
    335.1s | 715 | discriminator_loss =  1.379556655883789
    335.1s | 716 | discriminator_r1_loss =  0.11644786596298218
    335.1s | 717 | encoder_discriminator_loss =  0.6753013730049133
    335.1s | 718 | total_delta_loss =  0.0
    335.1s | 719 | loss_id =  0.9233516454696655
    335.1s | 720 | id_improve =  -0.9233517416287214
    335.1s | 721 | loss_l2 =  0.29226595163345337
    335.1s | 722 | loss_lpips =  0.4791472554206848
    335.1s | 723 | loss =  0.8354490995407104
    1150.6s | 724 | Metrics for train, step 50
    1150.6s | 725 | d_real_loss =  0.5728033781051636
    1150.6s | 726 | d_fake_loss =  0.7072099447250366
    1150.6s | 727 | discriminator_loss =  1.2800133228302002
    1150.6s | 728 | encoder_discriminator_loss =  0.6189665794372559
    1150.6s | 729 | total_delta_loss =  0.0
    1150.6s | 730 | loss_id =  1.0349425077438354
    1150.6s | 731 | id_improve =  -1.0349424059968442
    1150.6s | 732 | loss_l2 =  0.3499150276184082
    1150.6s | 733 | loss_lpips =  0.5180467963218689
    1150.6s | 734 | loss =  0.9297434091567993
    1920.9s | 735 | Metrics for train, step 100
    1920.9s | 736 | d_real_loss =  0.46433788537979126
    1920.9s | 737 | d_fake_loss =  0.7012063264846802
    1920.9s | 738 | discriminator_loss =  1.1655442714691162
    1920.9s | 739 | encoder_discriminator_loss =  0.5394962430000305
    1920.9s | 740 | total_delta_loss =  0.0
    1920.9s | 741 | loss_id =  0.9708028435707092
    1920.9s | 742 | id_improve =  -0.9708028591703624
    1920.9s | 743 | loss_l2 =  0.27206945419311523
    1920.9s | 744 | loss_lpips =  0.4531230926513672
    1920.9s | 745 | loss =  0.7855978608131409
    opened by RahulBhalley 7
  Image-to-Image Translation using ReStyle

    Hey! First of all, great work! I just wanted to ask whether there is any documentation for trying the ReStyle encoder for image-to-image translation. I am working on generating real images from sketches in a non-facial domain and have already tried the vanilla pSp image-to-image translation pipeline. I saw your comment here saying that the ReStyle encoder is better suited for non-facial domains, so I wanted to try it. I have already tried setting the source and the target in the data config to the folders of sketches and real images, respectively. It generates images as attached below. Shouldn't the input here be a sketch image? Thanks in advance!

    opened by abhisheklalwani 7
  The visual change/delta at the last iteration is larger than the preceding steps

    I observed that in many images the change between the last image and the one before it is significant (sometimes better, sometimes worse, as far as the human eye can tell). This is for the pretrained FFHQ ReStyle + pSp model. The two examples below are from CelebA-HQ (10.jpg and 1000.jpg). Why would this be the case? (I am using 20 iterations below.)


    opened by yaseryacoob 7
  how to express sadness

    Hello! Thank you for your research. I know little about GANs, so I have a question: I want to produce more facial expressions, such as sadness. How can I do this? Thank you for your help.

    opened by wudidecc 0
