StyleGAN2-ADA - Official PyTorch implementation

Overview

Training Generative Adversarial Networks with Limited Data
Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila
https://arxiv.org/abs/2006.06676

Abstract: Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.

For business inquiries, please contact [email protected]
For press and other inquiries, please contact Hector Marinez at [email protected]

Release notes

This repository is a faithful reimplementation of StyleGAN2-ADA in PyTorch, focusing on correctness, performance, and compatibility.

Correctness

  • Full support for all primary training configurations.
  • Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
  • Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

Performance

  • Training is typically 5%–30% faster compared to the TensorFlow version on NVIDIA Tesla V100 GPUs.
  • Inference is up to 35% faster in high resolutions, but it may be slightly slower in low resolutions.
  • GPU memory usage is comparable to the TensorFlow version.
  • Faster startup time when training new networks (<50s), and also when using pre-trained networks (<4s).
  • New command line options for tweaking the training performance.

Compatibility

  • Compatible with old network pickles created using the TensorFlow version.
  • New ZIP/PNG based dataset format for maximal interoperability with existing 3rd party tools.
  • TFRecords datasets are no longer supported — they need to be converted to the new format.
  • New JSON-based format for logs, metrics, and training curves.
  • Training curves are also exported in the old TFEvents format if TensorBoard is installed.
  • Command line syntax is mostly unchanged, with a few exceptions (e.g., dataset_tool.py).
  • Comparison methods are not supported (--cmethod, --dcap, --cfg=cifarbaseline, --aug=adarv).
  • Truncation is now disabled by default.

Data repository

Path Description
stylegan2-ada-pytorch Main directory hosted on Amazon S3
  ├  ada-paper.pdf Paper PDF
  ├  images Curated example images produced using the pre-trained models
  ├  videos Curated example interpolation videos
  └  pretrained Pre-trained models
    ├  ffhq.pkl FFHQ at 1024x1024, trained using original StyleGAN2
    ├  metfaces.pkl MetFaces at 1024x1024, transfer learning from FFHQ using ADA
    ├  afhqcat.pkl AFHQ Cat at 512x512, trained from scratch using ADA
    ├  afhqdog.pkl AFHQ Dog at 512x512, trained from scratch using ADA
    ├  afhqwild.pkl AFHQ Wild at 512x512, trained from scratch using ADA
    ├  cifar10.pkl Class-conditional CIFAR-10 at 32x32
    ├  brecahad.pkl BreCaHAD at 512x512, trained from scratch using ADA
    ├  paper-fig7c-training-set-sweeps Models used in Fig.7c (sweep over training set size)
    ├  paper-fig11a-small-datasets Models used in Fig.11a (small datasets & transfer learning)
    ├  paper-fig11b-cifar10 Models used in Fig.11b (CIFAR-10)
    ├  transfer-learning-source-nets Models used as starting point for transfer learning
    └  metrics Feature detectors used by the quality metrics

Requirements

  • Linux and Windows are supported, but we recommend Linux for performance and compatibility reasons.
  • 1–8 high-end NVIDIA GPUs with at least 12 GB of memory. We have done all testing and development using NVIDIA DGX-1 with 8 Tesla V100 GPUs.
  • 64-bit Python 3.7, PyTorch 1.7.1, and CUDA toolkit 11.0 or newer. Use CUDA toolkit 11.1 or later with RTX 3090. See https://pytorch.org/ for PyTorch install instructions.
  • Python libraries: pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3. We use the Anaconda3 2020.11 distribution which installs most of these by default.
  • Docker users: use the provided Dockerfile to build an image with the required library dependencies.

The code relies heavily on custom PyTorch extensions that are compiled on the fly using NVCC. On Windows, the compilation requires Microsoft Visual Studio. We recommend installing Visual Studio Community Edition and adding it into PATH using "C:\Program Files (x86)\Microsoft Visual Studio\\Community\VC\Auxiliary\Build\vcvars64.bat".

Getting started

Pre-trained networks are stored as *.pkl files that can be referenced using local filenames or URLs:

# Generate curated MetFaces images without truncation (Fig.10 left)
python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

# Generate uncurated MetFaces images with truncation (Fig.12 upper left)
python generate.py --outdir=out --trunc=0.7 --seeds=600-605 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

# Generate class conditional CIFAR-10 images (Fig.17 left, Car)
python generate.py --outdir=out --seeds=0-35 --class=1 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/cifar10.pkl

# Style mixing example
python style_mixing.py --outdir=out --rows=85,100,75,458,1500 --cols=55,821,1789,293 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

Outputs from the above commands are placed under out/*.png, controlled by --outdir. Downloaded network pickles are cached under $HOME/.cache/dnnlib, which can be overridden by setting the DNNLIB_CACHE_DIR environment variable. The default PyTorch extension build directory is $HOME/.cache/torch_extensions, which can be overridden by setting TORCH_EXTENSIONS_DIR.
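
The override rules in the paragraph above can be sketched in a few lines of Python. This only mirrors the description given here; resolve_cache_dirs is a hypothetical helper, and the actual lookup inside dnnlib and PyTorch may consult other variables or platform-specific defaults.

```python
import os
from pathlib import PurePosixPath


def resolve_cache_dirs(env):
    """Return (network pickle cache dir, extension build dir) for a mapping of env vars.

    Hypothetical helper: it follows the README description only, not the library code.
    """
    home = PurePosixPath(env.get("HOME", "~"))
    # An explicit override wins; otherwise fall back to the documented default.
    dnnlib_dir = env.get("DNNLIB_CACHE_DIR") or str(home / ".cache" / "dnnlib")
    ext_dir = env.get("TORCH_EXTENSIONS_DIR") or str(home / ".cache" / "torch_extensions")
    return dnnlib_dir, ext_dir


if __name__ == "__main__":
    print(resolve_cache_dirs(os.environ))
```

Setting both variables to a fast local disk before launching generate.py or train.py keeps downloads and compiled extensions out of a shared or quota-limited home directory.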

Docker: You can run the above curated image example using Docker as follows:

docker build --tag sg2ada:latest .
./docker_run.sh python3 generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

Note: The Docker image requires NVIDIA driver release r455.23 or later.

Legacy networks: The above commands can load most of the network pickles created using the previous TensorFlow versions of StyleGAN2 and StyleGAN2-ADA. However, for future compatibility, we recommend converting such legacy pickles into the new format used by the PyTorch version:

python legacy.py \
    --source=https://nvlabs-fi-cdn.nvidia.com/stylegan2/networks/stylegan2-cat-config-f.pkl \
    --dest=stylegan2-cat-config-f.pkl

Projecting images to latent space

To find the matching latent vector for a given image file, run:

python projector.py --outdir=out --target=~/mytargetimg.png \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

For optimal results, the target image should be cropped and aligned similar to the FFHQ dataset. The above command saves the projection target out/target.png, result out/proj.png, latent vector out/projected_w.npz, and progression video out/proj.mp4. You can render the resulting latent vector by specifying --projected_w for generate.py:

python generate.py --outdir=out --projected_w=out/projected_w.npz \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

Using networks from Python

You can use pre-trained networks in your own Python code as follows:

import pickle
import torch

with open('ffhq.pkl', 'rb') as f:
    G = pickle.load(f)['G_ema'].cuda()  # torch.nn.Module
z = torch.randn([1, G.z_dim]).cuda()    # latent codes
c = None                                # class labels (not used in this example)
img = G(z, c)                           # NCHW, float32, dynamic range [-1, +1]

The above code requires torch_utils and dnnlib to be accessible via PYTHONPATH. It does not need source code for the networks themselves — their class definitions are loaded from the pickle via torch_utils.persistence.

The pickle contains three networks. 'G' and 'D' are instantaneous snapshots taken during training, and 'G_ema' represents a moving average of the generator weights over several training steps. The networks are regular instances of torch.nn.Module, with all of their parameters and buffers placed on the CPU at import and gradient computation disabled by default.

The generator consists of two submodules, G.mapping and G.synthesis, that can be executed separately. They also support various additional options:

w = G.mapping(z, c, truncation_psi=0.5, truncation_cutoff=8)
img = G.synthesis(w, noise_mode='const', force_fp32=True)
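
For intuition about truncation_psi and truncation_cutoff in the call above, here is a dependency-free sketch of the usual truncation trick: each per-layer latent is pulled toward an average latent by a factor psi, and only the first cutoff layers are affected. It assumes that this is how the options behave; truncate and its plain-list arguments are hypothetical stand-ins for the tensor code in the repository.

```python
def truncate(ws, w_avg, psi=1.0, cutoff=None):
    """Interpolate per-layer latents toward w_avg (illustrative sketch, not the repo code).

    ws:     list of per-layer latent vectors, each a list of floats
    w_avg:  average latent vector, a list of floats
    psi:    1.0 leaves ws unchanged, 0.0 collapses the affected layers to w_avg
    cutoff: number of leading layers to truncate; None means all layers
    """
    n = len(ws) if cutoff is None else min(cutoff, len(ws))
    out = []
    for i, w in enumerate(ws):
        if i < n:
            # Linear interpolation: w_avg + psi * (w - w_avg)
            out.append([a + psi * (x - a) for x, a in zip(w, w_avg)])
        else:
            out.append(list(w))
    return out
```

With psi=1 the function is a no-op, which matches the use of --trunc=1 for the "without truncation" example earlier in this document.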

Please refer to generate.py, style_mixing.py, and projector.py for further examples.

Preparing datasets

Datasets are stored as uncompressed ZIP archives containing uncompressed PNG files and a metadata file dataset.json for labels.
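
A quick way to sanity-check an archive against this description is to inspect it with the standard zipfile module. The sketch below counts PNG members, checks that they are stored without compression, and reads dataset.json if present. The layout assumed for dataset.json (a "labels" list of [filename, label] pairs) and both helper names are assumptions for illustration; compare with the output of dataset_tool.py before relying on them.

```python
import io
import json
import zipfile


def inspect_dataset_zip(fileobj):
    """Summarize a dataset archive given a path or a binary file object."""
    with zipfile.ZipFile(fileobj) as zf:
        pngs = [i for i in zf.infolist() if i.filename.lower().endswith(".png")]
        # ZIP_STORED means the member is stored without compression.
        all_stored = all(i.compress_type == zipfile.ZIP_STORED for i in pngs)
        labels = None
        if "dataset.json" in zf.namelist():
            # Assumed layout: {"labels": [[filename, label], ...]}
            labels = json.loads(zf.read("dataset.json")).get("labels")
    return {"num_images": len(pngs), "all_stored": all_stored, "labels": labels}


def make_example_zip():
    """Build a tiny in-memory archive with placeholder bytes instead of real PNG data."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("00000/img00000000.png", b"placeholder, not a real PNG")
        zf.writestr("dataset.json", json.dumps({"labels": [["00000/img00000000.png", 3]]}))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(inspect_dataset_zip(make_example_zip()))
```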

Custom datasets can be created from a folder containing images; see python dataset_tool.py --help for more information. Alternatively, the folder can also be used directly as a dataset, without running it through dataset_tool.py first, but doing so may lead to suboptimal performance.

Legacy TFRecords datasets are not supported — see below for instructions on how to convert them.

FFHQ:

Step 1: Download the Flickr-Faces-HQ dataset as TFRecords.

Step 2: Extract images from TFRecords using dataset_tool.py from the TensorFlow version of StyleGAN2-ADA:

# Using dataset_tool.py from TensorFlow version at
# https://github.com/NVlabs/stylegan2-ada/
python ../stylegan2-ada/dataset_tool.py unpack \
    --tfrecord_dir=~/ffhq-dataset/tfrecords/ffhq --output_dir=/tmp/ffhq-unpacked

Step 3: Create ZIP archive using dataset_tool.py from this repository:

# Original 1024x1024 resolution.
python dataset_tool.py --source=/tmp/ffhq-unpacked --dest=~/datasets/ffhq.zip

# Scaled down 256x256 resolution.
python dataset_tool.py --source=/tmp/ffhq-unpacked --dest=~/datasets/ffhq256x256.zip \
    --width=256 --height=256

MetFaces: Download the MetFaces dataset and create ZIP archive:

python dataset_tool.py --source=~/downloads/metfaces/images --dest=~/datasets/metfaces.zip

AFHQ: Download the AFHQ dataset and create ZIP archive:

python dataset_tool.py --source=~/downloads/afhq/train/cat --dest=~/datasets/afhqcat.zip
python dataset_tool.py --source=~/downloads/afhq/train/dog --dest=~/datasets/afhqdog.zip
python dataset_tool.py --source=~/downloads/afhq/train/wild --dest=~/datasets/afhqwild.zip

CIFAR-10: Download the CIFAR-10 python version and convert to ZIP archive:

python dataset_tool.py --source=~/downloads/cifar-10-python.tar.gz --dest=~/datasets/cifar10.zip

LSUN: Download the desired categories from the LSUN project page and convert to ZIP archive:

python dataset_tool.py --source=~/downloads/lsun/raw/cat_lmdb --dest=~/datasets/lsuncat200k.zip \
    --transform=center-crop --width=256 --height=256 --max_images=200000

python dataset_tool.py --source=~/downloads/lsun/raw/car_lmdb --dest=~/datasets/lsuncar200k.zip \
    --transform=center-crop-wide --width=512 --height=384 --max_images=200000

BreCaHAD:

Step 1: Download the BreCaHAD dataset.

Step 2: Extract 512x512 resolution crops using dataset_tool.py from the TensorFlow version of StyleGAN2-ADA:

# Using dataset_tool.py from TensorFlow version at
# https://github.com/NVlabs/stylegan2-ada/
python ../stylegan2-ada/dataset_tool.py extract_brecahad_crops --cropsize=512 \
    --output_dir=/tmp/brecahad-crops --brecahad_dir=~/downloads/brecahad/images

Step 3: Create ZIP archive using dataset_tool.py from this repository:

python dataset_tool.py --source=/tmp/brecahad-crops --dest=~/datasets/brecahad.zip

Training new networks

In its most basic form, training new networks boils down to:

python train.py --outdir=~/training-runs --data=~/mydataset.zip --gpus=1 --dry-run
python train.py --outdir=~/training-runs --data=~/mydataset.zip --gpus=1

The first command is optional; it validates the arguments, prints out the training configuration, and exits. The second command kicks off the actual training.

In this example, the results are saved to a newly created directory ~/training-runs/-mydataset-auto1, controlled by --outdir. The training exports network pickles (network-snapshot-.pkl) and example images (fakes.png) at regular intervals (controlled by --snap). For each pickle, it also evaluates FID (controlled by --metrics) and logs the resulting scores in metric-fid50k_full.jsonl (as well as TFEvents if TensorBoard is installed).

The name of the output directory reflects the training configuration. For example, 00000-mydataset-auto1 indicates that the base configuration was auto1, meaning that the hyperparameters were selected automatically for training on one GPU. The base configuration is controlled by --cfg:

Base config Description
auto (default) Automatically select reasonable defaults based on resolution and GPU count. Serves as a good starting point for new datasets but does not necessarily lead to optimal results.
stylegan2 Reproduce results for StyleGAN2 config F at 1024x1024 using 1, 2, 4, or 8 GPUs.
paper256 Reproduce results for FFHQ and LSUN Cat at 256x256 using 1, 2, 4, or 8 GPUs.
paper512 Reproduce results for BreCaHAD and AFHQ at 512x512 using 1, 2, 4, or 8 GPUs.
paper1024 Reproduce results for MetFaces at 1024x1024 using 1, 2, 4, or 8 GPUs.
cifar Reproduce results for CIFAR-10 (tuned configuration) using 1 or 2 GPUs.

The training configuration can be further customized with additional command line options:

  • --aug=noaug disables ADA.
  • --cond=1 enables class-conditional training (requires a dataset with labels).
  • --mirror=1 amplifies the dataset with x-flips. Often beneficial, even with ADA.
  • --resume=ffhq1024 --snap=10 performs transfer learning from FFHQ trained at 1024x1024.
  • --resume=~/training-runs//network-snapshot-.pkl resumes a previous training run.
  • --gamma=10 overrides R1 gamma. We recommend trying a couple of different values for each new dataset.
  • --aug=ada --target=0.7 adjusts ADA target value (default: 0.6).
  • --augpipe=blit enables pixel blitting but disables all other augmentations.
  • --augpipe=bgcfnc enables all available augmentations (blit, geom, color, filter, noise, cutout).

Please refer to python train.py --help for the full list.
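
To make the role of the ADA target (--target) more concrete, the following sketch shows one plausible form of the adaptive update: the augmentation probability p is nudged up when an overfitting indicator exceeds the target and down otherwise, and it never drops below zero. The function name, the indicator argument rt, and the default step parameters (an adjustment every 4 batches, 500 kimg for p to move by one unit) are assumptions for illustration and should be checked against the training code.

```python
def ada_step(p, rt, target=0.6, batch_size=64, interval=4, ada_kimg=500):
    """Return the updated augmentation probability (illustrative sketch).

    p:      current augmentation probability
    rt:     overfitting indicator measured over the last `interval` batches
    target: desired value of the indicator (the --target option)
    """
    # Direction of the adjustment: +1, 0, or -1.
    sign = (rt > target) - (rt < target)
    # Step size chosen so that p can change by 1.0 over ada_kimg thousand images.
    adjust = sign * (batch_size * interval) / (ada_kimg * 1000)
    return max(p + adjust, 0.0)
```

Under these assumptions, a higher target tolerates more discriminator overfitting before the augmentation strength is increased, so p settles at a lower value.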

Expected training time

The total training time depends heavily on resolution, number of GPUs, dataset, desired quality, and hyperparameters. The following table lists expected wallclock times to reach different points in the training, measured in thousands of real images shown to the discriminator ("kimg"):

Resolution GPUs 1000 kimg 25000 kimg sec/kimg GPU mem CPU mem
128x128 1 4h 05m 4d 06h 12.8–13.7 7.2 GB 3.9 GB
128x128 2 2h 06m 2d 04h 6.5–6.8 7.4 GB 7.9 GB
128x128 4 1h 20m 1d 09h 4.1–4.6 4.2 GB 16.3 GB
128x128 8 1h 13m 1d 06h 3.9–4.9 2.6 GB 31.9 GB
256x256 1 6h 36m 6d 21h 21.6–24.2 5.0 GB 4.5 GB
256x256 2 3h 27m 3d 14h 11.2–11.8 5.2 GB 9.0 GB
256x256 4 1h 45m 1d 20h 5.6–5.9 5.2 GB 17.8 GB
256x256 8 1h 24m 1d 11h 4.4–5.5 3.2 GB 34.7 GB
512x512 1 21h 03m 21d 22h 72.5–74.9 7.6 GB 5.0 GB
512x512 2 10h 59m 11d 10h 37.7–40.0 7.8 GB 9.8 GB
512x512 4 5h 29m 5d 17h 18.7–19.1 7.9 GB 17.7 GB
512x512 8 2h 48m 2d 22h 9.5–9.7 7.8 GB 38.2 GB
1024x1024 1 1d 20h 46d 03h 154.3–161.6 8.1 GB 5.3 GB
1024x1024 2 23h 09m 24d 02h 80.6–86.2 8.6 GB 11.9 GB
1024x1024 4 11h 36m 12d 02h 40.1–40.8 8.4 GB 21.9 GB
1024x1024 8 5h 54m 6d 03h 20.2–20.6 8.3 GB 44.7 GB

The above measurements were done using NVIDIA Tesla V100 GPUs with default settings (--cfg=auto --aug=ada --metrics=fid50k_full). "sec/kimg" shows the expected range of variation in raw training performance, as reported in log.txt. "GPU mem" and "CPU mem" show the highest observed memory consumption, excluding the peak at the beginning caused by torch.backends.cudnn.benchmark.
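
The sec/kimg column can be turned into a rough estimate for any training length. The sketch below multiplies the rate by the number of kimg and formats the result in the same style as the table; both helper names are hypothetical. Note that it covers raw training only: for 128x128 on one GPU, 12.8 sec/kimg over 1000 kimg gives about 3h 33m, while the table lists 4h 05m, presumably because the wallclock figures also include work outside raw training such as metric evaluation.

```python
def estimate_seconds(sec_per_kimg, kimg):
    """Raw training time in seconds for a given rate and training length."""
    return sec_per_kimg * kimg


def format_duration(seconds):
    """Format seconds as 'Xh YYm' below one day, 'Xd YYh' otherwise."""
    minutes = int(seconds // 60)
    days, rem = divmod(minutes, 24 * 60)
    hours, mins = divmod(rem, 60)
    if days:
        return f"{days}d {hours:02d}h"
    return f"{hours}h {mins:02d}m"


if __name__ == "__main__":
    print(format_duration(estimate_seconds(12.8, 1000)))
```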

In typical cases, 25000 kimg or more is needed to reach convergence, but the results are already quite reasonable around 5000 kimg. 1000 kimg is often enough for transfer learning, which tends to converge significantly faster. The following figure shows example convergence curves for different datasets as a function of wallclock time, using the same settings as above:

Training curves

Note: --cfg=auto serves as a reasonable first guess for the hyperparameters but it does not necessarily lead to optimal results for a given dataset. For example, --cfg=stylegan2 yields considerably better FID for FFHQ-140k at 1024x1024 than illustrated above. We recommend trying out at least a few different values of --gamma for each new dataset.
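
To give a feel for what automatic selection can look like, the sketch below derives a few hyperparameters from the resolution and GPU count. The formulas are an assumption about the heuristics, and the function name and returned keys are hypothetical. They do reproduce the training options printed in the issue log near the end of this page (batch_size 8, r1_gamma 6.5536, ema_kimg 2.5, lr 0.0025 for 512x512 on one GPU), but train.py remains the authority.

```python
def auto_config(resolution, gpus):
    """Guess --cfg=auto style hyperparameters (illustrative sketch, not the repo code)."""
    # Total batch size: shrink with resolution, cap at 64, at least one image per GPU.
    batch = max(min(gpus * min(4096 // resolution, 32), 64), gpus)
    return {
        "batch_size": batch,
        # Learning rate: slightly lower at the highest resolution.
        "lr": 0.002 if resolution >= 1024 else 0.0025,
        # R1 gamma: grows with pixel count, shrinks with batch size.
        "r1_gamma": 0.0002 * resolution ** 2 / batch,
        # EMA half-life in kimg: proportional to batch size.
        "ema_kimg": batch * 10 / 32,
    }


if __name__ == "__main__":
    print(auto_config(512, 1))
```

Because gamma depends on both resolution and batch size here, changing the GPU count also changes the regularization strength, which is one more reason to try a few explicit --gamma values.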

Quality metrics

By default, train.py automatically computes FID for each network pickle exported during training. We recommend inspecting metric-fid50k_full.jsonl (or TensorBoard) at regular intervals to monitor the training progress. When desired, the automatic computation can be disabled with --metrics=none to speed up the training slightly (3%–9%).
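
Monitoring the JSONL file can be scripted with the standard json module. The sketch below scans the lines and returns the lowest score together with the snapshot it belongs to. It assumes that each line is a JSON object with a "results" mapping keyed by metric name and a "snapshot_pkl" field; these field names are assumptions, so check them against a line from your own file. best_snapshot is a hypothetical helper.

```python
import json


def best_snapshot(lines, metric="fid50k_full"):
    """Return (score, snapshot name) with the lowest score, or None if nothing was found."""
    best = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed layout: {"results": {"<metric>": <float>}, "snapshot_pkl": "<file>", ...}
        score = record.get("results", {}).get(metric)
        if score is None:
            continue
        if best is None or score < best[0]:
            best = (score, record.get("snapshot_pkl"))
    return best
```

Since an open text file iterates line by line, the function can be called as best_snapshot(open(path)) on metric-fid50k_full.jsonl in the run directory.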

Additional quality metrics can also be computed after the training:

# Previous training run: look up options automatically, save result to JSONL file.
python calc_metrics.py --metrics=pr50k3_full \
    --network=~/training-runs/00000-ffhq10k-res64-auto1/network-snapshot-000000.pkl

# Pre-trained network pickle: specify dataset explicitly, print result to stdout.
python calc_metrics.py --metrics=fid50k_full --data=~/datasets/ffhq.zip --mirror=1 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

The first example looks up the training configuration and performs the same operation as if --metrics=pr50k3_full had been specified during training. The second example downloads a pre-trained network pickle, in which case the values of --mirror and --data must be specified explicitly.

Note that many of the metrics have a significant one-off cost when calculating them for the first time for a new dataset (up to 30min). Also note that the evaluation is done using a different random seed each time, so the results will vary if the same metric is computed multiple times.

We employ the following metrics in the ADA paper. Execution time and GPU memory usage are reported for one NVIDIA Tesla V100 GPU at 1024x1024 resolution:

Metric Time GPU mem Description
fid50k_full 13 min 1.8 GB Fréchet inception distance[1] against the full dataset
kid50k_full 13 min 1.8 GB Kernel inception distance[2] against the full dataset
pr50k3_full 13 min 4.1 GB Precision and recall[3] against the full dataset
is50k 13 min 1.8 GB Inception score[4] for CIFAR-10

In addition, the following metrics from the StyleGAN and StyleGAN2 papers are also supported:

Metric Time GPU mem Description
fid50k 13 min 1.8 GB Fréchet inception distance against 50k real images
kid50k 13 min 1.8 GB Kernel inception distance against 50k real images
pr50k3 13 min 4.1 GB Precision and recall against 50k real images
ppl2_wend 36 min 2.4 GB Perceptual path length[5] in W, endpoints, full image
ppl_zfull 36 min 2.4 GB Perceptual path length in Z, full paths, cropped image
ppl_wfull 36 min 2.4 GB Perceptual path length in W, full paths, cropped image
ppl_zend 36 min 2.4 GB Perceptual path length in Z, endpoints, cropped image
ppl_wend 36 min 2.4 GB Perceptual path length in W, endpoints, cropped image

References:

  1. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, Heusel et al. 2017
  2. Demystifying MMD GANs, Bińkowski et al. 2018
  3. Improved Precision and Recall Metric for Assessing Generative Models, Kynkäänniemi et al. 2019
  4. Improved Techniques for Training GANs, Salimans et al. 2016
  5. A Style-Based Generator Architecture for Generative Adversarial Networks, Karras et al. 2018

License

Copyright © 2021, NVIDIA Corporation. All rights reserved.

This work is made available under the Nvidia Source Code License.

Citation

@inproceedings{Karras2020ada,
  title     = {Training Generative Adversarial Networks with Limited Data},
  author    = {Tero Karras and Miika Aittala and Janne Hellsten and Samuli Laine and Jaakko Lehtinen and Timo Aila},
  booktitle = {Proc. NeurIPS},
  year      = {2020}
}

Development

This is a research reference implementation and is treated as a one-time code drop. As such, we do not accept outside code contributions in the form of pull requests.

Acknowledgements

We thank David Luebke for helpful comments; Tero Kuosmanen and Sabu Nadarajan for their support with compute infrastructure; and Edgar Schönfeld for guidance on setting up unconditional BigGAN.

Comments
  • upfirdn2d_plugin Problem

    Describe the bug Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

    Please stop closing people's issues without a confirmed fix for this problem. #2 (comment) does not work and there is no confirmed fix on that issue that was closed without a confirmed fix.

    Please be serious about it and let's work together for a fix instead of ignoring the problem and referring people to a close topic that does not offer any solution to their problem.

    We tried everything proposed. We also tried both CUDA 11.0 and 11.1, with different versions of PyTorch just in case. We are a team of 5 people and we all had the same problem on both Windows and Linux machines, and even in Google Colab, which tells me that this is more than just a configuration problem.

    And no, %pip install ninja did not solve the problem on any of the machines we have in our lab. Also, using verbosity = 'full' does not seem to include any additional helpful information.

    Desktop (please complete the following information):

    Those are the two machines I used

    Machine 1

    • ubuntu 20.04.1,
    • pytorch 1.7.1
    • CUDA 11.1,
    • RTX 3090

    Machine 2

    • Windows 10
    • pytorch 1.7.1
    • CUDA 11.1, also tried with CUDA 11.0
    • NVIDIA driver version 461.40
    • RTX 3090
    opened by ghost 37
  • RuntimeError: CUDA error: no kernel image is available for execution on the device

    I'm trying to run the sample code but it raises an error. I'm running on an RTX 3090 with CUDA 11.1 (as the description recommends) and cuDNN 8.0.5. The message is attached below. [image]

    I'm able to run PyTorch with CUDA. [image] Do you have any idea how to solve this problem? Thanks in advance!

    opened by xielongze 26
  • Vast.ai instance - No module named 'upfirdn2d_plugin'

    Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'

    I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
    | N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |
    

    Conda environment is set with conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes (doesn't matter if I try a newer one)

    What I've tried

    First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch with CUDA 11.1.1, which did not help, so I rolled back (made a new env).

    Removed torch_extensions, just as described here: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/11?_pjax=%23js-repo-pjax-container

    Didn't help

    gcc I found this thread and https://github.com/NVlabs/stylegan2-ada-pytorch/issues/35

    And tried installing gcc7 conda install -c conda-forge/label/gcc7 gcc_linux-64 (didn't help)

    and even gcc5 conda install -c psi4 gcc-5 The latter sent me in a weird loop and I've abandoned this path.

    This does not help either https://github.com/NVlabs/stylegan2-ada-pytorch/issues/2#issuecomment-773275680

    Google Colab works fine and has ubuntu 18.04 with gcc 7.5.0 installed which I am trying to mimic. Hope that is the correct logic.

    UPD: Another instance with gcc 7.5.0 throws the same error as well

    gcc --version
    gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
    Copyright (C) 2017 Free Software Foundation, Inc.
    

    UPD2 Installing gcc 5 as described here: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04 Did not help either

    UPD3 Sorry for not including the traceback originally

    Traceback (most recent call last):
      File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
      File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
        torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
        keep_intermediates=keep_intermediates)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
        return _import_module_from_library(name, build_directory, is_python_module)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
        file, path, description = imp.find_module(module_name, [path])
      File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
        raise ImportError(_ERR_MSG.format(name), name=name)
    ImportError: No module named 'upfirdn2d_plugin'
    
      warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
    Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
    /root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:
    
    Traceback (most recent call last):
      File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
      File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
        torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
        keep_intermediates=keep_intermediates)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
        return _import_module_from_library(name, build_directory, is_python_module)
      File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
        file, path, description = imp.find_module(module_name, [path])
      File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
        raise ImportError(_ERR_MSG.format(name), name=name)
    ImportError: No module named 'upfirdn2d_plugin'
    
      warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())
    

    Please advise on any possible next steps. No idea where to move next.

    Originally posted by @dokluch in https://github.com/NVlabs/stylegan2-ada-pytorch/issues/2#issuecomment-801715229

    opened by dokluch 18
  • UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown

    I'm getting this error on a Google Colab. This started showing up all of a sudden in the last two days, I've only changed the data, code remained pretty much the same

    tick 0     kimg 0.0      time 2m 59s       sec/tick 7.9     sec/kimg 989.17  maintenance 170.8  cpumem 4.98   gpumem 10.59  augment 0.000
    Evaluating metrics...
    /usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    /usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
      len(cache))
    

    Happened on Tesla T4 and P100. I restarted the hosted runtime a few times, but it made not much difference.

    opened by wandrzej 11
  • Error at Tick 1 : Either Evaluating Metrics or the irrelevant alert in pytorch kicks to windows problem reporting

    Describe the bug Crashing at Tick 0

    To Reproduce

    (base) PS C:\Users\Dunwo> conda activate stylegantry
    (stylegantry) PS C:\Users\Dunwo> cd temp
    (stylegantry) PS C:\Users\Dunwo\temp> cd .\stylegan2-ada-pytorch\
    (stylegantry) PS C:\Users\Dunwo\temp\stylegan2-ada-pytorch> python train.py --data C:\Ganoutput --outdir C:\GanResults

    Training options: { "num_gpus": 1, "image_snapshot_ticks": 50, "network_snapshot_ticks": 50, "metrics": [ "fid50k_full" ], "random_seed": 0, "training_set_kwargs": { "class_name": "training.dataset.ImageFolderDataset", "path": "C:\Ganoutput", "use_labels": false, "max_size": 13439, "xflip": false, "resolution": 512 }, "data_loader_kwargs": { "pin_memory": true, "num_workers": 3, "prefetch_factor": 2 }, "G_kwargs": { "class_name": "training.networks.Generator", "z_dim": 512, "w_dim": 512, "mapping_kwargs": { "num_layers": 2 }, "synthesis_kwargs": { "channel_base": 32768, "channel_max": 512, "num_fp16_res": 4, "conv_clamp": 256 } }, "D_kwargs": { "class_name": "training.networks.Discriminator", "block_kwargs": {}, "mapping_kwargs": {}, "epilogue_kwargs": { "mbstd_group_size": 4 }, "channel_base": 32768, "channel_max": 512, "num_fp16_res": 4, "conv_clamp": 256 }, "G_opt_kwargs": { "class_name": "torch.optim.Adam", "lr": 0.0025, "betas": [ 0, 0.99 ], "eps": 1e-08 }, "D_opt_kwargs": { "class_name": "torch.optim.Adam", "lr": 0.0025, "betas": [ 0, 0.99 ], "eps": 1e-08 }, "loss_kwargs": { "class_name": "training.loss.StyleGAN2Loss", "r1_gamma": 6.5536 }, "total_kimg": 25000, "batch_size": 8, "batch_gpu": 8, "ema_kimg": 2.5, "ema_rampup": 0.05, "ada_target": 0.6, "augment_kwargs": { "class_name": "training.augment.AugmentPipe", "xflip": 1, "rotate90": 1, "xint": 1, "scale": 1, "rotate": 1, "aniso": 1, "xfrac": 1, "brightness": 1, "contrast": 1, "lumaflip": 1, "hue": 1, "saturation": 1 }, "run_dir": "C:\GanResults\00014-Ganoutput-auto1" }

    Output directory:   C:\GanResults\00014-Ganoutput-auto1
    Training data:      C:\Ganoutput
    Training duration:  25000 kimg
    Number of GPUs:     1
    Number of images:   13439
    Image resolution:   512
    Conditional model:  False
    Dataset x-flips:    False

    Creating output directory...
    Launching processes...
    Loading training set...

    Num images:  13439
    Image shape: [3, 512, 512]
    Label shape: [0]

    Constructing networks...
    Setting up PyTorch plugin "bias_act_plugin"... Done.
    Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

    Generator             Parameters  Buffers  Output shape        Datatype
    ---                   ---         ---      ---                 ---
    mapping.fc0           262656      -        [8, 512]            float32
    mapping.fc1           262656      -        [8, 512]            float32
    mapping               -           512      [8, 16, 512]        float32
    synthesis.b4.conv1    2622465     32       [8, 512, 4, 4]      float32
    synthesis.b4.torgb    264195      -        [8, 3, 4, 4]        float32
    synthesis.b4:0        8192        16       [8, 512, 4, 4]      float32
    synthesis.b4:1        -           -        [8, 512, 4, 4]      float32
    synthesis.b8.conv0    2622465     80       [8, 512, 8, 8]      float32
    synthesis.b8.conv1    2622465     80       [8, 512, 8, 8]      float32
    synthesis.b8.torgb    264195      -        [8, 3, 8, 8]        float32
    synthesis.b8:0        -           16       [8, 512, 8, 8]      float32
    synthesis.b8:1        -           -        [8, 512, 8, 8]      float32
    synthesis.b16.conv0   2622465     272      [8, 512, 16, 16]    float32
    synthesis.b16.conv1   2622465     272      [8, 512, 16, 16]    float32
    synthesis.b16.torgb   264195      -        [8, 3, 16, 16]      float32
    synthesis.b16:0       -           16       [8, 512, 16, 16]    float32
    synthesis.b16:1       -           -        [8, 512, 16, 16]    float32
    synthesis.b32.conv0   2622465     1040     [8, 512, 32, 32]    float32
    synthesis.b32.conv1   2622465     1040     [8, 512, 32, 32]    float32
    synthesis.b32.torgb   264195      -        [8, 3, 32, 32]      float32
    synthesis.b32:0       -           16       [8, 512, 32, 32]    float32
    synthesis.b32:1       -           -        [8, 512, 32, 32]    float32
    synthesis.b64.conv0   2622465     4112     [8, 512, 64, 64]    float16
    synthesis.b64.conv1   2622465     4112     [8, 512, 64, 64]    float16
    synthesis.b64.torgb   264195      -        [8, 3, 64, 64]      float16
    synthesis.b64:0       -           16       [8, 512, 64, 64]    float16
    synthesis.b64:1       -           -        [8, 512, 64, 64]    float32
    synthesis.b128.conv0  1442561     16400    [8, 256, 128, 128]  float16
    synthesis.b128.conv1  721409      16400    [8, 256, 128, 128]  float16
    synthesis.b128.torgb  132099      -        [8, 3, 128, 128]    float16
    synthesis.b128:0      -           16       [8, 256, 128, 128]  float16
    synthesis.b128:1      -           -        [8, 256, 128, 128]  float32
    synthesis.b256.conv0  426369      65552    [8, 128, 256, 256]  float16
    synthesis.b256.conv1  213249      65552    [8, 128, 256, 256]  float16
    synthesis.b256.torgb  66051       -        [8, 3, 256, 256]    float16
    synthesis.b256:0      -           16       [8, 128, 256, 256]  float16
    synthesis.b256:1      -           -        [8, 128, 256, 256]  float32
    synthesis.b512.conv0  139457      262160   [8, 64, 512, 512]   float16
    synthesis.b512.conv1  69761       262160   [8, 64, 512, 512]   float16
    synthesis.b512.torgb  33027       -        [8, 3, 512, 512]    float16
    synthesis.b512:0      -           16       [8, 64, 512, 512]   float16
    synthesis.b512:1      -           -        [8, 64, 512, 512]   float32
    ---                   ---         ---      ---                 ---
    Total                 28700647    699904   -                   -

    Discriminator  Parameters  Buffers  Output shape        Datatype
    ---            ---         ---      ---                 ---
    b512.fromrgb   256         16       [8, 64, 512, 512]   float16
    b512.skip      8192        16       [8, 128, 256, 256]  float16
    b512.conv0     36928       16       [8, 64, 512, 512]   float16
    b512.conv1     73856       16       [8, 128, 256, 256]  float16
    b512           -           16       [8, 128, 256, 256]  float16
    b256.skip      32768       16       [8, 256, 128, 128]  float16
    b256.conv0     147584      16       [8, 128, 256, 256]  float16
    b256.conv1     295168      16       [8, 256, 128, 128]  float16
    b256           -           16       [8, 256, 128, 128]  float16
    b128.skip      131072      16       [8, 512, 64, 64]    float16
    b128.conv0     590080      16       [8, 256, 128, 128]  float16
    b128.conv1     1180160     16       [8, 512, 64, 64]    float16
    b128           -           16       [8, 512, 64, 64]    float16
    b64.skip       262144      16       [8, 512, 32, 32]    float16
    b64.conv0      2359808     16       [8, 512, 64, 64]    float16
    b64.conv1      2359808     16       [8, 512, 32, 32]    float16
    b64            -           16       [8, 512, 32, 32]    float16
    b32.skip       262144      16       [8, 512, 16, 16]    float32
    b32.conv0      2359808     16       [8, 512, 32, 32]    float32
    b32.conv1      2359808     16       [8, 512, 16, 16]    float32
    b32            -           16       [8, 512, 16, 16]    float32
    b16.skip       262144      16       [8, 512, 8, 8]      float32
    b16.conv0      2359808     16       [8, 512, 16, 16]    float32
    b16.conv1      2359808     16       [8, 512, 8, 8]      float32
    b16            -           16       [8, 512, 8, 8]      float32
    b8.skip        262144      16       [8, 512, 4, 4]      float32
    b8.conv0       2359808     16       [8, 512, 8, 8]      float32
    b8.conv1       2359808     16       [8, 512, 4, 4]      float32
    b8             -           16       [8, 512, 4, 4]      float32
    b4.mbstd       -           -        [8, 513, 4, 4]      float32
    b4.conv        2364416     16       [8, 512, 4, 4]      float32
    b4.fc          4194816     -        [8, 512]            float32
    b4.out         513         -        [8, 1]              float32
    ---            ---         ---      ---                 ---
    Total          28982849    480      -                   -

    Setting up augmentation...
    Distributing across 1 GPUs...
    Setting up training phases...
    Exporting sample images...
    Initializing logs...
    Training for 25000 kimg...

    tick 0     kimg 0.0      time 51s          sec/tick 6.2     sec/kimg 773.03  maintenance 44.9   cpumem 3.61   gpumem 14.76  augment 0.000
    Evaluating metrics...
    C:\Users\Dunwo\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    (stylegantry) PS C:\Users\Dunwo\temp\stylegan2-ada-pytorch>

    Please copy&paste text instead of screenshots for better searchability.

    Expected behavior At this stage I'm expecting GPU usage to ramp up and tick 1 and later ticks to follow. I don't think there should be any Windows problem reporting.

    Screenshots It generates the first tick and log, but it's hard to tell whether it stops when it begins evaluating metrics or when the irrelevant warning comes up. As soon as it gets here, Windows Problem Reporting shows up in Task Manager, but there is no pop-up or alert, and then nothing: no debug output, no errors. It's as if the run has been aborted.

    Desktop:

    • OS: Windows 10
    • PyTorch version: 1.9.0
    • CUDA toolkit version (e.g., CUDA 11.1)
    • NVIDIA driver version: 471.96
    • GPU: RTX 3090
    • Docker: Did not use docker

    Additional context I'm new to this but willing to learn, and not afraid to google my own problems and troubleshoot. The issue here is that there is no debug output or error alert at all, so I have nothing to go on.

    opened by Passingbyposts 10
  • train.py fails when gpus=2 (or something other than gpus=1)

    • OS: CentOS Version 7
    • Python: 3.7.6
    • PyTorch version: 1.7.1+cu110
    • GPU: 2 V100s
    • Docker: No, have not gone that route yet
    • Related posted issues: none that I could find based solely on GPU count

    I am running the GitHub repo for stylegan2-ada-pytorch. With help from others on PyTorch versions, I was able to train successfully with gpus=1, so gpus=1 is working.

    The system I am on has 2 V100s. When I set gpus=2 on "python train.py ...." I receive the following errors (traceback truncated and file references anonymized):

    Distributing across 2 GPUs...
    Setting up training phases...
    Exporting sample images...
    Initializing logs...
    Truncated Traceback (most recent call last):
        torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
      File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/…python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
        while not context.join():
      File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
        raise Exception(msg)
    Exception:

    -- Process 1 terminated with the following error:
    Truncated Traceback (most recent call last):
      File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
        fn(i, *args)
      File …./notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
        training_loop.training_loop(rank=rank, **args)
      File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
        loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
      File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
        training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
    RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0

    opened by metaphorz 9
  • Stuck on Evaluating Metrics

    After downloading the AFHQ dataset and creating a zip file with:

    python dataset_tool.py --source=downloads/afhq/afhq/train/cat --dest=datasets/cat.zip
    

    I start training:

    python train.py --outdir=training-runs --data=datasets/cat.zip --gpus=1
    

    and execution stops at "Evaluating metrics...":

    Discriminator  Parameters  Buffers  Output shape        Datatype
    ---            ---         ---      ---                 ---     
    b512.fromrgb   256         16       [8, 64, 512, 512]   float16 
    b512.skip      8192        16       [8, 128, 256, 256]  float16 
    b512.conv0     36928       16       [8, 64, 512, 512]   float16 
    b512.conv1     73856       16       [8, 128, 256, 256]  float16 
    b512           -           16       [8, 128, 256, 256]  float16 
    b256.skip      32768       16       [8, 256, 128, 128]  float16 
    b256.conv0     147584      16       [8, 128, 256, 256]  float16 
    b256.conv1     295168      16       [8, 256, 128, 128]  float16 
    b256           -           16       [8, 256, 128, 128]  float16 
    b128.skip      131072      16       [8, 512, 64, 64]    float16 
    b128.conv0     590080      16       [8, 256, 128, 128]  float16 
    b128.conv1     1180160     16       [8, 512, 64, 64]    float16 
    b128           -           16       [8, 512, 64, 64]    float16 
    b64.skip       262144      16       [8, 512, 32, 32]    float16 
    b64.conv0      2359808     16       [8, 512, 64, 64]    float16 
    b64.conv1      2359808     16       [8, 512, 32, 32]    float16 
    b64            -           16       [8, 512, 32, 32]    float16 
    b32.skip       262144      16       [8, 512, 16, 16]    float32 
    b32.conv0      2359808     16       [8, 512, 32, 32]    float32 
    b32.conv1      2359808     16       [8, 512, 16, 16]    float32 
    b32            -           16       [8, 512, 16, 16]    float32 
    b16.skip       262144      16       [8, 512, 8, 8]      float32 
    b16.conv0      2359808     16       [8, 512, 16, 16]    float32 
    b16.conv1      2359808     16       [8, 512, 8, 8]      float32 
    b16            -           16       [8, 512, 8, 8]      float32 
    b8.skip        262144      16       [8, 512, 4, 4]      float32 
    b8.conv0       2359808     16       [8, 512, 8, 8]      float32 
    b8.conv1       2359808     16       [8, 512, 4, 4]      float32 
    b8             -           16       [8, 512, 4, 4]      float32 
    b4.mbstd       -           -        [8, 513, 4, 4]      float32 
    b4.conv        2364416     16       [8, 512, 4, 4]      float32 
    b4.fc          4194816     -        [8, 512]            float32 
    b4.out         513         -        [8, 1]              float32 
    ---            ---         ---      ---                 ---     
    Total          28982849    480      -                   -       
    
    Setting up augmentation...
    Distributing across 1 GPUs...
    Setting up training phases...
    Exporting sample images...
    Initializing logs...
    Skipping tfevents export: No module named 'tensorboard'
    Training for 25000 kimg...
    
    tick 0     kimg 0.0      time 50s          sec/tick 11.2    sec/kimg 1397.00 maintenance 39.2   cpumem 4.61   gpumem 10.32  augment 0.000
    Evaluating metrics...
    
    • OS: Ubuntu 18.04
    • PyTorch version 1.8.1
    • CUDA toolkit version 11.1
    • NVIDIA Driver Version: 460.80
    • GPU nvidia T4
    • Docker: did you use Docker? no

    What might be the reason for this behavior?

    opened by Adblu 9
  • Error building extension 'upfirdn2d_plugin' and 'bias_act_plugin'

    I have a bug similar to this issue: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/39

    However, I think it's a bit different. I get similar errors for both upfirdn2d_plugin and bias_act_plugin.

    Here's the stack:

    Traceback (most recent call last):
      File "train.py", line 538, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 782, in main
        rv = self.invoke(ctx)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\decorators.py", line 21, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "train.py", line 531, in main
        subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
      File "train.py", line 383, in subprocess_fn
        training_loop.training_loop(rank=rank, **args)
      File "Y:\projects\stylegan2ada\training\training_loop.py", line 166, in training_loop
        img = misc.print_module_summary(G, [z, c])
      File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 212, in print_module_summary
        outputs = module(*inputs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Y:\projects\stylegan2ada\training\networks.py", line 499, in forward
        img = self.synthesis(ws, **synthesis_kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Y:\projects\stylegan2ada\training\networks.py", line 471, in forward
        x, img = block(x, img, cur_ws, **block_kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Y:\projects\stylegan2ada\training\networks.py", line 405, in forward
        x = self.conv0(x, next(w_iter), fused_modconv=fused_modconv, **layer_kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "Y:\projects\stylegan2ada\training\networks.py", line 300, in forward
        padding=self.padding, resample_filter=self.resample_filter, flip_weight=flip_weight, fused_modconv=fused_modconv)
      File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 101, in decorator
        return fn(*args, **kwargs)
      File "Y:\projects\stylegan2ada\training\networks.py", line 65, in modulated_conv2d
        x = conv2d_resample.conv2d_resample(x=x, w=weight.to(x.dtype), f=resample_filter, up=up, down=down, padding=padding, flip_weight=flip_weight)
      File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 101, in decorator
        return fn(*args, **kwargs)
      File "Y:\projects\stylegan2ada\torch_utils\ops\conv2d_resample.py", line 139, in conv2d_resample
        x = upfirdn2d.upfirdn2d(x=x, f=f, padding=[px0+pxt,px1+pxt,py0+pyt,py1+pyt], gain=up**2, flip_filter=flip_filter)
      File "Y:\projects\stylegan2ada\torch_utils\ops\upfirdn2d.py", line 160, in upfirdn2d
        if impl == 'cuda' and x.device.type == 'cuda' and _init():
      File "Y:\projects\stylegan2ada\torch_utils\ops\upfirdn2d.py", line 31, in _init
        _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
      File "Y:\projects\stylegan2ada\torch_utils\custom_ops.py", line 110, in get_plugin
        torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1091, in load
        keep_intermediates=keep_intermediates)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1302, in _jit_compile
        is_standalone=is_standalone)
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1407, in _write_ninja_file_and_build_library
        error_prefix=f"Error building extension '{name}'")
      File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1683, in _run_ninja_build
        raise RuntimeError(message) from e
    RuntimeError: Error building extension 'upfirdn2d_plugin': ninja: error: build.ninja:3: lexing error

    It's saying something about a lexing error when ninja is trying to build.

    My nvcc --version returns:

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2020 NVIDIA Corporation
    Built on Tue_Sep_15_19:12:04_Pacific_Daylight_Time_2020
    Cuda compilation tools, release 11.1, V11.1.74
    Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0

    opened by KhoaVo 7
  • RuntimeError: AssertionError:

    Hi. I'm trying to run the sample code but it raises an error.

    tick 0     kimg 0.0      time 1m 02s       sec/tick 15.7    sec/kimg 3923.85 maintenance 46.2   cpumem 3.91   gpumem 37.23  augment 0.000
    Evaluating metrics...
    Traceback (most recent call last):
      File "train.py", line 530, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
        return self.main(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
        rv = self.invoke(ctx)
      File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
        return callback(*args, **kwargs)
      File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "train.py", line 523, in main
        subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
      File "train.py", line 376, in subprocess_fn
        training_loop.training_loop(rank=rank, **args)
      File "/workspace/training/training_loop.py", line 371, in training_loop
        result_dict = metric_main.calc_metric(metric=metric, G=snapshot_data['G_ema'],
      File "/workspace/metrics/metric_main.py", line 45, in calc_metric
        results = _metric_dict[metric](opts)
      File "/workspace/metrics/metric_main.py", line 85, in fid50k_full
        fid = frechet_inception_distance.compute_fid(opts, max_real=None, num_gen=50000)
      File "/workspace/metrics/frechet_inception_distance.py", line 25, in compute_fid
        mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
      File "/workspace/metrics/metric_utils.py", line 216, in compute_feature_stats_for_dataset
        features = detector(images.to(opts.device), **detector_kwargs)
      File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
        result = self.forward(*input, **kwargs)
    torch.jit.Error: The following operation failed in the TorchScript interpreter.
    Traceback of TorchScript, serialized code (most recent call last):
      File "code/__torch__.py", line 20, in forward
          pass
        else:
          ops.prim.RaiseException("AssertionError: ")
          ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
        if use_fp16:
          _4 = 5
    
    Traceback of TorchScript, original code (most recent call last):
      File "c:\p4research\research\tkarras\dnn\gan3support\feature_detectors\inception.py", line 197, in forward
        def forward(self, img, return_features: bool = False, use_fp16: bool = False, no_output_bias: bool = False):
            batch_size, channels, height, width = img.shape # [NCHW]
            assert channels == 3
            ~~~~~~~~~~~~~~~~~~~~ <--- HERE
    
            # Cast to float.
    RuntimeError: AssertionError:
    

    Do you have any idea how to solve this problem? Thanks in advance
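
    The traceback above stops at `assert channels == 3` inside the Inception feature detector, so the metric apparently received images whose channel count is not 3. One possible cause, which the issue does not confirm, is a dataset of grayscale or RGBA images. A minimal sketch of normalizing a channels-first image to three channels, using plain nested lists; `to_three_channels` is a hypothetical helper and not part of the repository:

```python
from typing import List

Plane = List[List[float]]   # one channel as rows of pixel values
Image = List[Plane]         # channels-first image: [C][H][W]


def to_three_channels(image: Image) -> Image:
    """Return a 3-channel version of a channels-first image."""
    channels = len(image)
    if channels == 3:
        return image
    if channels == 1:
        # Grayscale: repeat the single plane for R, G and B.
        return [image[0], image[0], image[0]]
    if channels == 4:
        # RGBA: keep R, G, B and drop the alpha plane.
        return image[:3]
    raise ValueError(f"unsupported channel count: {channels}")


gray = [[[0.5, 0.5], [0.5, 0.5]]]              # 1 x 2 x 2
rgba = [[[1.0]], [[0.0]], [[0.0]], [[0.25]]]   # 4 x 1 x 1
print(len(to_three_channels(gray)), len(to_three_channels(rgba)))
```

    In practice the same conversion would more likely be applied to the image files themselves before building the dataset archive.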

    opened by mulkong 7
  • raise ProcessExitedException( torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV

    Describe the bug

    Evaluating metrics...
    /mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    /mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    /mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    /mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
      return forward_call(*input, **kwargs)
    Traceback (most recent call last):
      File "train.py", line 538, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
        return self.main(*args, **kwargs)
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1062, in main
        rv = self.invoke(ctx)
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 763, in invoke
        return __callback(*args, **kwargs)
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "train.py", line 533, in main
        torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
        return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
        while not context.join():
      File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
        raise ProcessExitedException(
    torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV
    
    /mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 68 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    
    

    To Reproduce python train.py --outdir=training_runs --data=anime_trainB.zip --gpus=4

    Server

    • OS: Linux Ubuntu 18.04.1
    • PyTorch version pytorch 1.9.0
    • CUDA toolkit version CUDA 11.4
    • NVIDIA driver version 470.57.02
    • GPU four RTX 3090

    Additional context

    opened by zhanjiahui 5
  • Gradient Accumulation Control

    I've noticed that the control of gradient accumulation is a bit challenging - but perhaps I'm not familiar enough with the code. Is there a bit of guidance on how to adjust the code to increase the amount of accumulation before the weight updates?

    In particular, when running on a card with lower memory at 256x256 it takes me about 1 minute for 1kimg, and takes about 4 minutes per 1kimg when I go to 512x512 (4x makes sense to me due to the scaling of the resolution). However, because my batch size falls to 1 to accommodate my RAM requirements, I get GAN collapse. To avoid this, I've successfully reduced the learning rate to about 1/4 of the default, which seems to fix GAN collapse. Problem is that while I am still processing 1kimg every 4 minutes, I only achieve 1/4 of the weight update, so effectively I'm finding my training time to achieve similar results at the lower resolution has increased by a factor of 16, when it should only have increased by ~4x due to the bigger resolution.

    I would expect if I could just increase the gradient accumulation by 4x, I could keep the higher learning rate and avoid GAN collapse at the same time. But I'm having a bit of trouble mucking around with this, because the use of batch_gpu and num_gpus in the training_loop.py seems to get overwritten by train.py args, and creates a few other issues when I adjust the code.

    Much appreciated!
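
    The idea behind this question, accumulating gradients over several small micro-batches before a single weight update, can be sketched in plain Python. This is a generic illustration and not the repository's training loop: the toy model `y = w * x` and the helpers `grad_mse` and `accumulated_grad` are hypothetical. It shows that averaging the gradients of equally sized micro-batches reproduces the full-batch gradient, which is why the learning rate tuned for the larger batch could in principle be kept.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[float, float]  # (x, y) pair for the toy model y = w * x


def grad_mse(w: float, batch: Sequence[Sample]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, batch: Sequence[Sample], micro_batch: int) -> float:
    """Average the gradients of equally sized micro-batches (one per round)."""
    if len(batch) % micro_batch != 0:
        raise ValueError("batch size must be a multiple of the micro-batch size")
    rounds = len(batch) // micro_batch
    total = 0.0
    for r in range(rounds):
        chunk = batch[r * micro_batch:(r + 1) * micro_batch]
        # Scale each round by 1/rounds so the sum is a mean over the full batch.
        total += grad_mse(w, chunk) / rounds
    return total


data: List[Sample] = [(1.0, 2.0), (2.0, 3.5), (3.0, 6.5), (4.0, 8.0)]
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch=1)  # four rounds of size 1
print(abs(full - accum) < 1e-12)
```

    The number of rounds here plays the role of the accumulation factor asked about; how it maps onto `batch_gpu` and `num_gpus` in the actual code is left open.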

    opened by paradox715 5
  • padding

    In the fallback code of upfirdn2d.py there is:

    padding = [padx, padx, pady, pady]
    padx0, padx1, pady0, pady1 = padding
    

    In another repo (https://github.com/rosinality/stylegan2-pytorch/blob/master/op/upfirdn2d.py) there is:

    if len(pad) == 2:
            pad = (pad[0], pad[1], pad[0], pad[1])
    

    It seems the order is not the same (and I had a dimension error when upsampling here). Maybe a mistake?
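
    The two snippets quoted above expand a two-element padding differently, which may account for the mismatch. A minimal sketch, assuming the first reads the pair as per-axis amounts (x, y) and the second reads it as (before, after) amounts shared by both axes; the helper names are hypothetical and both return values are ordered `(x0, x1, y0, y1)`:

```python
from typing import Sequence, Tuple

Pad4 = Tuple[int, int, int, int]  # (x0, x1, y0, y1)


def expand_per_axis(padding: Sequence[int]) -> Pad4:
    """Read a pair as (padx, pady): same padding on both sides of each axis."""
    padx, pady = padding
    return (padx, padx, pady, pady)


def expand_before_after(pad: Sequence[int]) -> Pad4:
    """Read a pair as (pad0, pad1): the same before/after padding on both axes."""
    pad0, pad1 = pad
    return (pad0, pad1, pad0, pad1)


pair = (2, 1)
print(expand_per_axis(pair))      # x padded by 2 on each side, y by 1
print(expand_before_after(pair))  # both axes padded by 2 before and 1 after
```

    The two readings agree only when both elements of the pair are equal, so code ported between the two conventions would need asymmetric pairs converted explicitly.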

    opened by aRavanel 1
  • AssertionError : list(image.shape) == self.image_shape

    I want to train the network. I prepared the training data in PNG format and passed the folder path. The first time it went well, but after I added new images the program raises an AssertionError in training/dataset.py, line 88, in __getitem__: assert list(image.shape) == self.image_shape. But the images I passed all have the same shape (128, 128); I checked, and the first image is the same size. Can anyone help me?
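
    One way to narrow this down is to group the files by their full array shape, channel axis included, since two images that are both 128×128 can still differ in channel count (grayscale, RGB, RGBA). A sketch under that assumption; `find_shape_outliers` is a hypothetical helper and the example shapes are made up. In practice the dictionary would be filled by loading each file with an image library.

```python
from collections import Counter
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_outliers(shapes: Dict[str, Shape]) -> List[str]:
    """Return the file names whose shape differs from the most common shape."""
    if not shapes:
        return []
    majority, _ = Counter(shapes.values()).most_common(1)[0]
    return sorted(name for name, shape in shapes.items() if shape != majority)


# Made-up example data in (channels, height, width) order.
example = {
    "img0.png": (3, 128, 128),
    "img1.png": (3, 128, 128),
    "img2.png": (1, 128, 128),   # grayscale
    "img3.png": (4, 128, 128),   # RGB plus alpha
}
print(find_shape_outliers(example))
```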

    opened by ku60 0
  • Multi-Label support?

    Does this implementation support conditioning with multiple labels? Or what does c_dim stand for?

    Kind regards!

    Edit: I was also wondering whether training with --aug noaug would correspond to vanilla StyleGAN2. If not, is it possible to do that with other train options?
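
    On the multi-label part of the question: a common way to represent several labels at once is a multi-hot vector, whose length would then play the role of `c_dim`. The sketch below only illustrates that encoding; whether this implementation accepts such vectors as conditioning input is not established here, and `multi_hot` is a hypothetical helper.

```python
from typing import Iterable, List


def multi_hot(active: Iterable[int], num_classes: int) -> List[float]:
    """Encode a set of active class indices as a 0/1 vector of fixed length."""
    vec = [0.0] * num_classes
    for idx in active:
        if not 0 <= idx < num_classes:
            raise ValueError(f"class index {idx} outside [0, {num_classes})")
        vec[idx] = 1.0
    return vec


print(multi_hot([0, 3], num_classes=5))  # two labels active at once
print(multi_hot([2], num_classes=5))     # the usual one-hot case
```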

    opened by lebeli 0
  • SyntaxError: invalid character in identifier

    Traceback (most recent call last):
      File "train.py", line 20, in <module>
        from training import training_loop
      File "/content/FcF-Inpainting/training/training_loop.py", line 1
        import os
              ^

    SyntaxError: invalid character in identifier
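
    This error is often caused by an invisible or non-ASCII character in the source file, for example a non-breaking space pasted from a web page, although the issue does not confirm that this is the cause here. A small sketch for locating such characters in a piece of source text; `find_non_ascii` is a hypothetical helper and the sample string is made up:

```python
import unicodedata
from typing import List, Tuple


def find_non_ascii(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, Unicode name) for every non-ASCII character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((line_no, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


# A non-breaking space (U+00A0) hidden between "import" and "os".
source = "import\u00a0os\nimport sys\n"
print(find_non_ascii(source))
```

    To check a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.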
    opened by jcrbsa 0
  • TypeError: run_G() missing 1 required positional argument: 'c'

    After running the following command:

    python3 train.py \
        --outdir=$OUTPUT_PATH \
        --img_data=$TRAIN_PATH \
        --gpus 1 \
        --gamma 10 \
        --aug 'noaug' \
        --metrics True \
        --eval_img_data $VAL_PATH \
        --batch 32 
    

    I get the following error:

    Traceback (most recent call last):
      File "train.py", line 523, in <module>
        main() # pylint: disable=no-value-for-parameter
      File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
        return self.main(*args, **kwargs)
      File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1055, in main
        rv = self.invoke(ctx)
      File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
        return __callback(*args, **kwargs)
      File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
        return f(get_current_context(), *args, **kwargs)
      File "train.py", line 516, in main
        subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
      File "train.py", line 391, in subprocess_fn
        training_loop.training_loop(rank=rank, **args)
      File "/content/drive/MyDrive/env/FcF-Inpainting/training/training_loop.py", line 327, in training_loop
        loss.accumulate_gradients(phase=phase.name, erased_img=erased_img, real_img=real_img, mask=mask, real_c=real_c, gen_c=gen_c, sync=sync, gain=gain)
      File "/content/drive/MyDrive/env/FcF-Inpainting/training/losses/loss.py", line 65, in accumulate_gradients
        gen_img, _ = self.run_G(g_inputs, gen_c, sync=sync) # May get synced by Gpl.
    TypeError: run_G() missing 1 required positional argument: 'c'
    
    opened by jcrbsa 0
Owner
NVIDIA Research Projects
StyleGAN2 with adaptive discriminator augmentation (ADA) - Official TensorFlow implementation

StyleGAN2 with adaptive discriminator augmentation (ADA) — Official TensorFlow implementation Training Generative Adversarial Networks with Limited Da

NVIDIA Research Projects 1.7k Dec 29, 2022
StyleGAN2-ada for practice

This version of the newest PyTorch-based StyleGAN2-ada is intended mostly for fellow artists, who rarely look at scientific metrics, but rather need a working creative tool. Tested on Python 3.7 + PyTorch 1.7.1, requires FFMPEG for sequence-to-video conversions. For more explicit details refer to the original implementations.

vadim epstein 170 Nov 16, 2022
A colab notebook for training Stylegan2-ada on colab, transfer learning onto your own dataset.

Stylegan2-Ada-Google-Colab-Starter-Notebook A no thrills colab notebook for training Stylegan2-ada on colab. transfer learning onto your own dataset h

Harnick Khera 66 Dec 16, 2022
Cartoon-StyleGan2 🙃 : Fine-tuning StyleGAN2 for Cartoon Face Generation

Fine-tuning StyleGAN2 for Cartoon Face Generation

Jihye Back 520 Jan 4, 2023
Non-Official Pytorch implementation of "Face Identity Disentanglement via Latent Space Mapping" https://arxiv.org/abs/2005.07728 Using StyleGAN2 instead of StyleGAN

Face Identity Disentanglement via Latent Space Mapping - Implement in pytorch with StyleGAN 2 Description Pytorch implementation of the paper Face Ide

Daniel Roich 58 Dec 24, 2022
StyleGAN2 - Official TensorFlow Implementation

StyleGAN2 - Official TensorFlow Implementation

NVIDIA Research Projects 10.1k Dec 28, 2022
Navigating StyleGAN2 w latent space using CLIP

Navigating StyleGAN2 w latent space using CLIP an attempt to build sth with the official SG2-ADA Pytorch impl kinda inspired by Generating Images from

Mike K. 55 Dec 6, 2022
StyleGAN2 Webtoon / Anime Style Toonify

StyleGAN2 Webtoon / Anime Style Toonify Korea Webtoon or Japanese Anime Character Stylegan2 base high Quality 1024x1024 / 512x512 Generate and Transfe

null 121 Dec 21, 2022
Pretrained models for Jax/Flax: StyleGAN2, GPT2, VGG, ResNet.

Matthias Wright 169 Dec 26, 2022
Fine-tuning StyleGAN2 for Cartoon Face Generation

Cartoon-StyleGAN 🙃 : Fine-tuning StyleGAN2 for Cartoon Face Generation Abstract Recent studies have shown remarkable success in the unsupervised imag

Jihye Back 520 Jan 4, 2023
A collection of pre-trained StyleGAN2 models trained on different datasets at different resolutions.

Awesome Pretrained StyleGAN2 A collection of pre-trained StyleGAN2 models trained on different datasets at different resolutions. Note the readme is a

Justin 1.1k Dec 24, 2022
A web port of NVlabs' StyleGAN2, to facilitate exploring all kinds of characteristics of StyleGAN networks

This project is a web port of NVlabs' StyleGAN2, to facilitate exploring all kinds of characteristics of StyleGAN networks. Thanks for NVlabs' excelle

K.L. 150 Dec 15, 2022
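ALBERT-pytorch-implementation - ALBERT PyTorch implementation

ALBERT-pytorch-implementation (in development). This implementation is meant to help with understanding the concepts of the model; for now the variable names are written out in detail, and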

BG Kim 3 Oct 6, 2022
Official PyTorch implementation for paper Context Matters: Graph-based Self-supervised Representation Learning for Medical Images

Context Matters: Graph-based Self-supervised Representation Learning for Medical Images Official PyTorch implementation for paper Context Matters: Gra

null 49 Nov 23, 2022
Official PyTorch implementation of Joint Object Detection and Multi-Object Tracking with Graph Neural Networks

This is the official PyTorch implementation of our paper: "Joint Object Detection and Multi-Object Tracking with Graph Neural Networks". Our project website and video demos are here.

Richard Wang 443 Dec 6, 2022
Official PyTorch implementation of the paper "Image-to-image Translation via Hierarchical Style Disentanglement".

HiSD: Image-to-image Translation via Hierarchical Style Disentanglement Official PyTorch implementation of the paper "Image-to-image Translation

null 364 Dec 14, 2022
Official PyTorch implementation of the paper "Inception Convolution with Efficient Dilation Search" (CVPR 2021 Oral).

IC-Conv This repository is an official implementation of the paper Inception Convolution with Efficient Dilation Search. Getting Started Download Imag

Jie Liu 111 Dec 31, 2022
Official PyTorch Implementation of Unsupervised Learning of Scene Flow Estimation Fusing with Local Rigidity

UnRigidFlow This is the official PyTorch implementation of UnRigidFlow (IJCAI2019). Here are two sample results (~10MB gif for each) of our unsupervis

Liang Liu 28 Nov 16, 2022
Official implementation of our paper "LLA: Loss-aware Label Assignment for Dense Pedestrian Detection" in PyTorch.

LLA: Loss-aware Label Assignment for Dense Pedestrian Detection This project provides an implementation for "LLA: Loss-aware Label Assignment for Dens

null 35 Dec 6, 2022