StyleGAN2-ADA - Official PyTorch implementation

NVIDIA Research Projects

Last update: Dec 30, 2022

Training Generative Adversarial Networks with Limited Data
Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, Timo Aila
https://arxiv.org/abs/2006.06676

Abstract: Training generative adversarial networks (GAN) using too little data typically leads to discriminator overfitting, causing training to diverge. We propose an adaptive discriminator augmentation mechanism that significantly stabilizes training in limited data regimes. The approach does not require changes to loss functions or network architectures, and is applicable both when training from scratch and when fine-tuning an existing GAN on another dataset. We demonstrate, on several datasets, that good results are now possible using only a few thousand training images, often matching StyleGAN2 results with an order of magnitude fewer images. We expect this to open up new application domains for GANs. We also find that the widely used CIFAR-10 is, in fact, a limited data benchmark, and improve the record FID from 5.59 to 2.42.

For business inquiries, please contact [email protected]
For press and other inquiries, please contact Hector Marinez at [email protected]

Release notes

This repository is a faithful reimplementation of StyleGAN2-ADA in PyTorch, focusing on correctness, performance, and compatibility.

Correctness

Full support for all primary training configurations.
Extensive verification of image quality, training curves, and quality metrics against the TensorFlow version.
Results are expected to match in all cases, excluding the effects of pseudo-random numbers and floating-point arithmetic.

Performance

Training is typically 5%–30% faster compared to the TensorFlow version on NVIDIA Tesla V100 GPUs.
Inference is up to 35% faster in high resolutions, but it may be slightly slower in low resolutions.
GPU memory usage is comparable to the TensorFlow version.
Faster startup time when training new networks (<50s), and also when using pre-trained networks (<4s).
New command line options for tweaking the training performance.

Compatibility

Compatible with old network pickles created using the TensorFlow version.
New ZIP/PNG based dataset format for maximal interoperability with existing 3rd party tools.
TFRecords datasets are no longer supported — they need to be converted to the new format.
New JSON-based format for logs, metrics, and training curves.
Training curves are also exported in the old TFEvents format if TensorBoard is installed.
Command line syntax is mostly unchanged, with a few exceptions (e.g., dataset_tool.py).
Comparison methods are not supported (--cmethod, --dcap, --cfg=cifarbaseline, --aug=adarv).
Truncation is now disabled by default.

Data repository

Path	Description
stylegan2-ada-pytorch	Main directory hosted on Amazon S3
├ ada-paper.pdf	Paper PDF
├ images	Curated example images produced using the pre-trained models
├ videos	Curated example interpolation videos
└ pretrained	Pre-trained models
  ├ ffhq.pkl	FFHQ at 1024x1024, trained using original StyleGAN2
  ├ metfaces.pkl	MetFaces at 1024x1024, transfer learning from FFHQ using ADA
  ├ afhqcat.pkl	AFHQ Cat at 512x512, trained from scratch using ADA
  ├ afhqdog.pkl	AFHQ Dog at 512x512, trained from scratch using ADA
  ├ afhqwild.pkl	AFHQ Wild at 512x512, trained from scratch using ADA
  ├ cifar10.pkl	Class-conditional CIFAR-10 at 32x32
  ├ brecahad.pkl	BreCaHAD at 512x512, trained from scratch using ADA
  ├ paper-fig7c-training-set-sweeps	Models used in Fig.7c (sweep over training set size)
  ├ paper-fig11a-small-datasets	Models used in Fig.11a (small datasets & transfer learning)
  ├ paper-fig11b-cifar10	Models used in Fig.11b (CIFAR-10)
  ├ transfer-learning-source-nets	Models used as starting point for transfer learning
  └ metrics	Feature detectors used by the quality metrics

Requirements

Linux and Windows are supported, but we recommend Linux for performance and compatibility reasons.
1–8 high-end NVIDIA GPUs with at least 12 GB of memory. We have done all testing and development using NVIDIA DGX-1 with 8 Tesla V100 GPUs.
64-bit Python 3.7, PyTorch 1.7.1, and CUDA toolkit 11.0 or newer. Use CUDA toolkit 11.1 or later with RTX 3090. See https://pytorch.org/ for PyTorch install instructions.
Python libraries: pip install click requests tqdm pyspng ninja imageio-ffmpeg==0.4.3. We use the Anaconda3 2020.11 distribution which installs most of these by default.
Docker users: use the provided Dockerfile to build an image with the required library dependencies.

The code relies heavily on custom PyTorch extensions that are compiled on the fly using NVCC. On Windows, the compilation requires Microsoft Visual Studio. We recommend installing Visual Studio Community Edition and adding it into PATH using "C:\Program Files (x86)\Microsoft Visual Studio\\Community\VC\Auxiliary\Build\vcvars64.bat".
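Because the extensions are compiled at first use, a missing build tool only shows up when a plugin fails to load. The following is a minimal sketch of a pre-flight check that looks for the tools on PATH. The helper name find_build_tools and the assumed tool list (nvcc, ninja, and g++ on Linux or cl on Windows as the host compiler) are illustrative and not part of this repository; finding the tools does not guarantee that compilation will succeed.

```python
import shutil
import sys


def find_build_tools(names=None):
    """Map each build tool name to its resolved path, or None if it is not on PATH."""
    if names is None:
        # Assumed tool list: nvcc and ninja everywhere, plus a host C++ compiler.
        host_compiler = "cl" if sys.platform == "win32" else "g++"
        names = ["nvcc", "ninja", host_compiler]
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_build_tools().items():
        print(f"{tool:8s} {path or 'NOT FOUND on PATH'}")
```

On Windows, run the check from a shell where vcvars64.bat has already been executed, since that is what puts cl on PATH.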

Getting started

Pre-trained networks are stored as *.pkl files that can be referenced using local filenames or URLs:

# Generate curated MetFaces images without truncation (Fig.10 left)
python generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

# Generate uncurated MetFaces images with truncation (Fig.12 upper left)
python generate.py --outdir=out --trunc=0.7 --seeds=600-605 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

# Generate class conditional CIFAR-10 images (Fig.17 left, Car)
python generate.py --outdir=out --seeds=0-35 --class=1 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/cifar10.pkl

# Style mixing example
python style_mixing.py --outdir=out --rows=85,100,75,458,1500 --cols=55,821,1789,293 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

Outputs from the above commands are placed under out/*.png, controlled by --outdir. Downloaded network pickles are cached under $HOME/.cache/dnnlib, which can be overridden by setting the DNNLIB_CACHE_DIR environment variable. The default PyTorch extension build directory is $HOME/.cache/torch_extensions, which can be overridden by setting TORCH_EXTENSIONS_DIR.
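The --seeds values in the commands above come in two shapes: a comma-separated list (85,265,297,849) and a range (600-605). As a rough sketch of how such a value could be turned into a list of integers, assuming the range is inclusive at both ends (the helper name parse_seeds is hypothetical and not taken from generate.py):

```python
import re
from typing import List


def parse_seeds(spec: str) -> List[int]:
    """Parse '85,265,297,849' or '600-605' into a list of ints."""
    match = re.fullmatch(r"(\d+)-(\d+)", spec.strip())
    if match:
        first, last = int(match.group(1)), int(match.group(2))
        # Assumption: the upper bound is part of the range.
        return list(range(first, last + 1))
    return [int(part) for part in spec.split(",")]


print(parse_seeds("85,265,297,849"))
print(parse_seeds("600-605"))
```

Under that assumption, --seeds=0-35 would correspond to 36 images.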

Docker: You can run the above curated image example using Docker as follows:

docker build --tag sg2ada:latest .
./docker_run.sh python3 generate.py --outdir=out --trunc=1 --seeds=85,265,297,849 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/metfaces.pkl

Note: The Docker image requires NVIDIA driver release r455.23 or later.

Legacy networks: The above commands can load most of the network pickles created using the previous TensorFlow versions of StyleGAN2 and StyleGAN2-ADA. However, for future compatibility, we recommend converting such legacy pickles into the new format used by the PyTorch version:

python legacy.py \
    --source=https://nvlabs-fi-cdn.nvidia.com/stylegan2/networks/stylegan2-cat-config-f.pkl \
    --dest=stylegan2-cat-config-f.pkl

Projecting images to latent space

To find the matching latent vector for a given image file, run:

python projector.py --outdir=out --target=~/mytargetimg.png \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

For optimal results, the target image should be cropped and aligned similarly to the FFHQ dataset. The above command saves the projection target out/target.png, result out/proj.png, latent vector out/projected_w.npz, and progression video out/proj.mp4. You can render the resulting latent vector by specifying --projected_w for generate.py:

python generate.py --outdir=out --projected_w=out/projected_w.npz \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

Using networks from Python

You can use pre-trained networks in your own Python code as follows:

import pickle
import torch

with open('ffhq.pkl', 'rb') as f:
    G = pickle.load(f)['G_ema'].cuda()  # torch.nn.Module
z = torch.randn([1, G.z_dim]).cuda()    # latent codes
c = None                                # class labels (not used in this example)
img = G(z, c)                           # NCHW, float32, dynamic range [-1, +1]

The above code requires torch_utils and dnnlib to be accessible via PYTHONPATH. It does not need source code for the networks themselves — their class definitions are loaded from the pickle via torch_utils.persistence.
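The generator output in the snippet above is float32 with dynamic range [-1, +1], so it has to be rescaled before it can be saved as an 8-bit image. The sketch below shows one common convention for that mapping on plain Python floats. The helper name to_uint8 and the exact constants are assumptions for illustration; with a tensor, the same arithmetic would be applied elementwise after rearranging NCHW to NHWC.

```python
def to_uint8(value: float) -> int:
    """Map one generator output value from [-1, +1] to an 8-bit pixel value."""
    scaled = value * 127.5 + 128.0          # -1 -> 0.5, 0 -> 128.0, +1 -> 255.5
    clamped = min(max(scaled, 0.0), 255.0)  # guard against values outside [-1, +1]
    return int(clamped)                     # truncate, like a cast to uint8


pixels = [to_uint8(v) for v in (-1.0, 0.0, 1.0, 1.7)]
print(pixels)
```

The clamp matters because generator outputs are not strictly bounded and can overshoot the nominal range slightly.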

The pickle contains three networks. 'G' and 'D' are instantaneous snapshots taken during training, and 'G_ema' represents a moving average of the generator weights over several training steps. The networks are regular instances of torch.nn.Module, with all of their parameters and buffers placed on the CPU at import and gradient computation disabled by default.
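To make the "moving average of the generator weights" concrete, here is a minimal sketch of an exponential moving average with a half-life expressed in thousands of images. The parameter names ema_kimg and ema_rampup mirror the training options printed in the bug report at the end of this page; the update rule itself is an assumption about how such an average is typically maintained, not a copy of the training loop.

```python
from typing import List, Optional


def ema_beta(batch_size: int, ema_kimg: float, cur_nimg: int,
             ema_rampup: Optional[float] = None) -> float:
    """Per-step decay so that the average has a half-life of ema_kimg thousand images."""
    ema_nimg = ema_kimg * 1000
    if ema_rampup is not None:
        # Use a shorter half-life early in training.
        ema_nimg = min(ema_nimg, cur_nimg * ema_rampup)
    return 0.5 ** (batch_size / max(ema_nimg, 1e-8))


def ema_update(avg: List[float], current: List[float], beta: float) -> List[float]:
    """Blend the current weights into the running average: avg*beta + current*(1-beta)."""
    return [c + (a - c) * beta for a, c in zip(avg, current)]


beta = ema_beta(batch_size=8, ema_kimg=2.5, cur_nimg=1_000_000, ema_rampup=0.05)
g_ema = ema_update(avg=[0.0, 1.0], current=[1.0, 1.0], beta=beta)
print(beta, g_ema)
```

With batch_size 8 and ema_kimg 2.5, beta is close to 1, so each step moves G_ema only slightly toward the instantaneous weights of G.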

The generator consists of two submodules, G.mapping and G.synthesis, that can be executed separately. They also support various additional options:

w = G.mapping(z, c, truncation_psi=0.5, truncation_cutoff=8)
img = G.synthesis(w, noise_mode='const', force_fp32=True)

Please refer to generate.py, style_mixing.py, and projector.py for further examples.
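The truncation_psi and truncation_cutoff arguments above control the truncation trick. A minimal sketch of the idea on plain lists, assuming truncation interpolates each of the first truncation_cutoff layers toward an average latent w_avg (the function name truncate and the list-based representation are illustrative; the real networks operate on tensors):

```python
from typing import List, Optional


def truncate(w: List[List[float]], w_avg: List[float], psi: float,
             cutoff: Optional[int] = None) -> List[List[float]]:
    """Pull the first `cutoff` per-layer latents toward w_avg by a factor psi."""
    num_layers = len(w) if cutoff is None else min(cutoff, len(w))
    out = []
    for index, layer in enumerate(w):
        if index < num_layers:
            # psi=1 leaves the latent unchanged; psi=0 collapses it onto w_avg.
            out.append([a + psi * (x - a) for x, a in zip(layer, w_avg)])
        else:
            out.append(list(layer))
    return out


w = [[2.0, 4.0], [2.0, 4.0], [2.0, 4.0]]
print(truncate(w, w_avg=[0.0, 0.0], psi=0.5, cutoff=2))
```

Lower psi trades variation for typicality, and limiting the cutoff restricts the effect to the coarser layers.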

Preparing datasets

Datasets are stored as uncompressed ZIP archives containing uncompressed PNG files and a metadata file dataset.json for labels.
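As an illustration of that layout, the sketch below writes an uncompressed ZIP (ZIP_STORED) containing image files and a dataset.json. The helper name pack_dataset, the file names, and the assumed dataset.json structure ({"labels": [[filename, class_index], ...]}) are assumptions; dataset_tool.py remains the supported way to build datasets, and the placeholder bytes here are not real PNG data.

```python
import io
import json
import zipfile
from typing import Dict, Optional


def pack_dataset(dest, images: Dict[str, bytes],
                 labels: Optional[Dict[str, int]] = None) -> None:
    """Write images (name -> encoded PNG bytes) into an uncompressed ZIP archive."""
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in images.items():
            archive.writestr(name, data)
        if labels is not None:
            # Assumed metadata layout: one [filename, class_index] pair per image.
            meta = {"labels": [[name, labels[name]] for name in images]}
            archive.writestr("dataset.json", json.dumps(meta))


buffer = io.BytesIO()  # stands in for a path such as ~/datasets/mydataset.zip
pack_dataset(
    buffer,
    images={"00000/img00000000.png": b"placeholder", "00000/img00000001.png": b"placeholder"},
    labels={"00000/img00000000.png": 3, "00000/img00000001.png": 7},
)
```

Storing entries without compression avoids recompressing already-encoded image data and lets individual files be read back cheaply.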

Custom datasets can be created from a folder containing images; see python dataset_tool.py --help for more information. Alternatively, the folder can also be used directly as a dataset, without running it through dataset_tool.py first, but doing so may lead to suboptimal performance.

Legacy TFRecords datasets are not supported — see below for instructions on how to convert them.

FFHQ:

Step 1: Download the Flickr-Faces-HQ dataset as TFRecords.

Step 2: Extract images from TFRecords using dataset_tool.py from the TensorFlow version of StyleGAN2-ADA:

# Using dataset_tool.py from TensorFlow version at
# https://github.com/NVlabs/stylegan2-ada/
python ../stylegan2-ada/dataset_tool.py unpack \
    --tfrecord_dir=~/ffhq-dataset/tfrecords/ffhq --output_dir=/tmp/ffhq-unpacked

Step 3: Create ZIP archive using dataset_tool.py from this repository:

# Original 1024x1024 resolution.
python dataset_tool.py --source=/tmp/ffhq-unpacked --dest=~/datasets/ffhq.zip

# Scaled down 256x256 resolution.
python dataset_tool.py --source=/tmp/ffhq-unpacked --dest=~/datasets/ffhq256x256.zip \
    --width=256 --height=256

MetFaces: Download the MetFaces dataset and create ZIP archive:

python dataset_tool.py --source=~/downloads/metfaces/images --dest=~/datasets/metfaces.zip

AFHQ: Download the AFHQ dataset and create ZIP archive:

python dataset_tool.py --source=~/downloads/afhq/train/cat --dest=~/datasets/afhqcat.zip
python dataset_tool.py --source=~/downloads/afhq/train/dog --dest=~/datasets/afhqdog.zip
python dataset_tool.py --source=~/downloads/afhq/train/wild --dest=~/datasets/afhqwild.zip

CIFAR-10: Download the CIFAR-10 python version and convert to ZIP archive:

python dataset_tool.py --source=~/downloads/cifar-10-python.tar.gz --dest=~/datasets/cifar10.zip

LSUN: Download the desired categories from the LSUN project page and convert to ZIP archive:

python dataset_tool.py --source=~/downloads/lsun/raw/cat_lmdb --dest=~/datasets/lsuncat200k.zip \
    --transform=center-crop --width=256 --height=256 --max_images=200000

python dataset_tool.py --source=~/downloads/lsun/raw/car_lmdb --dest=~/datasets/lsuncar200k.zip \
    --transform=center-crop-wide --width=512 --height=384 --max_images=200000

BreCaHAD:

Step 1: Download the BreCaHAD dataset.

Step 2: Extract 512x512 resolution crops using dataset_tool.py from the TensorFlow version of StyleGAN2-ADA:

# Using dataset_tool.py from TensorFlow version at
# https://github.com/NVlabs/stylegan2-ada/
python ../stylegan2-ada/dataset_tool.py extract_brecahad_crops --cropsize=512 \
    --output_dir=/tmp/brecahad-crops --brecahad_dir=~/downloads/brecahad/images

Step 3: Create ZIP archive using dataset_tool.py from this repository:

python dataset_tool.py --source=/tmp/brecahad-crops --dest=~/datasets/brecahad.zip

Training new networks

In its most basic form, training new networks boils down to:

python train.py --outdir=~/training-runs --data=~/mydataset.zip --gpus=1 --dry-run
python train.py --outdir=~/training-runs --data=~/mydataset.zip --gpus=1

The first command is optional; it validates the arguments, prints out the training configuration, and exits. The second command kicks off the actual training.

In this example, the results are saved to a newly created directory ~/training-runs/-mydataset-auto1, controlled by --outdir. The training exports network pickles (network-snapshot-.pkl) and example images (fakes.png) at regular intervals (controlled by --snap). For each pickle, it also evaluates FID (controlled by --metrics) and logs the resulting scores in metric-fid50k_full.jsonl (as well as TFEvents if TensorBoard is installed).
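Run directories are numbered, so each new run gets its own folder under --outdir. A minimal sketch of how such a name could be derived from the folders that already exist, assuming a zero-padded five-digit index followed by the dataset name, base config, and GPU count (the helper next_run_dir_name is hypothetical, and real run names may carry additional suffixes for other options):

```python
import re
from typing import Iterable


def next_run_dir_name(existing: Iterable[str], dataset: str,
                      cfg: str = "auto", gpus: int = 1) -> str:
    """Pick the next free run index and build '<index>-<dataset>-<cfg><gpus>'."""
    matches = (re.match(r"\d+", name) for name in existing)
    used = [int(m.group()) for m in matches if m is not None]
    run_id = max(used, default=-1) + 1
    return f"{run_id:05d}-{dataset}-{cfg}{gpus}"


print(next_run_dir_name([], "mydataset"))
print(next_run_dir_name(["00000-mydataset-auto1", "00013-Ganoutput-auto1", "notes"], "Ganoutput"))
```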

The name of the output directory reflects the training configuration. For example, 00000-mydataset-auto1 indicates that the base configuration was auto1, meaning that the hyperparameters were selected automatically for training on one GPU. The base configuration is controlled by --cfg:

Base config	Description
`auto` (default)	Automatically select reasonable defaults based on resolution and GPU count. Serves as a good starting point for new datasets but does not necessarily lead to optimal results.
`stylegan2`	Reproduce results for StyleGAN2 config F at 1024x1024 using 1, 2, 4, or 8 GPUs.
`paper256`	Reproduce results for FFHQ and LSUN Cat at 256x256 using 1, 2, 4, or 8 GPUs.
`paper512`	Reproduce results for BreCaHAD and AFHQ at 512x512 using 1, 2, 4, or 8 GPUs.
`paper1024`	Reproduce results for MetFaces at 1024x1024 using 1, 2, 4, or 8 GPUs.
`cifar`	Reproduce results for CIFAR-10 (tuned configuration) using 1 or 2 GPUs.
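To give a feel for what "based on resolution and GPU count" can mean for the auto config, the sketch below derives a few hyperparameters from those two inputs. The formulas are an assumption, chosen so that they agree with the training options printed in the bug report at the end of this page (512x512 on 1 GPU: batch_size 8, lr 0.0025, r1_gamma 6.5536, ema_kimg 2.5). They are not stated in the table above, and train.py remains the authority.

```python
def auto_config(res: int, gpus: int) -> dict:
    """Heuristic hyperparameters from image resolution and GPU count (assumed formulas)."""
    batch = max(min(gpus * min(4096 // res, 32), 64), gpus)
    return {
        "batch_size": batch,
        "batch_gpu": batch // gpus,
        "mbstd_group_size": min(batch // gpus, 4),
        "channel_base": int((1 if res >= 512 else 0.5) * 32768),
        "lr": 0.002 if res >= 1024 else 0.0025,
        "r1_gamma": 0.0002 * (res ** 2) / batch,  # grows with pixel count
        "ema_kimg": batch * 10 / 32,
    }


for key, value in auto_config(res=512, gpus=1).items():
    print(f"{key:18s} {value}")
```

Because r1_gamma is only a heuristic guess of this kind, it is worth overriding with --gamma when tuning for a new dataset.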

The training configuration can be further customized with additional command line options:

--aug=noaug disables ADA.
--cond=1 enables class-conditional training (requires a dataset with labels).
--mirror=1 amplifies the dataset with x-flips. Often beneficial, even with ADA.
--resume=ffhq1024 --snap=10 performs transfer learning from FFHQ trained at 1024x1024.
--resume=~/training-runs//network-snapshot-.pkl resumes a previous training run.
--gamma=10 overrides R1 gamma. We recommend trying a couple of different values for each new dataset.
--aug=ada --target=0.7 adjusts ADA target value (default: 0.6).
--augpipe=blit enables pixel blitting but disables all other augmentations.
--augpipe=bgcfnc enables all available augmentations (blit, geom, color, filter, noise, cutout).

Please refer to python train.py --help for the full list.
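The --target option listed above sets the value that ADA tries to hold for its overfitting heuristic r_t, the average sign of the discriminator output on real images. A minimal sketch of the feedback loop, assuming the augmentation probability p is nudged by a fixed step whenever r_t is above or below the target: the default target of 0.6 comes from the option list, while the interval of 4 minibatches and the 500 kimg adjustment speed are assumed constants, and update_augment_p is a hypothetical name.

```python
def update_augment_p(p: float, rt: float, target: float = 0.6,
                     batch_size: int = 8, interval: int = 4,
                     ada_kimg: float = 500) -> float:
    """Move p up when the discriminator looks overfit (rt > target), down otherwise."""
    # Step sized so that p could go from 0 to 1 in ada_kimg thousand images.
    step = (batch_size * interval) / (ada_kimg * 1000)
    direction = (rt > target) - (rt < target)  # sign(rt - target)
    return max(p + direction * step, 0.0)      # p never drops below zero


p = 0.0
for rt in (0.9, 0.9, 0.2):
    p = update_augment_p(p, rt)
print(p)
```

A higher target such as 0.7 tolerates more overfitting before augmentation strength is increased.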

Expected training time

The total training time depends heavily on resolution, number of GPUs, dataset, desired quality, and hyperparameters. The following table lists expected wallclock times to reach different points in the training, measured in thousands of real images shown to the discriminator ("kimg"):

Resolution	GPUs	1000 kimg	25000 kimg	sec/kimg	GPU mem	CPU mem
128x128	1	4h 05m	4d 06h	12.8–13.7	7.2 GB	3.9 GB
128x128	2	2h 06m	2d 04h	6.5–6.8	7.4 GB	7.9 GB
128x128	4	1h 20m	1d 09h	4.1–4.6	4.2 GB	16.3 GB
128x128	8	1h 13m	1d 06h	3.9–4.9	2.6 GB	31.9 GB
256x256	1	6h 36m	6d 21h	21.6–24.2	5.0 GB	4.5 GB
256x256	2	3h 27m	3d 14h	11.2–11.8	5.2 GB	9.0 GB
256x256	4	1h 45m	1d 20h	5.6–5.9	5.2 GB	17.8 GB
256x256	8	1h 24m	1d 11h	4.4–5.5	3.2 GB	34.7 GB
512x512	1	21h 03m	21d 22h	72.5–74.9	7.6 GB	5.0 GB
512x512	2	10h 59m	11d 10h	37.7–40.0	7.8 GB	9.8 GB
512x512	4	5h 29m	5d 17h	18.7–19.1	7.9 GB	17.7 GB
512x512	8	2h 48m	2d 22h	9.5–9.7	7.8 GB	38.2 GB
1024x1024	1	1d 20h	46d 03h	154.3–161.6	8.1 GB	5.3 GB
1024x1024	2	23h 09m	24d 02h	80.6–86.2	8.6 GB	11.9 GB
1024x1024	4	11h 36m	12d 02h	40.1–40.8	8.4 GB	21.9 GB
1024x1024	8	5h 54m	6d 03h	20.2–20.6	8.3 GB	44.7 GB

The above measurements were done using NVIDIA Tesla V100 GPUs with default settings (--cfg=auto --aug=ada --metrics=fid50k_full). "sec/kimg" shows the expected range of variation in raw training performance, as reported in log.txt. "GPU mem" and "CPU mem" show the highest observed memory consumption, excluding the peak at the beginning caused by torch.backends.cudnn.benchmark.
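The sec/kimg column can be turned into a rough wallclock estimate by multiplying it with the planned training length. A small sketch of that arithmetic follows; the helper name wallclock is illustrative.

```python
def wallclock(sec_per_kimg: float, kimg: float) -> str:
    """Convert raw training speed and length into a 'Dd HHh MMm' string."""
    total = round(sec_per_kimg * kimg)
    days, rest = divmod(total, 86400)
    hours, rest = divmod(rest, 3600)
    return f"{days}d {hours:02d}h {rest // 60:02d}m"


# 1024x1024 on 8 GPUs: 20.2 to 20.6 sec/kimg for 25000 kimg.
print(wallclock(20.2, 25000), "to", wallclock(20.6, 25000))
```

For that row the estimate comes out slightly below the 6d 03h listed in the table. This would be consistent with the table also counting time outside raw training, such as metric evaluation and snapshot export, although the text does not break this down.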

In typical cases, 25000 kimg or more is needed to reach convergence, but the results are already quite reasonable around 5000 kimg. 1000 kimg is often enough for transfer learning, which tends to converge significantly faster. The following figure shows example convergence curves for different datasets as a function of wallclock time, using the same settings as above:

Note: --cfg=auto serves as a reasonable first guess for the hyperparameters but it does not necessarily lead to optimal results for a given dataset. For example, --cfg=stylegan2 yields considerably better FID for FFHQ-140k at 1024x1024 than illustrated above. We recommend trying out at least a few different values of --gamma for each new dataset.

Quality metrics

By default, train.py automatically computes FID for each network pickle exported during training. We recommend inspecting metric-fid50k_full.jsonl (or TensorBoard) at regular intervals to monitor the training progress. When desired, the automatic computation can be disabled with --metrics=none to speed up the training slightly (3%–9%).
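When monitoring a run, it can help to pick out the snapshot with the lowest FID from the JSONL log. The sketch below assumes that each line is a JSON object with a results dictionary keyed by metric name and a snapshot_pkl field. That layout, the helper name best_snapshot, and the sample values are assumptions for illustration and should be checked against an actual metric-fid50k_full.jsonl file.

```python
import json
from typing import Optional, Tuple


def best_snapshot(jsonl_text: str,
                  metric: str = "fid50k_full") -> Optional[Tuple[str, float]]:
    """Return (snapshot name, score) for the lowest score found, or None if empty."""
    best = None
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        score = record["results"][metric]
        if best is None or score < best[1]:
            best = (record.get("snapshot_pkl"), score)
    return best


# Made-up sample lines, not real measurements.
sample = "\n".join([
    '{"results": {"fid50k_full": 42.1}, "snapshot_pkl": "network-snapshot-000000.pkl"}',
    '{"results": {"fid50k_full": 17.8}, "snapshot_pkl": "network-snapshot-000200.pkl"}',
    '{"results": {"fid50k_full": 19.3}, "snapshot_pkl": "network-snapshot-000400.pkl"}',
])
print(best_snapshot(sample))
```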

Additional quality metrics can also be computed after the training:

# Previous training run: look up options automatically, save result to JSONL file.
python calc_metrics.py --metrics=pr50k3_full \
    --network=~/training-runs/00000-ffhq10k-res64-auto1/network-snapshot-000000.pkl

# Pre-trained network pickle: specify dataset explicitly, print result to stdout.
python calc_metrics.py --metrics=fid50k_full --data=~/datasets/ffhq.zip --mirror=1 \
    --network=https://nvlabs-fi-cdn.nvidia.com/stylegan2-ada-pytorch/pretrained/ffhq.pkl

The first example looks up the training configuration and performs the same operation as if --metrics=pr50k3_full had been specified during training. The second example downloads a pre-trained network pickle, in which case the values of --mirror and --data must be specified explicitly.

Note that many of the metrics have a significant one-off cost when calculating them for the first time for a new dataset (up to 30min). Also note that the evaluation is done using a different random seed each time, so the results will vary if the same metric is computed multiple times.

We employ the following metrics in the ADA paper. Execution time and GPU memory usage are reported for one NVIDIA Tesla V100 GPU at 1024x1024 resolution:

Metric	Time	GPU mem	Description
`fid50k_full`	13 min	1.8 GB	Fréchet inception distance^[1] against the full dataset
`kid50k_full`	13 min	1.8 GB	Kernel inception distance^[2] against the full dataset
`pr50k3_full`	13 min	4.1 GB	Precision and recall^[3] against the full dataset
`is50k`	13 min	1.8 GB	Inception score^[4] for CIFAR-10

In addition, the following metrics from the StyleGAN and StyleGAN2 papers are also supported:

Metric	Time	GPU mem	Description
`fid50k`	13 min	1.8 GB	Fréchet inception distance against 50k real images
`kid50k`	13 min	1.8 GB	Kernel inception distance against 50k real images
`pr50k3`	13 min	4.1 GB	Precision and recall against 50k real images
`ppl2_wend`	36 min	2.4 GB	Perceptual path length^[5] in W, endpoints, full image
`ppl_zfull`	36 min	2.4 GB	Perceptual path length in Z, full paths, cropped image
`ppl_wfull`	36 min	2.4 GB	Perceptual path length in W, full paths, cropped image
`ppl_zend`	36 min	2.4 GB	Perceptual path length in Z, endpoints, cropped image
`ppl_wend`	36 min	2.4 GB	Perceptual path length in W, endpoints, cropped image

References:

1. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium, Heusel et al. 2017
2. Demystifying MMD GANs, Bińkowski et al. 2018
3. Improved Precision and Recall Metric for Assessing Generative Models, Kynkäänniemi et al. 2019
4. Improved Techniques for Training GANs, Salimans et al. 2016
5. A Style-Based Generator Architecture for Generative Adversarial Networks, Karras et al. 2018

License

This work is made available under the Nvidia Source Code License.

Citation

@inproceedings{Karras2020ada,
  title     = {Training Generative Adversarial Networks with Limited Data},
  author    = {Tero Karras and Miika Aittala and Janne Hellsten and Samuli Laine and Jaakko Lehtinen and Timo Aila},
  booktitle = {Proc. NeurIPS},
  year      = {2020}
}

Development

This is a research reference implementation and is treated as a one-time code drop. As such, we do not accept outside code contributions in the form of pull requests.

Acknowledgements

We thank David Luebke for helpful comments; Tero Kuosmanen and Sabu Nadarajan for their support with compute infrastructure; and Edgar Schönfeld for guidance on setting up unconditional BigGAN.

Comments

upfirdn2d_plugin Problem
Describe the bug Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!

Please stop closing people's issues without a confirmed fix for this problem. #2 (comment) does not work, and that issue was closed without a confirmed fix.

Please be serious about it and let's work together on a fix instead of ignoring the problem and referring people to a closed topic that does not offer any solution to their problem.

We tried everything proposed. We also tried both CUDA 11.0 and 11.1, with different versions of PyTorch just in case. We are a team of 5 people and we all had the same problem on both Windows and Linux machines, and even in Google Colab, which tells me that this is more than just a configuration problem.

And no, %pip install ninja did not solve the problem on any of the machines we have in our lab. Also, using verbosity = 'full' does not seem to include any additional helpful information.

These are the two machines I used:

Machine 1: Ubuntu 20.04.1, PyTorch 1.7.1, CUDA 11.1, RTX 3090

Machine 2: Windows 10, PyTorch 1.7.1, CUDA 11.1 (also tried with CUDA 11.0), NVIDIA driver version 461.40, RTX 3090
opened by ghost 37
RuntimeError: CUDA error: no kernel image is available for execution on the device

I'm trying to run the sample code but it raises an error. I'm running on an RTX 3090 with CUDA 11.1 (as the description recommends) and cuDNN 8.0.5. The message is attached below.

I'm able to run pytorch with cuda. Do you have any idea how to solve this problem? Thanks in advance!

opened by xielongze 26

Vast.ai instance - No module named 'upfirdn2d_plugin'

Stuck here big time with ImportError: No module named 'upfirdn2d_plugin'

I am using a vast.ai instance nvidia/cuda:11.2.1-cudnn8-runtime-ubuntu18.04

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   30C    P0    35W / 250W |      0MiB / 16160MiB |      0%      Default |

Conda environment is set with conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=11.0 -c pytorch --yes (doesn't matter if I try a newer one)

What I've tried

First I made sure my VM has CUDA 11.2 installed. Then I installed a newer torch with CUDA 11.1.1, which did not help, so I rolled back (made a new env).

Removed torch_extensions, just as described here: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/11?_pjax=%23js-repo-pjax-container

Didn't help.

gcc: I found this thread and https://github.com/NVlabs/stylegan2-ada-pytorch/issues/35

I tried installing gcc 7 with conda install -c conda-forge/label/gcc7 gcc_linux-64 (didn't help)

and even gcc 5 with conda install -c psi4 gcc-5. The latter sent me into a weird loop, so I abandoned this path.

This does not help either: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/2#issuecomment-773275680

Google Colab works fine and has Ubuntu 18.04 with gcc 7.5.0 installed, which I am trying to mimic. I hope that is the correct logic.

UPD: Another instance with gcc 7.5.0 throws the same error as well

gcc --version
gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0
Copyright (C) 2017 Free Software Foundation, Inc.

UPD2: Installing gcc 5 as described here did not help either: https://askubuntu.com/questions/1087150/install-gcc-5-on-ubuntu-18-04

UPD3 Sorry for not including the traceback originally

Setting up PyTorch plugin "upfirdn2d_plugin"... Failed!
/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py:34: UserWarning: Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:

Traceback (most recent call last):
  File "/root/stylegan2-ada-pytorch/torch_utils/ops/upfirdn2d.py", line 32, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "/root/stylegan2-ada-pytorch/torch_utils/custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 997, in load
    keep_intermediates=keep_intermediates)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1213, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "/usr/local/envs/stylegan/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1560, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "/usr/local/envs/stylegan/lib/python3.7/imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'upfirdn2d_plugin'

  warnings.warn('Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation. Details:\n\n' + traceback.format_exc())

Please advise on any possible next steps. I have no idea where to go next.

Originally posted by @dokluch in https://github.com/NVlabs/stylegan2-ada-pytorch/issues/2#issuecomment-801715229

opened by dokluch 18

UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown

I'm getting this error on Google Colab. It started showing up all of a sudden in the last two days. I've only changed the data; the code remained pretty much the same.

tick 0     kimg 0.0      time 2m 59s       sec/tick 7.9     sec/kimg 989.17  maintenance 170.8  cpumem 4.98   gpumem 10.59  augment 0.000
Evaluating metrics...
/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
/usr/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 34 leaked semaphores to clean up at shutdown
  len(cache))

This happened on both Tesla T4 and P100. I restarted the hosted runtime a few times, but it made little difference.

opened by wandrzej 11

Error at Tick 1: Either Evaluating Metrics or the irrelevant alert in PyTorch kicks to Windows problem reporting
Describe the bug Crashing at Tick 0

To Reproduce

(base) PS C:\Users\Dunwo> conda activate stylegantry
(stylegantry) PS C:\Users\Dunwo> cd temp
(stylegantry) PS C:\Users\Dunwo\temp> cd .\stylegan2-ada-pytorch\
(stylegantry) PS C:\Users\Dunwo\temp\stylegan2-ada-pytorch> python train.py --data C:\Ganoutput --outdir C:\GanResults

Training options:
{
  "num_gpus": 1,
  "image_snapshot_ticks": 50,
  "network_snapshot_ticks": 50,
  "metrics": [ "fid50k_full" ],
  "random_seed": 0,
  "training_set_kwargs": {
    "class_name": "training.dataset.ImageFolderDataset",
    "path": "C:\Ganoutput",
    "use_labels": false,
    "max_size": 13439,
    "xflip": false,
    "resolution": 512
  },
  "data_loader_kwargs": { "pin_memory": true, "num_workers": 3, "prefetch_factor": 2 },
  "G_kwargs": {
    "class_name": "training.networks.Generator",
    "z_dim": 512,
    "w_dim": 512,
    "mapping_kwargs": { "num_layers": 2 },
    "synthesis_kwargs": { "channel_base": 32768, "channel_max": 512, "num_fp16_res": 4, "conv_clamp": 256 }
  },
  "D_kwargs": {
    "class_name": "training.networks.Discriminator",
    "block_kwargs": {},
    "mapping_kwargs": {},
    "epilogue_kwargs": { "mbstd_group_size": 4 },
    "channel_base": 32768,
    "channel_max": 512,
    "num_fp16_res": 4,
    "conv_clamp": 256
  },
  "G_opt_kwargs": { "class_name": "torch.optim.Adam", "lr": 0.0025, "betas": [ 0, 0.99 ], "eps": 1e-08 },
  "D_opt_kwargs": { "class_name": "torch.optim.Adam", "lr": 0.0025, "betas": [ 0, 0.99 ], "eps": 1e-08 },
  "loss_kwargs": { "class_name": "training.loss.StyleGAN2Loss", "r1_gamma": 6.5536 },
  "total_kimg": 25000,
  "batch_size": 8,
  "batch_gpu": 8,
  "ema_kimg": 2.5,
  "ema_rampup": 0.05,
  "ada_target": 0.6,
  "augment_kwargs": {
    "class_name": "training.augment.AugmentPipe",
    "xflip": 1, "rotate90": 1, "xint": 1, "scale": 1, "rotate": 1, "aniso": 1, "xfrac": 1,
    "brightness": 1, "contrast": 1, "lumaflip": 1, "hue": 1, "saturation": 1
  },
  "run_dir": "C:\GanResults\00014-Ganoutput-auto1"
}

Output directory: C:\GanResults\00014-Ganoutput-auto1
Training data: C:\Ganoutput
Training duration: 25000 kimg
Number of GPUs: 1
Number of images: 13439
Image resolution: 512
Conditional model: False
Dataset x-flips: False

Creating output directory...
Launching processes...
Loading training set...

Num images: 13439
Image shape: [3, 512, 512]
Label shape: [0]

Constructing networks...
Setting up PyTorch plugin "bias_act_plugin"... Done.
Setting up PyTorch plugin "upfirdn2d_plugin"... Done.

Generator	Parameters	Buffers	Output shape	Datatype
mapping.fc0	262656	-	[8, 512]	float32
mapping.fc1	262656	-	[8, 512]	float32
mapping	-	512	[8, 16, 512]	float32
synthesis.b4.conv1	2622465	32	[8, 512, 4, 4]	float32
synthesis.b4.torgb	264195	-	[8, 3, 4, 4]	float32
synthesis.b4:0	8192	16	[8, 512, 4, 4]	float32
synthesis.b4:1	-	-	[8, 512, 4, 4]	float32
synthesis.b8.conv0	2622465	80	[8, 512, 8, 8]	float32
synthesis.b8.conv1	2622465	80	[8, 512, 8, 8]	float32
synthesis.b8.torgb	264195	-	[8, 3, 8, 8]	float32
synthesis.b8:0	-	16	[8, 512, 8, 8]	float32
synthesis.b8:1	-	-	[8, 512, 8, 8]	float32
synthesis.b16.conv0	2622465	272	[8, 512, 16, 16]	float32
synthesis.b16.conv1	2622465	272	[8, 512, 16, 16]	float32
synthesis.b16.torgb	264195	-	[8, 3, 16, 16]	float32
synthesis.b16:0	-	16	[8, 512, 16, 16]	float32
synthesis.b16:1	-	-	[8, 512, 16, 16]	float32
synthesis.b32.conv0	2622465	1040	[8, 512, 32, 32]	float32
synthesis.b32.conv1	2622465	1040	[8, 512, 32, 32]	float32
synthesis.b32.torgb	264195	-	[8, 3, 32, 32]	float32
synthesis.b32:0	-	16	[8, 512, 32, 32]	float32
synthesis.b32:1	-	-	[8, 512, 32, 32]	float32
synthesis.b64.conv0	2622465	4112	[8, 512, 64, 64]	float16
synthesis.b64.conv1	2622465	4112	[8, 512, 64, 64]	float16
synthesis.b64.torgb	264195	-	[8, 3, 64, 64]	float16
synthesis.b64:0	-	16	[8, 512, 64, 64]	float16
synthesis.b64:1	-	-	[8, 512, 64, 64]	float32
synthesis.b128.conv0	1442561	16400	[8, 256, 128, 128]	float16
synthesis.b128.conv1	721409	16400	[8, 256, 128, 128]	float16
synthesis.b128.torgb	132099	-	[8, 3, 128, 128]	float16
synthesis.b128:0	-	16	[8, 256, 128, 128]	float16
synthesis.b128:1	-	-	[8, 256, 128, 128]	float32
synthesis.b256.conv0	426369	65552	[8, 128, 256, 256]	float16
synthesis.b256.conv1	213249	65552	[8, 128, 256, 256]	float16
synthesis.b256.torgb	66051	-	[8, 3, 256, 256]	float16
synthesis.b256:0	-	16	[8, 128, 256, 256]	float16
synthesis.b256:1	-	-	[8, 128, 256, 256]	float32
synthesis.b512.conv0	139457	262160	[8, 64, 512, 512]	float16
synthesis.b512.conv1	69761	262160	[8, 64, 512, 512]	float16
synthesis.b512.torgb	33027	-	[8, 3, 512, 512]	float16
synthesis.b512:0	-	16	[8, 64, 512, 512]	float16
synthesis.b512:1	-	-	[8, 64, 512, 512]	float32
Total	28700647	699904	-	-

Discriminator  Parameters  Buffers  Output shape        Datatype
---            ---         ---      ---                 ---
b512.fromrgb   256         16       [8, 64, 512, 512]   float16
b512.skip      8192        16       [8, 128, 256, 256]  float16
b512.conv0     36928       16       [8, 64, 512, 512]   float16
b512.conv1     73856       16       [8, 128, 256, 256]  float16
b512           -           16       [8, 128, 256, 256]  float16
b256.skip      32768       16       [8, 256, 128, 128]  float16
b256.conv0     147584      16       [8, 128, 256, 256]  float16
b256.conv1     295168      16       [8, 256, 128, 128]  float16
b256           -           16       [8, 256, 128, 128]  float16
b128.skip      131072      16       [8, 512, 64, 64]    float16
b128.conv0     590080      16       [8, 256, 128, 128]  float16
b128.conv1     1180160     16       [8, 512, 64, 64]    float16
b128           -           16       [8, 512, 64, 64]    float16
b64.skip       262144      16       [8, 512, 32, 32]    float16
b64.conv0      2359808     16       [8, 512, 64, 64]    float16
b64.conv1      2359808     16       [8, 512, 32, 32]    float16
b64            -           16       [8, 512, 32, 32]    float16
b32.skip       262144      16       [8, 512, 16, 16]    float32
b32.conv0      2359808     16       [8, 512, 32, 32]    float32
b32.conv1      2359808     16       [8, 512, 16, 16]    float32
b32            -           16       [8, 512, 16, 16]    float32
b16.skip       262144      16       [8, 512, 8, 8]      float32
b16.conv0      2359808     16       [8, 512, 16, 16]    float32
b16.conv1      2359808     16       [8, 512, 8, 8]      float32
b16            -           16       [8, 512, 8, 8]      float32
b8.skip        262144      16       [8, 512, 4, 4]      float32
b8.conv0       2359808     16       [8, 512, 8, 8]      float32
b8.conv1       2359808     16       [8, 512, 4, 4]      float32
b8             -           16       [8, 512, 4, 4]      float32
b4.mbstd       -           -        [8, 513, 4, 4]      float32
b4.conv        2364416     16       [8, 512, 4, 4]      float32
b4.fc          4194816     -        [8, 512]            float32
b4.out         513         -        [8, 1]              float32
---            ---         ---      ---                 ---
Total          28982849    480      -                   -

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Training for 25000 kimg...

tick 0     kimg 0.0      time 51s          sec/tick 6.2     sec/kimg 773.03  maintenance 44.9   cpumem 3.61   gpumem 14.76  augment 0.000
Evaluating metrics...
C:\Users\Dunwo\AppData\Local\Programs\Python\Python39\lib\site-packages\torch\nn\modules\module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at ..\c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
(stylegantry) PS C:\Users\Dunwo\temp\stylegan2-ada-pytorch>

Please copy&paste text instead of screenshots for better searchability.

Expected behavior At this stage I'm expecting GPU usage to ramp up and tick 1 and later ticks to follow. I don't think Windows Problem Reporting should appear at all.

Screenshots It generates the first tick and log, but it's hard to tell whether it stops when it begins evaluating metrics or when the irrelevant warning comes up. Desktop (please complete the following information): As soon as it gets here, a Windows Problem Reporting process appears in Task Manager, but there is no pop-up or alert, and then nothing: no debug output and no errors. It's as if the run has been aborted.

OS: [ Windows 10]

PyTorch version (1.9.0)

CUDA toolkit version (e.g., CUDA 11.1)

NVIDIA driver version 471.96

GPU [ RTX 3090]

Docker: Did not use docker

Additional context I'm new to this but willing to learn, and not afraid to google my own problems and troubleshoot. The issue here is that there is no debug output or error alert at all, so I have nothing to go on.
opened by Passingbyposts 10
train.py fails when gpus=2 (or something other than gpus=1)

OS: CentOS Version 7
Python: 3.7.6
Pytorch Version: 1.7.1+cu110
GPU: 2 V100s
Docker: No, have not gone that route yet
Related Posted Issues: none that I could find based solely on GPU count

I am running the github repo for stylegan2-ada-pytorch. Through the help of others with Pytorch versions, I was able to do successful training with gpus=1. So, gpus=1 is working.

The system I am on has 2 V100s. When I set gpus=2 on "python train.py ...." I receive the following errors: (Traceback truncated and file references anonymized.)

Distributing across 2 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Truncated Traceback (most recent call last):
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/…python3.7/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Truncated Traceback (most recent call last):
  File "…/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File …./notebooks/stylegan2-ada-pytorch/train.py", line 422, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "…/notebooks/stylegan2-ada-pytorch/training/training_loop.py", line 290, in training_loop
    loss.accumulate_gradients(phase=phase.name, real_img=real_img, real_c=real_c, gen_z=gen_z, gen_c=gen_c, sync=sync, gain=gain)
  File "…/notebooks/stylegan2-ada-pytorch/training/loss.py", line 134, in accumulate_gradients
    training_stats.report('Loss/D/loss', loss_Dgen + loss_Dreal)
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 0
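
One way a 4-versus-2 mismatch like this could arise is uneven chunking: if the real images assigned to one GPU and the generator latents are both split into chunks of a per-GPU batch size, but the two totals are not the same multiple of that size, the chunks paired in a later round have different lengths. The sketch below models that in plain Python. The helpers chunk_sizes, paired_rounds and first_mismatch and the numbers 12, 2 and 4 are hypothetical, and whether this is the actual cause in this report is an assumption.

```python
def chunk_sizes(total, chunk):
    """Sizes obtained by splitting `total` items into chunks of `chunk`.

    The last chunk may be smaller, which is also how torch.Tensor.split
    behaves when given an integer chunk size.
    """
    if total <= 0 or chunk <= 0:
        raise ValueError("total and chunk must be positive")
    full, rest = divmod(total, chunk)
    return [chunk] * full + ([rest] if rest else [])


def paired_rounds(real_per_gpu, latents, batch_gpu):
    """Pair generated and real chunk sizes round by round, as zip() would."""
    return list(zip(chunk_sizes(latents, batch_gpu),
                    chunk_sizes(real_per_gpu, batch_gpu)))


def first_mismatch(rounds):
    """Return (round index, generated size, real size) or None if all match."""
    for index, (gen, real) in enumerate(rounds):
        if gen != real:
            return index, gen, real
    return None


# Hypothetical numbers: a total batch of 12 on 2 GPUs, chunks of 4 per GPU.
rounds = paired_rounds(real_per_gpu=12 // 2, latents=12, batch_gpu=4)
print(rounds)                  # [(4, 4), (4, 2)]
print(first_mismatch(rounds))  # (1, 4, 2)
```

If this model applies, choosing a total batch size that is a multiple of the per-GPU batch size times the number of GPUs would keep every round the same length.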

opened by metaphorz 9

Stuck on Evaluating Metrics

After downloading the AFHQ dataset and creating the zip file with:

python dataset_tool.py --source=downloads/afhq/afhq/train/cat --dest=datasets/cat.zip

I start training:

python train.py --outdir=training-runs --data=datasets/cat.zip --gpus=1

and execution stops at "Evaluating metrics...":

Discriminator  Parameters  Buffers  Output shape        Datatype
---            ---         ---      ---                 ---     
b512.fromrgb   256         16       [8, 64, 512, 512]   float16 
b512.skip      8192        16       [8, 128, 256, 256]  float16 
b512.conv0     36928       16       [8, 64, 512, 512]   float16 
b512.conv1     73856       16       [8, 128, 256, 256]  float16 
b512           -           16       [8, 128, 256, 256]  float16 
b256.skip      32768       16       [8, 256, 128, 128]  float16 
b256.conv0     147584      16       [8, 128, 256, 256]  float16 
b256.conv1     295168      16       [8, 256, 128, 128]  float16 
b256           -           16       [8, 256, 128, 128]  float16 
b128.skip      131072      16       [8, 512, 64, 64]    float16 
b128.conv0     590080      16       [8, 256, 128, 128]  float16 
b128.conv1     1180160     16       [8, 512, 64, 64]    float16 
b128           -           16       [8, 512, 64, 64]    float16 
b64.skip       262144      16       [8, 512, 32, 32]    float16 
b64.conv0      2359808     16       [8, 512, 64, 64]    float16 
b64.conv1      2359808     16       [8, 512, 32, 32]    float16 
b64            -           16       [8, 512, 32, 32]    float16 
b32.skip       262144      16       [8, 512, 16, 16]    float32 
b32.conv0      2359808     16       [8, 512, 32, 32]    float32 
b32.conv1      2359808     16       [8, 512, 16, 16]    float32 
b32            -           16       [8, 512, 16, 16]    float32 
b16.skip       262144      16       [8, 512, 8, 8]      float32 
b16.conv0      2359808     16       [8, 512, 16, 16]    float32 
b16.conv1      2359808     16       [8, 512, 8, 8]      float32 
b16            -           16       [8, 512, 8, 8]      float32 
b8.skip        262144      16       [8, 512, 4, 4]      float32 
b8.conv0       2359808     16       [8, 512, 8, 8]      float32 
b8.conv1       2359808     16       [8, 512, 4, 4]      float32 
b8             -           16       [8, 512, 4, 4]      float32 
b4.mbstd       -           -        [8, 513, 4, 4]      float32 
b4.conv        2364416     16       [8, 512, 4, 4]      float32 
b4.fc          4194816     -        [8, 512]            float32 
b4.out         513         -        [8, 1]              float32 
---            ---         ---      ---                 ---     
Total          28982849    480      -                   -       

Setting up augmentation...
Distributing across 1 GPUs...
Setting up training phases...
Exporting sample images...
Initializing logs...
Skipping tfevents export: No module named 'tensorboard'
Training for 25000 kimg...

tick 0     kimg 0.0      time 50s          sec/tick 11.2    sec/kimg 1397.00 maintenance 39.2   cpumem 4.61   gpumem 10.32  augment 0.000
Evaluating metrics...

OS: Ubuntu 18.04
PyTorch version 1.8.1
CUDA toolkit version 11.1
NVIDIA Driver Version: 460.80
GPU nvidia T4
Docker: did you use Docker? no

What might be the reason for this behavior?

opened by Adblu 9

Error building extension 'upfirdn2d_plugin' and 'bias_act_plugin'

I have a bug similar to this issue: https://github.com/NVlabs/stylegan2-ada-pytorch/issues/39

However I think it's a bit different. I get similar errors for both upfirdn2d_plugin and bias_act_plugin

Here's the stack

Traceback (most recent call last):
  File "train.py", line 538, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 782, in main
    rv = self.invoke(ctx)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\click\decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 531, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 383, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "Y:\projects\stylegan2ada\training\training_loop.py", line 166, in training_loop
    img = misc.print_module_summary(G, [z, c])
  File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 212, in print_module_summary
    outputs = module(*inputs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "Y:\projects\stylegan2ada\training\networks.py", line 499, in forward
    img = self.synthesis(ws, **synthesis_kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "Y:\projects\stylegan2ada\training\networks.py", line 471, in forward
    x, img = block(x, img, cur_ws, **block_kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "Y:\projects\stylegan2ada\training\networks.py", line 405, in forward
    x = self.conv0(x, next(w_iter), fused_modconv=fused_modconv, **layer_kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "Y:\projects\stylegan2ada\training\networks.py", line 300, in forward
    padding=self.padding, resample_filter=self.resample_filter, flip_weight=flip_weight, fused_modconv=fused_modconv)
  File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "Y:\projects\stylegan2ada\training\networks.py", line 65, in modulated_conv2d
    x = conv2d_resample.conv2d_resample(x=x, w=weight.to(x.dtype), f=resample_filter, up=up, down=down, padding=padding, flip_weight=flip_weight)
  File "Y:\projects\stylegan2ada\torch_utils\misc.py", line 101, in decorator
    return fn(*args, **kwargs)
  File "Y:\projects\stylegan2ada\torch_utils\ops\conv2d_resample.py", line 139, in conv2d_resample
    x = upfirdn2d.upfirdn2d(x=x, f=f, padding=[px0+pxt,px1+pxt,py0+pyt,py1+pyt], gain=up2, flip_filter=flip_filter)
  File "Y:\projects\stylegan2ada\torch_utils\ops\upfirdn2d.py", line 160, in upfirdn2d
    if impl == 'cuda' and x.device.type == 'cuda' and _init():
  File "Y:\projects\stylegan2ada\torch_utils\ops\upfirdn2d.py", line 31, in _init
    _plugin = custom_ops.get_plugin('upfirdn2d_plugin', sources=sources, extra_cuda_cflags=['--use_fast_math'])
  File "Y:\projects\stylegan2ada\torch_utils\custom_ops.py", line 110, in get_plugin
    torch.utils.cpp_extension.load(name=module_name, verbose=verbose_build, sources=sources, **build_kwargs)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1091, in load
    keep_intermediates=keep_intermediates)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1302, in _jit_compile
    is_standalone=is_standalone)
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1407, in _write_ninja_file_and_build_library
    error_prefix=f"Error building extension '{name}'")
  File "C:\Users\vokho\anaconda3\envs\stylegan\lib\site-packages\torch\utils\cpp_extension.py", line 1683, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'upfirdn2d_plugin': ninja: error: build.ninja:3: lexing error

It's saying something about a lexing error when ninja is trying to build

My nvcc --version returns

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:12:04_Pacific_Daylight_Time_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.relgpu_drvr455TC455_06.29069683_0

opened by KhoaVo 7

RuntimeError: AssertionError:

Hi. I'm trying to run the sample code but it raises an error.

tick 0     kimg 0.0      time 1m 02s       sec/tick 15.7    sec/kimg 3923.85 maintenance 46.2   cpumem 3.91   gpumem 37.23  augment 0.000
Evaluating metrics...
Traceback (most recent call last):
  File "train.py", line 530, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/click/decorators.py", line 21, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 523, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 376, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/workspace/training/training_loop.py", line 371, in training_loop
    result_dict = metric_main.calc_metric(metric=metric, G=snapshot_data['G_ema'],
  File "/workspace/metrics/metric_main.py", line 45, in calc_metric
    results = _metric_dict[metric](opts)
  File "/workspace/metrics/metric_main.py", line 85, in fid50k_full
    fid = frechet_inception_distance.compute_fid(opts, max_real=None, num_gen=50000)
  File "/workspace/metrics/frechet_inception_distance.py", line 25, in compute_fid
    mu_real, sigma_real = metric_utils.compute_feature_stats_for_dataset(
  File "/workspace/metrics/metric_utils.py", line 216, in compute_feature_stats_for_dataset
    features = detector(images.to(opts.device), **detector_kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 744, in _call_impl
    result = self.forward(*input, **kwargs)
torch.jit.Error: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript, serialized code (most recent call last):
  File "code/__torch__.py", line 20, in forward
      pass
    else:
      ops.prim.RaiseException("AssertionError: ")
      ~~~~~~~~~~~~~~~~~~~~~~~ <--- HERE
    if use_fp16:
      _4 = 5

Traceback of TorchScript, original code (most recent call last):
  File "c:\p4research\research\tkarras\dnn\gan3support\feature_detectors\inception.py", line 197, in forward
    def forward(self, img, return_features: bool = False, use_fp16: bool = False, no_output_bias: bool = False):
        batch_size, channels, height, width = img.shape # [NCHW]
        assert channels == 3
        ~~~~~~~~~~~~~~~~~~~~ <--- HERE

        # Cast to float.
RuntimeError: AssertionError:

Do you have any idea how to solve this problem? Thanks in advance

opened by mulkong 7

raise ProcessExitedException( torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV

Describe the bug

Evaluating metrics...
/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/nn/modules/module.py:1051: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at  /pytorch/c10/core/TensorImpl.h:1156.)
  return forward_call(*input, **kwargs)
Traceback (most recent call last):
  File "train.py", line 538, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
    return self.main(*args, **kwargs)
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1062, in main
    rv = self.invoke(ctx)
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/core.py", line 763, in invoke
    return __callback(*args, **kwargs)
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 533, in main
    torch.multiprocessing.spawn(fn=subprocess_fn, args=(args, temp_dir), nprocs=args.num_gpus)
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 130, in join
    raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGSEGV

/mnt/lab/zjh/anaconda3/envs/pytorch_gpu/lib/python3.8/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 68 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

To Reproduce python train.py --outdir=training_runs --data=anime_trainB.zip --gpus=4

Server

OS: Linux Ubuntu 18.04.1
PyTorch version pytorch 1.9.0
CUDA toolkit version CUDA 11.4
NVIDIA driver version 470.57.02
GPU four RTX 3090

Additional context

opened by zhanjiahui 5

Gradient Accumulation Control

I've noticed that the control of gradient accumulation is a bit challenging - but perhaps I'm not familiar enough with the code. Is there a bit of guidance on how to adjust the code to increase the amount of accumulation before the weight updates?

In particular, when running on a card with lower memory at 256x256 it takes me about 1 minute for 1kimg, and takes about 4 minutes per 1kimg when I go to 512x512 (4x makes sense to me due to the scaling of the resolution). However, because my batch size falls to 1 to accommodate my RAM requirements, I get GAN collapse. To avoid this, I've successfully reduced the learning rate to about 1/4 of the default, which seems to fix GAN collapse. Problem is that while I am still processing 1kimg every 4 minutes, I only achieve 1/4 of the weight update, so effectively I'm finding my training time to achieve similar results at the lower resolution has increased by a factor of 16, when it should only have increased by ~4x due to the bigger resolution.

I would expect if I could just increase the gradient accumulation by 4x, I could keep the higher learning rate and avoid GAN collapse at the same time. But I'm having a bit of trouble mucking around with this, because the use of batch_gpu and num_gpus in the training_loop.py seems to get overwritten by train.py args, and creates a few other issues when I adjust the code.
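
The expectation above rests on a general property: averaging the gradients of several small micro-batches, each weighted by its share of the samples, gives the same gradient as one large batch. Here is a minimal plain-Python sketch of that property on a toy one-parameter model. The functions grad_mse, accumulated_grad and accumulation_rounds and the toy data are hypothetical, and the relation rounds = batch_size / (batch_gpu * num_gpus) is an assumption about how such a loop is usually organized, not a statement about this repository's code.

```python
def grad_mse(w, batch):
    """Gradient w.r.t. w of the mean squared error of y ~ w * x over `batch`."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, data, micro_batch):
    """Accumulate micro-batch gradients, each weighted by its share of the data."""
    total = 0.0
    for start in range(0, len(data), micro_batch):
        chunk = data[start:start + micro_batch]
        total += grad_mse(w, chunk) * len(chunk) / len(data)
    return total


def accumulation_rounds(batch_size, batch_gpu, num_gpus):
    """Number of forward/backward rounds per weight update (assumed relation)."""
    per_round = batch_gpu * num_gpus
    if batch_size % per_round:
        raise ValueError("batch_size must be a multiple of batch_gpu * num_gpus")
    return batch_size // per_round


data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # toy (x, y) pairs
full = grad_mse(0.5, data)
accum = accumulated_grad(0.5, data, micro_batch=1)
print(abs(full - accum) < 1e-9)  # True: same update direction and size
print(accumulation_rounds(batch_size=4, batch_gpu=1, num_gpus=1))  # 4
```

Under that assumed relation, keeping one image per GPU in memory while setting the total batch size to 4 would give four accumulation rounds per update.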

Much appreciated!

opened by paradox715 5
padding
In the fallback code of upfirdn2d.py there is:

padding = [padx, padx, pady, pady]
padx0, padx1, pady0, pady1 = padding

In another repo (https://github.com/rosinality/stylegan2-pytorch/blob/master/op/upfirdn2d.py) there is:

if len(pad) == 2: pad = (pad[0], pad[1], pad[0], pad[1])

The order does not seem to be the same (and I got a dimension error when upsampling here). Maybe a mistake?
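
The two snippets can be compared side by side. The sketch below wraps each quoted line in a small function; the function names are hypothetical, and reading the four-value result as (x before, x after, y before, y after) in both cases is an assumption. Under that reading, the two-value input means different things: (padx, pady) in the first snippet, (before, after) in the second.

```python
def expand_as_x_y(padding):
    """First snippet: two values are (padx, pady), each used on both sides."""
    if len(padding) == 2:
        padx, pady = padding
        padding = [padx, padx, pady, pady]
    padx0, padx1, pady0, pady1 = padding
    return padx0, padx1, pady0, pady1


def expand_as_before_after(pad):
    """Second snippet: two values are (before, after), reused for both axes."""
    if len(pad) == 2:
        pad = (pad[0], pad[1], pad[0], pad[1])
    return tuple(pad)


print(expand_as_x_y([1, 2]))            # (1, 1, 2, 2)
print(expand_as_before_after((1, 2)))   # (1, 2, 1, 2)
print(expand_as_x_y([3, 3]) == expand_as_before_after((3, 3)))  # True
```

The results agree only when both values are equal, so asymmetric two-value padding written for one convention would need to be spelled out as four values before being passed to the other.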
opened by aRavanel 1
AssertionError : list(image.shape) == self.image_shape

I intended to train a network. I prepared the training data in PNG format and passed the folder path. The first time it went well, but after I added new images the program raises an AssertionError in training/dataset.py, line 88, in __getitem__: assert list(image.shape) == self.image_shape. But the images I passed all have the same shape (128, 128); I checked, and the first image is the same size. Can anyone help me?
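
A two-dimensional size such as (128, 128) does not show the channel count, so one possible cause (an assumption, not confirmed by the report) is that the newly added PNGs differ in color type, for example grayscale or RGBA next to RGB. The sketch below reads only the PNG header with the standard library and counts how many files have each (channels, height, width). The helpers png_shape and shape_counts are hypothetical names.

```python
import struct
from collections import Counter
from pathlib import Path

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Stored samples per pixel for each PNG color type:
# 0 grayscale, 2 RGB, 3 palette index, 4 grayscale + alpha, 6 RGBA.
CHANNELS = {0: 1, 2: 3, 3: 1, 4: 2, 6: 4}


def png_shape(header):
    """Return (channels, height, width) from the first 26 bytes of a PNG file."""
    if len(header) < 26 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG header")
    width, height, _bit_depth, color_type = struct.unpack(">IIBB", header[16:26])
    return CHANNELS[color_type], height, width


def shape_counts(folder):
    """Count PNG files under `folder` by (channels, height, width)."""
    counts = Counter()
    for path in sorted(Path(folder).rglob("*.png")):
        with open(path, "rb") as f:
            counts[png_shape(f.read(26))] += 1
    return counts


# In-memory demo: a 128x128, 8-bit RGBA header (color type 6).
demo = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">IIBB", 128, 128, 8, 6)
print(png_shape(demo))  # (4, 128, 128)
```

Calling shape_counts on the image folder and finding more than one key in the result would point to the files that need converting. Note that an image loader may expand palette images to RGB, so the count for color type 3 reflects the file, not necessarily the decoded array.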

opened by ku60 0
Multi-Label support?

Does this implementation support conditioning with multiple labels? Or what does c_dim stand for?

Kind regards!

Edit: I was also wondering if training with --aug noaug would correspond to vanilla StyleGAN2. If not, is it possible to do that with other train options?

opened by lebeli 0

SyntaxError: invalid character in identifier

Traceback (most recent call last):
  File "train.py", line 20, in <module>
    from training import training_loop
  File "/content/FcF-Inpainting/training/training_loop.py", line 1
    import os
          ^

SyntaxError: invalid character in identifier
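
On Python versions that print this message, it often means the source line contains a character that looks like ordinary ASCII but is not, such as a no-break space pasted in from a web page. Whether that is the cause here is an assumption. The sketch below lists every non-ASCII character in a piece of source text with its position and Unicode name; find_non_ascii is a hypothetical helper.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, character name) for each non-ASCII character."""
    hits = []
    # Split on "\n" only, so unusual Unicode line separators are reported too.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                name = unicodedata.name(ch, "U+%04X" % ord(ch))
                hits.append((line_no, col, name))
    return hits


sample = "import\u00a0os\nimport sys\n"  # first line uses a no-break space
print(find_non_ascii(sample))  # [(1, 7, 'NO-BREAK SPACE')]
```

To check a real file, its contents can be read with open(path, encoding="utf-8").read() and passed to the function; retyping the reported positions as plain spaces would be the fix if this is indeed the cause.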

opened by jcrbsa 0

TypeError: run_G() missing 1 required positional argument: 'c'

After running the following command:

python3 train.py \
    --outdir=$OUTPUT_PATH \
    --img_data=$TRAIN_PATH \
    --gpus 1 \
    --gamma 10 \
    --aug 'noaug' \
    --metrics True \
    --eval_img_data $VAL_PATH \
    --batch 32

I get the following error:

Traceback (most recent call last):
  File "train.py", line 523, in <module>
    main() # pylint: disable=no-value-for-parameter
  File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/content/drive/MyDrive/env/FcF-Inpainting/virtualenv/lib/python3.7/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "train.py", line 516, in main
    subprocess_fn(rank=0, args=args, temp_dir=temp_dir)
  File "train.py", line 391, in subprocess_fn
    training_loop.training_loop(rank=rank, **args)
  File "/content/drive/MyDrive/env/FcF-Inpainting/training/training_loop.py", line 327, in training_loop
    loss.accumulate_gradients(phase=phase.name, erased_img=erased_img, real_img=real_img, mask=mask, real_c=real_c, gen_c=gen_c, sync=sync, gain=gain)
  File "/content/drive/MyDrive/env/FcF-Inpainting/training/losses/loss.py", line 65, in accumulate_gradients
    gen_img, _ = self.run_G(g_inputs, gen_c, sync=sync) # May get synced by Gpl.
TypeError: run_G() missing 1 required positional argument: 'c'

opened by jcrbsa 0

