NVIDIA Deep Learning Examples for Tensor Cores

Introduction

This repository provides state-of-the-art deep learning examples that are easy to train and deploy, achieving the best reproducible accuracy and performance with the NVIDIA CUDA-X software stack running on NVIDIA Volta, Turing, and Ampere GPUs.

NVIDIA GPU Cloud (NGC) Container Registry

These examples, along with our NVIDIA deep learning software stack, are provided in a monthly updated Docker container on the NGC container registry (https://ngc.nvidia.com). These containers include:

  • The latest NVIDIA examples from this repository
  • The latest NVIDIA contributions shared upstream to the respective framework
  • The latest NVIDIA Deep Learning software libraries, such as cuDNN, NCCL, and cuBLAS, which have all been through a rigorous monthly quality-assurance process to ensure that they provide the best possible performance
  • Monthly release notes for each of the NVIDIA optimized containers

Computer Vision

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
ResNet-50 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
SE-ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
EfficientNet-B0 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-B4 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-WideSE-B0 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-WideSE-B4 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes
nnUNet | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
SSD | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes
ResNet-50 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
ResNeXt101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
SE-ResNeXt-101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
SSD | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
U-Net Ind | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
U-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
U-Net 3D | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
V-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
U-Net Med | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
ResNet-50 | MXNet | - | Yes | Yes | - | - | - | - | - | -

Natural Language Processing

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
BERT | PyTorch | Yes | Yes | Yes | Yes | - | - | Yes | Yes | -
TransformerXL | PyTorch | Yes | Yes | Yes | Yes | - | - | - | Yes | -
GNMT | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
Transformer | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
ELECTRA | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
BERT | TensorFlow | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes | Yes
BERT | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
BioBert | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
TransformerXL | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -
GNMT | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -
Faster Transformer | TensorFlow | - | - | - | - | Yes | - | - | - | -

Recommender Systems

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
DLRM | PyTorch | Yes | Yes | Yes | - | - | Yes | Yes | Yes | Yes
DLRM | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
NCF | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
Wide&Deep | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
Wide&Deep | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
NCF | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
VAE-CF | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -

Speech to Text

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Jasper | PyTorch | Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | Yes
Hidden Markov Model | Kaldi | - | - | Yes | - | - | - | Yes | - | -

Text to Speech

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
FastPitch | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
FastSpeech | PyTorch | - | Yes | Yes | - | Yes | - | - | - | -
Tacotron 2 and WaveGlow | PyTorch | Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | -

Graph Neural Networks

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
SE(3)-Transformer | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -

NVIDIA support

In each of the network READMEs, we indicate the level of support that will be provided. The range is from ongoing updates and improvements to a point-in-time release for thought leadership.

Glossary

Multinode Training
Supported on a pyxis/enroot Slurm cluster.

Deep Learning Compiler (DLC)
TensorFlow XLA and PyTorch JIT and/or TorchScript

Accelerated Linear Algebra (XLA)
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage.
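
For example, here is a minimal sketch of opting a single function into XLA compilation, assuming a recent TensorFlow 2.x release (older releases used experimental_compile instead of jit_compile); the function and tensor shapes are illustrative only, not taken from any model in this repository:

    import tensorflow as tf

    # Request XLA compilation for this function.
    @tf.function(jit_compile=True)
    def dense_layer(x, w, b):
        return tf.nn.relu(tf.matmul(x, w) + b)

    x = tf.random.normal([8, 128])
    w = tf.random.normal([128, 64])
    b = tf.zeros([64])
    y = dense_layer(x, w, b)  # compiled by XLA on the first call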

PyTorch JIT and/or TorchScript
TorchScript is a way to create serializable and optimizable models from PyTorch code. It is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can then be run in a high-performance environment such as C++.
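
As a minimal sketch (the tiny module below is an illustrative stand-in, not one of the models in this repository), a PyTorch model can be scripted and saved for later deployment:

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):  # illustrative stand-in for a real model
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(128, 10)

        def forward(self, x):
            return torch.relu(self.fc(x))

    model = TinyNet().eval()
    scripted = torch.jit.script(model)   # or torch.jit.trace(model, example_input)
    scripted.save("tiny_net.pt")         # can later be loaded from Python or C++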

Automatic Mixed Precision (AMP)
Automatic Mixed Precision (AMP) enables mixed precision training on Volta, Turing, and NVIDIA Ampere GPU architectures automatically.
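
A minimal sketch of one PyTorch training step with AMP (the model, optimizer, and data below are illustrative, and a CUDA-capable GPU is assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()     # scales the loss to avoid FP16 underflow
    loss_fn = nn.CrossEntropyLoss()

    data = torch.randn(32, 128, device="cuda")
    target = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in mixed precision
        loss = loss_fn(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()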

TensorFloat-32 (TF32)
TensorFloat-32 (TF32) is a math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
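
In PyTorch, for example, TF32 use can be inspected and toggled explicitly; the defaults differ across framework releases, so treat this only as a sketch:

    import torch

    # TF32 applies to matmuls and cuDNN convolutions on Ampere (and newer) GPUs.
    print(torch.backends.cuda.matmul.allow_tf32)
    print(torch.backends.cudnn.allow_tf32)

    # Opt out to force full FP32 math, e.g. when validating numerics.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False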

Jupyter Notebooks (NB)
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Feedback / Contributions

We're posting these examples on GitHub to better support the community, facilitate feedback, and collect and implement contributions using GitHub Issues and pull requests. We welcome all contributions!

Known issues

In each of the network READMEs, we indicate any known issues and encourage the community to provide feedback.

Comments
  • Do you have pre-trained models to continue training?

    I'm working on Tacotron 2

    I've tried to continue training from the provided checkpoints JoC_Tacotron2_FP32_PyT_20190306 and JoC_WaveGlow_FP32_PyT_20190306, but it didn't work out.

    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.828569651 (/workspace/tacotron2/dllogger/logger.py:279) run_start
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.837591887 (/workspace/tacotron2/dllogger/logger.py:251) cpu_info: {"num": 16, "name": "Intel(R) Xeon(R) CPU @ 2.00GHz"}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.845489264 (/workspace/tacotron2/dllogger/logger.py:251) mem_info: {"ram": "102G"}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.917979240 (/workspace/tacotron2/dllogger/logger.py:251) gpu_info: {"driver_version": "418.87.00", "num": 1, "name": ["Tesla P100-PCIE-16GB"], "mem": ["16280 MiB"]}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.921807289 (/workspace/tacotron2/dllogger/logger.py:251) args: {"output_directory": "./output/", "dataset_path": "./", "model_name": "Tacotron2", "log_file": "./output/nvlog.json", "anneal_steps": ["500", "1000", "1500"], "anneal_factor": 0.1, "epochs": 1501, "epochs_per_checkpoint": 50, "checkpoint_path": "./JoC_Tacotron2_FP32_PyT_20190306", "seed": 1234, "dynamic_loss_scaling": true, "amp_run": true, "cudnn_enabled": true, "cudnn_benchmark": false, "disable_uniform_initialize_bn_weight": false, "use_saved_learning_rate": false, "learning_rate": 0.001, "weight_decay": 1e-06, "grad_clip_thresh": 1.0, "batch_size": 128, "grad_clip": 5.0, "load_mel_from_disk": false, "training_files": "filelists/ljs_audio_text_train_filelist.txt", "validation_files": "filelists/ljs_audio_text_val_filelist.txt", "text_cleaners": ["english_cleaners"], "max_wav_value": 32768.0, "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "rank": 0, "world_size": 1, "dist_url": "tcp://localhost:23456", "group_name": "group_name", "dist_backend": "nccl", "mask_padding": false, "n_mel_channels": 80, "n_symbols": 148, "symbols_embedding_dim": 512, "encoder_kernel_size": 5, "encoder_n_convolutions": 3, "encoder_embedding_dim": 512, "n_frames_per_step": 1, "decoder_rnn_dim": 1024, "prenet_dim": 256, "max_decoder_steps": 2000, "gate_threshold": 0.5, "p_attention_dropout": 0.1, "p_decoder_dropout": 0.1, "decoder_no_early_stopping": false, "attention_rnn_dim": 1024, "attention_dim": 128, "attention_location_n_filters": 32, "attention_location_kernel_size": 31, "postnet_embedding_dim": 512, "postnet_kernel_size": 5, "postnet_n_convolutions": 5}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.922522545 (/workspace/tacotron2/dllogger/logger.py:251) run_start
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Traceback (most recent call last):
      File "train.py", line 501, in <module>
        main()
      File "train.py", line 350, in main
        args.amp_run, args.checkpoint_path)
      File "train.py", line 202, in load_checkpoint
        torch.cuda.set_rng_state_all(checkpoint['cuda_rng_state_all'])
    KeyError: 'cuda_rng_state_all'
    

    I guess these checkpoints were not made for continuing training.

    Do you have pre-trained models that can be used to continue training?

    opened by maloyan 34
  • segmentation fault when running tensorflow op

    ➜  build git:(tf_multihead_attention) ✗ python ../sample/tensorflow/transformer_fp32.py 1 12 32 12 64
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint8 = np.dtype([("qint8", np.int8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint16 = np.dtype([("qint16", np.int16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      np_resource = np.dtype([("resource", np.ubyte, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint8 = np.dtype([("qint8", np.int8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint16 = np.dtype([("qint16", np.int16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      np_resource = np.dtype([("resource", np.ubyte, 1)])
    Argumentlist: batch_size 1 num_layers 12 seq_len 32
    WARNING: Logging before flag parsing goes to stderr.
    W0819 10:26:09.628401 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:201: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
    
    W0819 10:26:09.628654 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:201: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
    
    W0819 10:26:09.629366 140044507706432 deprecation.py:323] From ../sample/tensorflow/transformer_fp32.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use keras.layers.dense instead.
    W0819 10:26:10.424225 140044507706432 lazy_loader.py:50]
    The TensorFlow contrib module will not be included in TensorFlow 2.0.
    For more information, please see:
      * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
      * https://github.com/tensorflow/addons
      * https://github.com/tensorflow/io (for I/O related ops)
    If you depend on functionality not listed there, please file an issue.
    
    W0819 10:26:12.424194 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:341: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
    
    W0819 10:26:12.424426 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:342: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.
    
    W0819 10:26:12.424533 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:343: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
    
    2019-08-19 10:26:12.424749: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2019-08-19 10:26:12.437915: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
    2019-08-19 10:26:12.639802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-08-19 10:26:12.641082: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5461d20 executing computations on platform CUDA. Devices:
    2019-08-19 10:26:12.641107: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
    2019-08-19 10:26:12.641115: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
    2019-08-19 10:26:12.644194: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
    2019-08-19 10:26:12.650989: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5677000 executing computations on platform Host. Devices:
    2019-08-19 10:26:12.651016: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
    2019-08-19 10:26:12.653295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
    name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
    pciBusID: 0000:5e:00.0
    2019-08-19 10:26:12.653364: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-08-19 10:26:12.654268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
    name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
    pciBusID: 0000:d8:00.0
    2019-08-19 10:26:12.654320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
    2019-08-19 10:26:12.654350: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
    2019-08-19 10:26:12.654457: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654504: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654544: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654584: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.658600: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2019-08-19 10:26:12.658624: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
    2019-08-19 10:26:12.658705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-08-19 10:26:12.658715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
    2019-08-19 10:26:12.658723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y
    2019-08-19 10:26:12.658729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N
    2019-08-19 10:26:13.230271: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
    0 layer_0/attention/self/query/kernel:0 (768, 768)
    1 layer_0/attention/self/query/bias:0 (768,)
    2 layer_0/attention/self/key/kernel:0 (768, 768)
    3 layer_0/attention/self/key/bias:0 (768,)
    4 layer_0/attention/self/value/kernel:0 (768, 768)
    5 layer_0/attention/self/value/bias:0 (768,)
    6 layer_0/attention/output/dense/kernel:0 (768, 768)
    7 layer_0/attention/output/dense/bias:0 (768,)
    8 layer_0/attention/output/LayerNorm/beta:0 (768,)
    9 layer_0/attention/output/LayerNorm/gamma:0 (768,)
    10 layer_0/intermediate/dense/kernel:0 (768, 3072)
    ...
    180 layer_11/attention/self/value/kernel:0 (768, 768)
    181 layer_11/attention/self/value/bias:0 (768,)
    182 layer_11/attention/output/dense/kernel:0 (768, 768)
    183 layer_11/attention/output/dense/bias:0 (768,)
    184 layer_11/attention/output/LayerNorm/beta:0 (768,)
    185 layer_11/attention/output/LayerNorm/gamma:0 (768,)
    186 layer_11/intermediate/dense/kernel:0 (768, 3072)
    187 layer_11/intermediate/dense/bias:0 (3072,)
    188 layer_11/output/dense/kernel:0 (3072, 768)
    189 layer_11/output/dense/bias:0 (768,)
    190 layer_11/output/LayerNorm/beta:0 (768,)
    191 layer_11/output/LayerNorm/gamma:0 (768,)
    [1]    63119 segmentation fault (core dumped)  python ../sample/tensorflow/transformer_fp32.py 1 12 32 12 64
    
    opened by duduscript 21
  • Is it possible to train voices and models with fastpitch using CPU only and/or without NVIDIA Container Toolkit/docker? (doubt)

    I am new to these things, and from everything I've read in the repository instructions and on Google over the past few hours, I'm unsure whether it's worth all the trouble of setting up nvidia-docker, Ubuntu, WSL2, and even joining the Microsoft and NVIDIA insider programs, which I have never done before and which seems like a massive hassle to me. I want to be sure it's worth joining all those programs.

    I say this because apparently my new graphics card is not suited for FastPitch/deep learning training; perhaps my CPU would even be faster for FastPitch training?

    NVIDIA GeForce GTX 1060 Super 6GB, Intel Core i9-9900KF

    From what I've seen online, my card unfortunately doesn't have Tensor Cores or enough VRAM for deep learning. So I ask: is there a way to train FastPitch models using only the CPU, without the GPU and all those requirements such as the NVIDIA Container Toolkit, drivers, WSL, etc.? Note that I don't mind if it's slow; I really want to be able to use text to speech with FastPitch, so I wonder whether the CPU could be faster than this card, or whether the card is still usable despite being weak for deep learning.

    opened by cesm1980 20
  • CUDA error

    Encountering a CUDA runtime error:

      File "/workspace/tacotron2/tacotron2/data_function.py", line 148, in batch_to_gpu
        max_len = torch.max(input_lengths.data).item()
    RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /tmp/pip-req-build-akjifb_7/aten/src/THC/generic/THCTensorMathReduce.cu:94

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000E5F5:00:00.0 Off |                    0 |
    | N/A   50C    P0    59W / 149W |      0MiB / 11441MiB |     97%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    opened by shradda8 20
  • NanLossDuringTrainingError when training BERT large model

    I have been using the BERT FP16 + XLA implementation for several weeks. It works great for BERT Base model training. Recently I started using it to train the Large model with FP16 + XLA. The training went well until around step 344k, when it hit NanLossDuringTrainingError with the message "Model diverged with loss = NaN.". The error stack with TF 1.13.1 is below. Can you provide some insight into what's wrong? Thanks.

    Model diverged with loss = NaN.
    Error recorded from training_loop: NaN loss during training.
    training_loop marked as finished
    WARNING: Reraising captured error
    Traceback (most recent call last):
      File "run_pretraining.py", line 610, in <module>
      File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "run_pretraining.py", line 582, in main
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
        rendezvous.raise_errors()
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
        six.reraise(typ, value, traceback)
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
        saving_listeners=saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
        loss = self._train_model(input_fn, hooks, saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
        return self._train_model_default(input_fn, hooks, saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
        saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
        _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
        run_metadata=run_metadata)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
        run_metadata=run_metadata)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
        raise six.reraise(*original_exc_info)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
        return self._sess.run(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
        run_metadata=run_metadata))
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
        raise NanLossDuringTrainingError

    opened by LiweiPeng 17
  • When XLA is used, bert nvprof shows no compute and communication overlap

    When I trained the BERT model with Horovod and XLA, XLA significantly improved throughput, but it also significantly decreased scalability.

    Our cluster has 16 nodes with 4 V100 32GB GPUs per node. The nodes are linked with 100 Gb Mellanox RoCE RDMA. No NVLink/NVSwitch technology is used.

    When XLA is not used, the scalability from 1 to 16 GPUs is 1.92x. When XLA is used, the scalability drops to 1.8x from 1 to 16 GPUs.

    Environment:
    Framework: TensorFlow
    Framework version: 1.14
    Horovod version: 0.16.4
    MPI version: 3.1.1
    CUDA version: 10.0
    NCCL version: 2.4.7
    Python version: 2.7.5
    OS and version: CentOS 7.4
    GCC version: 4.8.5

    When XLA is not used, nvprof shows that there is good compute and communication overlap (communication is shown as 'mem' in the attached nvprof screenshot), as expected.

    However, when XLA is used, nvprof shows that there is little compute and communication overlap.

    The question is: what causes this lack of compute/communication overlap when XLA is used? Is this a bug or expected XLA behavior?

    opened by LiweiPeng 16
  • [Tacotron2-Waveglow/PyTorch] tacotron & waveglow trt engine inference ERROR

    Related to Model/Framework(s) [Tacotron2-Waveglow/PyTorch]

    Describe the bug

    Inference with the pre-trained models works fine: python inference.py --tacotron2 nvidia_tacotron2pyt_fp16_20190427 --waveglow nvidia_waveglowpyt_fp16_20190427 -o output/ --include-warmup -i phrases/phrase_1_64.txt --fp16 --log-file=output/nvlog_fp16.json

    Then I tried to run inference with the TRT engines.

    The following steps cover Tacotron 2 + WaveGlow TensorRT inference; the models are from the NGC models repository.

    1. tacotron exports to onnx : python export_tacotron2_onnx.py --tacotron2 nvidia_tacotron2pyt_fp16_20190427 -o exports/ --fp16

    2. waveglow 256 channel exports to onnx : python export_waveglow_onnx.py --waveglow audio/nvidia_waveglow256pyt_fp16 --wn-channels 256 -o exports/ --fp16

    3. tacotron & waveglow exports onnx to trt engine : python export_onnx2trt.py -o audio/ --encoder exports/encoder.onnx --decoder exports/decoder_iter.onnx --postnet exports/postnet.onnx --waveglow exports/waveglow.onnx --fp16

    4. tacotron & waveglow trt engine inference : python inference_trt.py -i test.txt -o audio/ --encoder audio/encoder_fp16.engine --decoder audio/decoder_iter_fp16.engine --postnet audio/postnet_fp16.engine --waveglow audio/waveglow_fp16.engine --fp16

    Then I get the following errors:

    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    Running Tacotron2 Encoder
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)
    Running Tacotron2 Decoder
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)

    Steps 1-3 export the models successfully, but inference fails.

    Any suggestions about this problem? I tried searching for this error but found very few related issues.

    I would be thankful for any answer.

    bug 
    opened by RaymondTsao 14
  • [BERT/TF] Multi-node SQUAD fine tuning hangs

    Environment

    1. TensorFlow 1.15
    2. Horovod 0.18.1
    3. OpenMPI 4.0
    4. CUDA 10.0
    5. NCCL 2.4.8
    6. Python 3.5

    Issue

    I tried to run multi-node SQUAD fine tuning on two VMs (each with 8 * V100) using the following command:

    mpirun -np 16 -hostfile HOSTFILE -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -x NCCL_MIN_NRINGS=4 -x TF_CPP_MIN_LOG_LEVEL=0 -x NCCL_DEBUG=INFO -x -x NCCL_ALGO=Ring -x NCCL_BUFFSIZE=8388608 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 python3 DeepLearningExamples/TensorFlow/LanguageModeling/BERT/run_squad.py --vocab_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --bert_config_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_config.json --init_checkpoint=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_model.ckpt --do_train=True --train_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/squad/v1.1/train-v1.1.json --train_batch_size=2 --learning_rate=5e-6 --num_train_epochs=1 --max_seq_length=128 --doc_stride=128 --output_dir=/tmp/pkb --horovod --use_fp16
    

    However, the master rank (rank 0) hangs with the following message (actually, the iterations were not completely stalled; they just progressed very slowly):

    2020-01-23 23:25:03.863322: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:03.928604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:03.940879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.633817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.872561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.924935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:05.165986: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:05.313782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    [2020-01-23 23:26:23. 26353: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by
     subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit diff
    erent tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
    Stalled ranks:
    8: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeigh
    tDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allreduc
    e/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Distr
    ibutedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOpt
    imizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_101
    2_0 ...]
    9: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeigh
    tDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allreduc
    e/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Distr
    ibutedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOpt
    imizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_101
    2_0 ...]
    10: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    11: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    12: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    13: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    14: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    15: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    

    I also tested on a single node with 8 GPUs; it worked without any issues.

    bug 
    opened by changlan 13
  • [FastPitch 1.1/PyTorch] Advice/best practices for good alignment when fine-tuning

    Related to FastPitch 1.1/PyTorch

    Hi @alancucki @rafaelvalle. I've been experimenting with the FastPitch 1.1 update that incorporates the RAD-TTS aligner into FastPitch, since the commit a while back. The alignment mechanism is amazing and enables some great things (e.g. s2s).

    However, I'm unfortunately having some issues with the convergence on some datasets with this approach, compared to 1.0. I've successfully converged an LJSpeech model quite well, but when fine-tuning a pre-trained model (like the one provided), it seems that the alignment is having some real trouble converging.

    I have tried it on 4 datasets so far - 2 male, 2 female. I did get something ALMOST good with one of the female datasets (~9h), but they all converge to a fairly high KL loss (>=0.85 at 40k-120k its, compared to less than 0.35 where I stopped LJ at 25k its). I added soft and hard alignment plots to the logs, and they resemble plots c) and d) from Fig 2 in the RAD-TTS paper. I noticed also in Figure 2 from "One TTS Alignment To Rule Them All", that the convergence speed was lower using RAD-TTS with FastPitch (compared to Tacotron2 durs), before they arrived at a similar point - could this be exacerbated by smaller datasets?

    I have tried experimenting with resuming the LJ optimizer (from the model I trained myself) as well as one newly initialized (from the provided LJ model), with and without including the KL weight warm-up stage. I also tried with and without arpabet, with and without energy conditioning, and also several tweaks to lr scheduling, and other such things, but I can never get anything as good as LJ (in the KL loss at least).

    When running inference, the sentence composition quality varies between datasets, ranging from missing letters to missing words, and for the smaller datasets, quite difficult to understand speech, spoken very fast.

    The same datasets worked very well in the previous Tacotron2+FastPitch set-up, so I'm confident that the data quality is high. Have you by any chance had any successes yourselves with something other than LJ? And would you have any tips/advice for how to better converge the alignment on smaller datasets (with transfer learning)?

    Thank you for all your great work!

    bug 
    opened by DanRuta 12
  • I got "killed" when creating instances from sharded files

    Related to Pytorch/LanguageModeling

    I followed the instructions and preprocessed the downloaded 'bookscorpus' dataset. The sharding step went well, but in the 'create_hdf5_files' step I got "killed" when creating instances from the sharded files, for both the sequence-length-128 and sequence-length-512 cases. Since there is no more information about the error, I have no idea how to fix it. Could you help me out? Thanks a lot.

    opened by TonyTangYu 12
  • Why The Test Result of Transformer NMT Task with 4 GPUs Is Worse Than What Is Reported in Readme

    According to the README, 4 GPUs can achieve a BLEU of 28.35, and even 28.67 when training for more epochs.

    GPU count | Mixed precision BLEU | fp32 BLEU | Mixed precision training time | fp32 training time
    -- | -- | -- | -- | --
    8 | 28.69 | 28.43 | 446 min | 1896 min
    4 | 28.35 | 28.31 | 834 min | 3733 min

    GPU count | Precision | BLEU score | Epochs to train | Training time
    -- | -- | -- | -- | --
    4 | fp16 | 28.67 | 74 | 1925 min
    4 | fp32 | 28.40 | 47 | 5478 min

    However, I ran the code with 4 GPUs without modifying it at all, and the best result I got is 27.63 on my "checkpoint_best.pt", which corresponds to epoch 19 in my case. I ran 80 epochs in total, and the best BLEU over all those epochs is 28.13, from a checkpoint that was not selected as "checkpoint_best.pt" during validation.

    I used the following command line to train the model:

    nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
    --arch transformer_wmt_en_de_big_t2t \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.997)' \
    --adam-eps "1e-9" \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 0.0 \
    --update-freq 2 \
    --warmup-updates 8000 \
    --lr 0.0006 \
    --min-lr 0.0 \
    --dropout 0.1 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 5120 \
    --seed 1 \
    --max-epoch 80 \
    --ignore-case \
    --fp16 \
    --save-dir /workspace/checkpoints \
    --distributed-init-method env:// > train.nohup.out &

    I also tried different warmup-updates and learning-rate values, and the results are similar. The results I got are:

    Test Checkpoint1 | Translated 3003 sentences (84994 tokens) in 25.2s (119.35 sentences/s, 3377.84 tokens/s) | Generate test with beam=4: BLEU4 = 18.11, 50.2/23.5/12.7/7.2 (BP=1.000, ratio=1.041, syslen=67147, reflen=64512) Test Checkpoint2 | Translated 3003 sentences (87704 tokens) in 27.5s (109.17 sentences/s, 3188.43 tokens/s) | Generate test with beam=4: BLEU4 = 21.26, 52.5/26.7/15.5/9.4 (BP=1.000, ratio=1.061, syslen=68450, reflen=64512) Test Checkpoint3 | Translated 3003 sentences (86611 tokens) in 25.8s (116.61 sentences/s, 3363.17 tokens/s) | Generate test with beam=4: BLEU4 = 23.91, 55.5/29.5/17.8/11.2 (BP=1.000, ratio=1.040, syslen=67079, reflen=64512) Test Checkpoint4 | Translated 3003 sentences (86518 tokens) in 25.8s (116.61 sentences/s, 3359.54 tokens/s) | Generate test with beam=4: BLEU4 = 25.26, 56.7/30.9/19.0/12.3 (BP=1.000, ratio=1.035, syslen=66758, reflen=64512) Test Checkpoint5 | Translated 3003 sentences (86768 tokens) in 25.7s (116.96 sentences/s, 3379.47 tokens/s) | Generate test with beam=4: BLEU4 = 25.63, 56.8/31.2/19.4/12.5 (BP=1.000, ratio=1.034, syslen=66698, reflen=64512) Test Checkpoint6 | Translated 3003 sentences (87220 tokens) in 25.8s (116.21 sentences/s, 3375.30 tokens/s) | Generate test with beam=4: BLEU4 = 25.98, 56.9/31.5/19.8/12.9 (BP=1.000, ratio=1.042, syslen=67205, reflen=64512) Test Checkpoint7 | Translated 3003 sentences (87715 tokens) in 25.9s (115.80 sentences/s, 3382.54 tokens/s) | Generate test with beam=4: BLEU4 = 26.24, 57.2/31.8/20.0/13.0 (BP=1.000, ratio=1.045, syslen=67413, reflen=64512) Test Checkpoint8 | Translated 3003 sentences (87808 tokens) in 26.8s (111.88 sentences/s, 3271.39 tokens/s) | Generate test with beam=4: BLEU4 = 26.82, 57.6/32.3/20.5/13.6 (BP=1.000, ratio=1.045, syslen=67444, reflen=64512) Test Checkpoint9 | Translated 3003 sentences (87394 tokens) in 25.6s (117.26 sentences/s, 3412.38 tokens/s) | Generate test with beam=4: BLEU4 = 26.63, 57.8/32.2/20.3/13.3 (BP=1.000, ratio=1.039, syslen=67033, reflen=64512) Test Checkpoint10 | Translated 3003 sentences (86825 tokens) in 25.8s (116.31 sentences/s, 3362.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.10, 58.1/32.7/20.7/13.7 (BP=1.000, ratio=1.031, syslen=66541, reflen=64512) Test Checkpoint11 | Translated 3003 sentences (86850 tokens) in 25.9s (116.11 sentences/s, 3358.03 tokens/s) | Generate test with beam=4: BLEU4 = 27.29, 58.1/32.8/20.9/13.9 (BP=1.000, ratio=1.032, syslen=66563, reflen=64512) Test Checkpoint12 | Translated 3003 sentences (87137 tokens) in 26.2s (114.74 sentences/s, 3329.31 tokens/s) | Generate test with beam=4: BLEU4 = 27.28, 58.2/32.9/20.9/13.8 (BP=1.000, ratio=1.035, syslen=66787, reflen=64512) Test Checkpoint13 | Translated 3003 sentences (86810 tokens) in 25.6s (117.41 sentences/s, 3393.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.26, 58.3/32.9/20.9/13.8 (BP=1.000, ratio=1.031, syslen=66500, reflen=64512) Test Checkpoint14 | Translated 3003 sentences (87359 tokens) in 25.8s (116.30 sentences/s, 3383.15 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.3/33.2/21.3/14.3 (BP=1.000, ratio=1.036, syslen=66830, reflen=64512) Test Checkpoint15 | Translated 3003 sentences (87415 tokens) in 26.3s (114.33 sentences/s, 3327.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.37, 58.1/32.9/21.0/14.0 (BP=1.000, ratio=1.038, syslen=66951, reflen=64512) Test Checkpoint16 | Translated 3003 sentences (87332 tokens) in 26.7s (112.51 sentences/s, 3272.10 tokens/s) | Generate test with beam=4: BLEU4 = 27.33, 58.1/32.9/21.0/13.9 (BP=1.000, 
ratio=1.039, syslen=66998, reflen=64512) Test Checkpoint17 | Translated 3003 sentences (86721 tokens) in 25.9s (116.06 sentences/s, 3351.62 tokens/s) | Generate test with beam=4: BLEU4 = 27.32, 58.4/33.0/20.9/13.8 (BP=1.000, ratio=1.029, syslen=66385, reflen=64512) Test Checkpoint18 | Translated 3003 sentences (87388 tokens) in 26.2s (114.71 sentences/s, 3338.08 tokens/s) | Generate test with beam=4: BLEU4 = 27.57, 58.3/33.1/21.2/14.2 (BP=1.000, ratio=1.038, syslen=66956, reflen=64512) Test Checkpoint19 | Translated 3003 sentences (86919 tokens) in 25.8s (116.28 sentences/s, 3365.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.3/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66642, reflen=64512) Test Checkpoint20 | Translated 3003 sentences (87485 tokens) in 26.1s (115.24 sentences/s, 3357.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.48, 58.1/33.0/21.1/14.1 (BP=1.000, ratio=1.037, syslen=66924, reflen=64512) Test Checkpoint21 | Translated 3003 sentences (86993 tokens) in 26.3s (114.07 sentences/s, 3304.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.5/33.3/21.4/14.3 (BP=1.000, ratio=1.032, syslen=66564, reflen=64512) Test Checkpoint22 | Translated 3003 sentences (87084 tokens) in 25.4s (118.07 sentences/s, 3424.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.87, 58.6/33.3/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66595, reflen=64512) Test Checkpoint23 | Translated 3003 sentences (87013 tokens) in 26.4s (113.92 sentences/s, 3300.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.4/33.2/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66626, reflen=64512) Test Checkpoint24 | Translated 3003 sentences (86741 tokens) in 26.0s (115.49 sentences/s, 3335.84 tokens/s) | Generate test with beam=4: BLEU4 = 27.98, 58.7/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66379, reflen=64512) Test Checkpoint25 | Translated 3003 sentences (86884 tokens) in 25.4s (118.05 sentences/s, 3415.42 tokens/s) | Generate test with beam=4: BLEU4 = 27.94, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66392, reflen=64512) Test Checkpoint26 | Translated 3003 sentences (86840 tokens) in 26.4s (113.68 sentences/s, 3287.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.7/33.5/21.5/14.4 (BP=1.000, ratio=1.028, syslen=66344, reflen=64512) Test Checkpoint27 | Translated 3003 sentences (87050 tokens) in 26.2s (114.45 sentences/s, 3317.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.030, syslen=66451, reflen=64512) Test Checkpoint28 | Translated 3003 sentences (86981 tokens) in 25.8s (116.40 sentences/s, 3371.53 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66488, reflen=64512) Test Checkpoint29 | Translated 3003 sentences (86219 tokens) in 25.6s (117.33 sentences/s, 3368.59 tokens/s) | Generate test with beam=4: BLEU4 = 27.82, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.022, syslen=65941, reflen=64512) Test Checkpoint30 | Translated 3003 sentences (86879 tokens) in 26.9s (111.61 sentences/s, 3229.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint31 | Translated 3003 sentences (87082 tokens) in 26.6s (112.83 sentences/s, 3271.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.6/21.6/14.4 (BP=1.000, ratio=1.032, syslen=66570, reflen=64512) Test Checkpoint32 | Translated 3003 sentences (86677 tokens) in 26.6s (112.93 sentences/s, 3259.43 tokens/s) | Generate test with beam=4: 
BLEU4 = 27.98, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.028, syslen=66289, reflen=64512) Test Checkpoint33 | Translated 3003 sentences (87034 tokens) in 26.2s (114.54 sentences/s, 3319.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.032, syslen=66553, reflen=64512) Test Checkpoint34 | Translated 3003 sentences (87064 tokens) in 26.3s (114.28 sentences/s, 3313.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.4/33.3/21.6/14.4 (BP=1.000, ratio=1.031, syslen=66534, reflen=64512) Test Checkpoint35 | Translated 3003 sentences (86818 tokens) in 26.6s (112.86 sentences/s, 3262.78 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 58.9/33.7/21.7/14.5 (BP=1.000, ratio=1.028, syslen=66336, reflen=64512) Test Checkpoint36 | Translated 3003 sentences (87037 tokens) in 25.9s (115.89 sentences/s, 3358.98 tokens/s) | Generate test with beam=4: BLEU4 = 28.18, 58.8/33.6/21.8/14.6 (BP=1.000, ratio=1.031, syslen=66483, reflen=64512) Test Checkpoint37 | Translated 3003 sentences (86740 tokens) in 25.7s (116.91 sentences/s, 3376.92 tokens/s) | Generate test with beam=4: BLEU4 = 28.19, 58.9/33.7/21.8/14.6 (BP=1.000, ratio=1.026, syslen=66197, reflen=64512) Test Checkpoint38 | Translated 3003 sentences (87084 tokens) in 26.1s (115.05 sentences/s, 3336.24 tokens/s) | Generate test with beam=4: BLEU4 = 28.01, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.032, syslen=66551, reflen=64512) Test Checkpoint39 | Translated 3003 sentences (86972 tokens) in 27.7s (108.47 sentences/s, 3141.58 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.7/33.5/21.7/14.6 (BP=1.000, ratio=1.030, syslen=66456, reflen=64512) Test Checkpoint40 | Translated 3003 sentences (86717 tokens) in 25.7s (116.94 sentences/s, 3376.78 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.7/33.4/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint41 | Translated 3003 sentences (86542 tokens) in 26.0s (115.52 sentences/s, 3329.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.9/33.3/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66127, reflen=64512) Test Checkpoint42 | Translated 3003 sentences (86841 tokens) in 27.1s (110.96 sentences/s, 3208.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.99, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66329, reflen=64512) Test Checkpoint43 | Translated 3003 sentences (86986 tokens) in 26.8s (111.92 sentences/s, 3241.95 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint44 | Translated 3003 sentences (86691 tokens) in 25.6s (117.24 sentences/s, 3384.53 tokens/s) | Generate test with beam=4: BLEU4 = 28.09, 58.8/33.6/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66162, reflen=64512) Test Checkpoint45 | Translated 3003 sentences (86845 tokens) in 26.5s (113.44 sentences/s, 3280.52 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint46 | Translated 3003 sentences (86280 tokens) in 25.7s (116.75 sentences/s, 3354.46 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 59.0/33.6/21.7/14.6 (BP=1.000, ratio=1.021, syslen=65860, reflen=64512) Test Checkpoint47 | Translated 3003 sentences (86857 tokens) in 26.4s (113.64 sentences/s, 3286.92 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.029, syslen=66402, reflen=64512) Test Checkpoint48 | Translated 3003 sentences (87087 tokens) in 26.0s (115.65 sentences/s, 
3353.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.68, 58.4/33.2/21.3/14.2 (BP=1.000, ratio=1.032, syslen=66576, reflen=64512) Test Checkpoint49 | Translated 3003 sentences (86627 tokens) in 25.5s (117.97 sentences/s, 3402.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.02, 59.0/33.6/21.6/14.4 (BP=1.000, ratio=1.026, syslen=66208, reflen=64512) Test Checkpoint50 | Translated 3003 sentences (86529 tokens) in 25.9s (116.09 sentences/s, 3345.07 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66049, reflen=64512) Test Checkpoint51 | Translated 3003 sentences (87095 tokens) in 26.2s (114.50 sentences/s, 3320.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.6/33.4/21.4/14.3 (BP=1.000, ratio=1.030, syslen=66471, reflen=64512) Test Checkpoint52 | Translated 3003 sentences (87160 tokens) in 27.2s (110.54 sentences/s, 3208.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.89, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66559, reflen=64512) Test Checkpoint53 | Translated 3003 sentences (86909 tokens) in 26.1s (114.96 sentences/s, 3326.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.90, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint54 | Translated 3003 sentences (86785 tokens) in 26.1s (114.94 sentences/s, 3321.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.05, 58.8/33.6/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66308, reflen=64512) Test Checkpoint55 | Translated 3003 sentences (86914 tokens) in 25.9s (115.95 sentences/s, 3355.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.76, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.029, syslen=66376, reflen=64512) Test Checkpoint56 | Translated 3003 sentences (86775 tokens) in 26.5s (113.27 sentences/s, 3273.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.75, 58.5/33.2/21.4/14.3 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint57 | Translated 3003 sentences (86522 tokens) in 26.3s (114.39 sentences/s, 3295.88 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.9/33.4/21.5/14.3 (BP=1.000, ratio=1.024, syslen=66052, reflen=64512) Test Checkpoint58 | Translated 3003 sentences (86269 tokens) in 26.1s (114.94 sentences/s, 3301.85 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.7/33.3/21.4/14.2 (BP=1.000, ratio=1.021, syslen=65893, reflen=64512) Test Checkpoint59 | Translated 3003 sentences (86738 tokens) in 25.9s (115.78 sentences/s, 3344.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.5/33.4/21.6/14.5 (BP=1.000, ratio=1.029, syslen=66378, reflen=64512) Test Checkpoint60 | Translated 3003 sentences (86566 tokens) in 25.7s (116.92 sentences/s, 3370.48 tokens/s) | Generate test with beam=4: BLEU4 = 27.85, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.025, syslen=66151, reflen=64512) Test Checkpoint61 | Translated 3003 sentences (86785 tokens) in 25.3s (118.91 sentences/s, 3436.47 tokens/s) | Generate test with beam=4: BLEU4 = 27.74, 58.7/33.3/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66291, reflen=64512) Test Checkpoint62 | Translated 3003 sentences (86261 tokens) in 25.7s (116.79 sentences/s, 3354.79 tokens/s) | Generate test with beam=4: BLEU4 = 27.86, 58.8/33.4/21.5/14.3 (BP=1.000, ratio=1.021, syslen=65898, reflen=64512) Test Checkpoint63 | Translated 3003 sentences (86569 tokens) in 25.1s (119.58 sentences/s, 3447.32 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.025, syslen=66155, reflen=64512) Test Checkpoint64 | Translated 3003 sentences 
(86583 tokens) in 25.8s (116.47 sentences/s, 3357.96 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.5/33.2/21.2/14.1 (BP=1.000, ratio=1.025, syslen=66146, reflen=64512) Test Checkpoint65 | Translated 3003 sentences (86707 tokens) in 26.2s (114.76 sentences/s, 3313.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.78, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66294, reflen=64512) Test Checkpoint66 | Translated 3003 sentences (86478 tokens) in 26.0s (115.55 sentences/s, 3327.54 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66114, reflen=64512) Test Checkpoint67 | Translated 3003 sentences (86564 tokens) in 25.8s (116.40 sentences/s, 3355.20 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.026, syslen=66200, reflen=64512) Test Checkpoint68 | Translated 3003 sentences (86548 tokens) in 26.2s (114.58 sentences/s, 3302.20 tokens/s) | Generate test with beam=4: BLEU4 = 28.08, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.024, syslen=66041, reflen=64512) Test Checkpoint69 | Translated 3003 sentences (86580 tokens) in 25.9s (116.08 sentences/s, 3346.72 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 58.8/33.7/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66178, reflen=64512) Test Checkpoint70 | Translated 3003 sentences (86448 tokens) in 26.1s (115.01 sentences/s, 3310.94 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.023, syslen=65998, reflen=64512) Test Checkpoint71 | Translated 3003 sentences (86832 tokens) in 26.0s (115.69 sentences/s, 3345.26 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66355, reflen=64512) Test Checkpoint72 | Translated 3003 sentences (86550 tokens) in 25.6s (117.18 sentences/s, 3377.25 tokens/s) | Generate test with beam=4: BLEU4 = 27.95, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66092, reflen=64512) Test Checkpoint73 | Translated 3003 sentences (86415 tokens) in 25.4s (118.17 sentences/s, 3400.41 tokens/s) | Generate test with beam=4: BLEU4 = 27.84, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.023, syslen=65990, reflen=64512) Test Checkpoint74 | Translated 3003 sentences (86251 tokens) in 26.2s (114.65 sentences/s, 3292.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.97, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.021, syslen=65889, reflen=64512) Test Checkpoint75 | Translated 3003 sentences (86418 tokens) in 26.1s (115.03 sentences/s, 3310.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.72, 58.6/33.2/21.3/14.2 (BP=1.000, ratio=1.023, syslen=65971, reflen=64512) Test Checkpoint76 | Translated 3003 sentences (86474 tokens) in 25.9s (116.04 sentences/s, 3341.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.2/21.2/14.1 (BP=1.000, ratio=1.023, syslen=66025, reflen=64512) Test Checkpoint77 | Translated 3003 sentences (86100 tokens) in 25.6s (117.20 sentences/s, 3360.35 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 59.1/33.7/21.7/14.5 (BP=1.000, ratio=1.018, syslen=65695, reflen=64512) Test Checkpoint78 | Translated 3003 sentences (86497 tokens) in 26.2s (114.53 sentences/s, 3298.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.4/21.4/14.3 (BP=1.000, ratio=1.024, syslen=66073, reflen=64512) Test Checkpoint79 | Translated 3003 sentences (86905 tokens) in 26.3s (114.22 sentences/s, 3305.35 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.5/33.2/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66327, reflen=64512) 
Test Checkpoint80 | Translated 3003 sentences (86654 tokens) in 26.3s (114.36 sentences/s, 3300.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.65, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.026, syslen=66219, reflen=64512)

    So why am I not able to achieve the results reported in the README? Could you tell me the command line you use to run the Transformer on 4 GPUs?

    Another question: the "Attention Is All You Need" paper uses 0.1 as the initial learning rate, whereas 0.0006 is used here. Why is there such a large difference in learning rate?
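
    For context, the learning rate in "Attention Is All You Need" is set by an inverse-square-root warmup schedule (Section 5.3) rather than a single constant, and its peak value is of the same order as the 0.0006 used here. A minimal sketch of that schedule, assuming the paper's base-model settings (d_model=512, warmup_steps=4000):

    # Inverse-square-root schedule from "Attention Is All You Need" (Section 5.3).
    # Values below assume the base model: d_model=512, warmup_steps=4000.
    def transformer_lr(step, d_model=512, warmup_steps=4000):
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    for step in (100, 1000, 4000, 20000, 100000):
        print(f"step {step:>6d}: lr = {transformer_lr(step):.6f}")
    # The peak value (at step 4000) is about 0.0007, close to the 0.0006 used in this repository.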

    opened by yaoyiran 12
  • DeepLearningExamples/MxNet/Classification/RN50v1.5 --> prepare_imagenet.sh

    Hi Team,

    When I try to run prepare_imagenet.sh, nothing happens: it keeps running with no output and no I/O to the disk. I downloaded the 150 GB ImageNet dataset and extracted it with tar -xvzf imagenet.tar.

    Below is the folder structure I got:

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/
    total 162373888
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ./
    drwxr-xr-x 3 root root           44 Jan  3 07:44 ../
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ILSVRC/
    -rw-r--r-- 1 root root 166022728827 Jan  2 03:09 ILSVRC2017_CLS-LOC.tar.gz
    drwxrwxr-x 5 root root         4096 Feb  9  2015 tiny-imagenet-200/
    -rw-r--r-- 1 root root    248100043 Jan  2 10:00 tiny-imagenet-200.zip

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/
    total 20
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ./
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ../
    drwxr-xr-x 3 root root 4096 Jan  2 04:49 Annotations/
    drwxr-xr-x 3 root root 4096 Jan  2 06:14 Data/
    drwxr-xr-x 3 root root 4096 Jan  2 10:15 ImageSets/

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/
    total 12
    drwxr-xr-x 3 root   root 4096 Jan  2 06:14 ./
    drwxr-xr-x 5 root   root 4096 Jan  2 10:15 ../
    drwxr-xr-x 6 200031 1003 4096 Jan  3 06:21 CLS-LOC/

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/
    total 11808
    drwxr-xr-x    6 200031 1003    4096 Jan  3 06:21 ./
    drwxr-xr-x    3 root   root    4096 Jan  2 06:14 ../
    drwxr-xr-x    2 root   root    4096 Jan  3 06:58 out/
    drwxr-xr-x    2 200031 1003 7979008 May 17  2015 test/
    drwxr-xr-x 1002 200031 1003   65536 Sep 29  2014 train/
    drwxr-xr-x    2 200031 1003 4014080 May 17  2015 val/
    

    Below is the command for data pre-processing

    root@ddb5e2b7ceaa:/workspace/rn50# ./scripts/prepare_imagenet.sh /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/ /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/

    ^CTraceback (most recent call last):
      File "/opt/mxnet/tools/im2rec.py", line 329, in <module>
        make_list(args)
      File "/opt/mxnet/tools/im2rec.py", line 100, in make_list
        image_list = list(image_list)
      File "/opt/mxnet/tools/im2rec.py", line 60, in list_image
        if os.path.isfile(fpath) and (suffix in exts):
      File "/usr/lib/python3.8/genericpath.py", line 30, in isfile
        st = os.stat(path)
    KeyboardInterrupt
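
    For what it's worth, the interrupted frames above are still inside im2rec's file-listing pass. A rough sketch of what that pass does (an approximation for illustration, not the actual /opt/mxnet/tools/im2rec.py): it walks every entry under the given root before writing anything, so pointing it at a directory that also contains the 166 GB ILSVRC2017_CLS-LOC.tar.gz, the test set, and tiny-imagenet can make it look hung even while it is working.

    import os

    # Rough approximation of im2rec's listing step: walk everything under the
    # root and keep only files with recognised image extensions. Nothing is
    # written to disk until this listing finishes.
    def count_images(root, exts=(".jpg", ".jpeg", ".png")):
        count = 0
        for _dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in exts:
                    count += 1
        return count

    print(count_images("/data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC"))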
    
    opened by karanveersingh5623 0
  • Failed to use --pretrained-from-file due to KeyError: 'layer1.block0.se' error

    It seems like the change in https://github.com/NVIDIA/DeepLearningExamples/commit/5843f4e5a1220dd98936e477d8597d8a77320666 didn't consider the case of --pretrained_from_file in this line: https://github.com/NVIDIA/DeepLearningExamples/blob/ca5ae20e3d1af3464159754f758768052c41c607/PyTorch/Classification/ConvNets/image_classification/models/model.py#L123, which results in a failure to load the model file.

    The loading problem can be resolved by changing that line to

    if (pretrained or pretrained_from_file) and hasattr(model, "ngc_checkpoint_remap"):

    But this suggested addition causes inference to fail (e.g., when it is executed with a model I trained from scratch), so I'm probably missing something.
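
    For reference, a minimal sketch of the guarded-remap idea (hypothetical names and loader, not the repository's actual model.py); the point is that the remap should only run for checkpoints that use the old NGC key layout, which may be why applying it to a locally trained checkpoint breaks inference:

    import torch

    def load_weights(model, path, pretrained=False, pretrained_from_file=None):
        """Hypothetical loader illustrating the hasattr-guarded key remap."""
        state_dict = torch.load(path, map_location="cpu")
        if (pretrained or pretrained_from_file) and hasattr(model, "ngc_checkpoint_remap"):
            # Remap only checkpoints saved with the old (NGC) key names; a
            # checkpoint trained from the current code already matches the
            # module names and must be loaded as-is.
            remap = model.ngc_checkpoint_remap  # hypothetical: maps old key -> new key
            state_dict = {remap(k): v for k, v in state_dict.items()}
        model.load_state_dict(state_dict)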

    Reproducing the issue -

    E.g., run - python ./launch.py --model efficientnet-widese-b4 --precision AMP --mode convergence --platform T4 ./imagenet --workspace ./workspace --raport-file raport.json --pretrained-from-file ./nvidia_efficientnet-widese-b4_210412.pth

    Output -

    ...
    => loading pretrained weights from './nvidia_efficientnet-widese-b4_210412.pth'
    Traceback (most recent call last):
      File "./launch.py", line 53, in <module>
        main(args, model_args, model_arch)
      File "./main.py", line 623, in main
        ) = prepare_for_training(args, model_args, model_arch)
      File "./main.py", line 462, in prepare_for_training
        model = model_arch(
      File "./image_classification/models/model.py", line 138, in __call__
        state_dict = {
      File "./image_classification/models/model.py", line 142, in <dictcomp>
        dict(model.named_modules())[".".join(k.split(".")[:-2])]
    KeyError: 'layer1.block0.se'

    opened by kfirlevari 0
  • [GNMT, NCF, TransformerXL/PyTorch] Run failed

    Describe the bug

    I ran these deep learning examples using the PyTorch NGC Docker image (PyTorch NGC release 'pytorch:21.07-py3'); the device is an NVIDIA RTX 3090 24 GB. All models ran successfully except GNMT, NCF, and Transformer-XL, which fail with the weird errors below (for the two affinity failures, see the sketch after the error logs). I need some help.

    BUG

    • gnmt

    Run the following command:

    python3 -m torch.distributed.launch --nproc_per_node=1 train.py --dataset-dir "/data/gnmt/wmt16_de_en" --train-batch-size "288" --math "fp32" --epochs "2" --seed "2"

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6my8it11/none_hekk6ojr/attempt_1/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 667, in <module>
        main()
      File "train.py", line 388, in main
        affinity = gpu_affinity.set_affinity(
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47637) of binary: /opt/conda/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

    • ncf

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 ncf.py --data "/data/ncf/cache/ml-20m" --epochs "2" --batch_size "2516582" --opt_level "O0"

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_a2r3m8o_/none_h135axnw/attempt_0/0/error.json
    :::NVLOGv0.1.0 ncf 1671438905.572438002 (ncf.py:171) cpu_info: {"num": 24, "name": "AMD EPYC 7773X 64-Core Processor"}
    :::NVLOGv0.1.0 ncf 1671438905.578721285 (ncf.py:171) mem_info: {"ram": "29Gi"}
    :::NVLOGv0.1.0 ncf 1671438905.720378399 (ncf.py:171) gpu_info: {"driver_version": "515.65.01", "num": 1, "name": ["NVIDIA GeForce RTX 3090"], "mem": ["24576 MiB"]}
    :::NVLOGv0.1.0 ncf 1671438905.721916914 (ncf.py:174) args: {"data": "/data/ncf/cache/ml-20m", "epochs": 2, "batch_size": 2516582, "valid_batch_size": 1048576, "factors": 64, "layers": [256, 256, 128, 64], "negative_samples": 4, "learning_rate": 0.0045, "topk": 10, "seed": 1, "threshold": 1.0, "beta1": 0.25, "beta2": 0.5, "eps": 1e-08, "dropout": 0.5, "checkpoint_dir": "/data/checkpoints/", "load_checkpoint_path": null, "mode": "train", "grads_accumulated": 1, "opt_level": "O0", "local_rank": 0, "distributed": false, "world_size": 1}
    Saving results to /data/checkpoints/
    :::NVLOGv0.1.0 ncf 1671438905.722423792 (ncf.py:184) preproc_hp_sample_eval_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722621918 (ncf.py:185) input_hp_sample_train_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722801685 (ncf.py:186) input_step_eval_neg_gen
    :::NVLOGv0.1.0 ncf 1671438906.979937315 (ncf.py:194) run_start
    :::NVLOGv0.1.0 ncf 1671438907.882619858 (ncf.py:201) preproc_hp_num_eval: 100
    :::NVLOGv0.1.0 ncf 1671438907.883869886 (ncf.py:207) input_size: 19861770
    :::NVLOGv0.1.0 ncf 1671438907.905972481 (ncf.py:216) input_batch_size: 2516582
    :::NVLOGv0.1.0 ncf 1671438907.906189203 (ncf.py:217) input_order
    :::NVLOGv0.1.0 ncf 1671438907.906588554 (/workspace/examples/ncf/neumf.py:54) model_hp_mf_dim: 64
    :::NVLOGv0.1.0 ncf 1671438908.116574049 (/workspace/examples/ncf/neumf.py:62) model_hp_mlp_layer_sizes: [256, 256, 128, 64]
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -8) local_rank: 0 (pid: 41723) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group

    • transformer-xl

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 train.py --data "/data/transformer-xl/wikitext-103" --max_step "400" --batch_size "14" --dataset "wt103" --n_layer "16" --d_model "512" --n_head "8" --d_head "64" --d_inner "2048" --dropout "0.1" --dropatt "0.0" --optim "jitlamb" --lr "0.0" --eta_min "0.001" --warmup_step "1000" --tgt_len "192" --mem_len "192" --eval_tgt_len "192" --log_interval "10" --eval_interval "5000" --roll --cuda

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4w5qnet_/none_wit_ptbo/attempt_0/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 1102, in <module>
        main()
      File "train.py", line 690, in main
        affinity = utils.gpu_affinity.set_affinity(
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49668) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
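
    Both the GNMT and Transformer-XL failures come from os.sched_setaffinity raising OSError [Errno 22]; on Linux that error is typically returned when the requested CPU mask contains no core that is present and permitted, which can happen when an affinity list built for a larger (DGX-like) CPU topology is applied on this 24-core host. A minimal defensive sketch (not the repository's gpu_affinity code) that intersects the requested mask with the CPUs actually available:

    import os

    def set_affinity_safely(requested_cpus):
        # os.sched_setaffinity raises OSError [Errno 22] when the mask names no
        # CPU that is present and permitted to the process.
        available = os.sched_getaffinity(0)
        usable = set(requested_cpus) & available
        if not usable:
            # Leave the affinity untouched rather than crashing the worker.
            return available
        os.sched_setaffinity(0, usable)
        return usable

    # Example: a mask meant for a 64-core socket, applied on a smaller machine.
    print(set_affinity_safely(range(32, 64)))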

    Environment

    • Container version: pytorch:21.07-py3
    • GPUs in the system: 1 x NVIDIA RTX 3090 24 GB
    • CUDA driver version: 515.65.01
    bug 
    opened by zengxunli 0
  • [Kaldi/SpeechRecognition] Update Included Notebooks

    Related to Kaldi/SpeechRecognition outdated Jupyter Notebooks

    Examples:

    • Kaldi/SpeechRecognition
    • Jupyter notebooks

    Is your feature request related to a problem? Please describe. The Jupyter notebooks included in Kaldi/SpeechRecognition are outdated and don't work with the new Triton server, because these notebooks use the older tensorrtserver APIs.

    Describe the solution you'd like Updated Jupyter notebooks that are compatible with the current version of Triton Server and use the new tritonclient APIs.

    Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

    Additional context I have tried to use the new tritonclient client by following the examples here, but every time I set up the inputs and send a request for inference, I encounter the following error:

    ---------------------------------------------------------------------------
    _InactiveRpcError                         Traceback (most recent call last)
    <ipython-input-32-2617521bb391> in <module>
    ----> 1 response = grpc_stub.ModelInfer(request)
    
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
        944         state, call, = self._blocking(request, timeout, metadata, credentials,
        945                                       wait_for_ready, compression)
    --> 946         return _end_unary_response_blocking(state, call, False, None)
        947 
        948     def with_call(self,
    
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
        847             return state.response
        848     else:
    --> 849         raise _InactiveRpcError(state)
        850 
        851 
    
    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    	status = StatusCode.UNIMPLEMENTED
    	details = "ModelInfer RPC doesn't support models with decoupled transaction policy"
    	debug_error_string = "{"created":"@1671696095.420704856","description":"Error received from peer ipv4:127.0.0.1:8001","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"ModelInfer RPC doesn't support models with decoupled transaction policy","grpc_status":12}"
    >
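
    The UNIMPLEMENTED status above is consistent with the model being configured with a decoupled transaction policy, which the plain ModelInfer RPC cannot serve; decoupled models have to be driven through the streaming API. A rough sketch of the streaming tritonclient pattern (the model name and tensor names below are placeholders, not the actual Kaldi backend names):

    import queue
    import numpy as np
    import tritonclient.grpc as grpcclient

    responses = queue.Queue()

    def callback(result, error):
        # A decoupled model may send zero, one, or many responses per request.
        responses.put(error if error is not None else result)

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Placeholder tensors -- replace with the names the Kaldi model actually exposes.
    audio = np.zeros((1, 16000), dtype=np.float32)
    inp = grpcclient.InferInput("INPUT__0", list(audio.shape), "FP32")
    inp.set_data_from_numpy(audio)
    out = grpcclient.InferRequestedOutput("OUTPUT__0")

    client.start_stream(callback=callback)
    client.async_stream_infer(model_name="kaldi_online", inputs=[inp], outputs=[out], request_id="0")
    client.stop_stream()  # flush the stream and wait for outstanding callbacks

    print(responses.get())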
    
    enhancement 
    opened by InzamamAnwar 0
  • replace_static.sparsity_with_incubate.asp

    What happened?

    paddle.static.sparsity has been removed and replaced by paddle.incubate.asp; see the PR for details: https://github.com/PaddlePaddle/Paddle/pull/48450

    What did I do?

    Replaced paddle.static.sparsity with paddle.incubate.asp (a generic before/after sketch follows).
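
    A minimal before/after sketch of the substitution (a generic dygraph example for illustration, not one of the files touched by this PR); the sparsity entry points are assumed to keep the same names under the new module path, so check the linked PR for any signature differences:

    import paddle
    from paddle.incubate import asp  # replaces the removed paddle.static.sparsity

    # Before (removed API):
    #   from paddle.static import sparsity
    #   optimizer = sparsity.decorate(optimizer)
    #   sparsity.prune_model(...)
    model = paddle.nn.Linear(64, 32)
    optimizer = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    optimizer = asp.decorate(optimizer)  # assumed equivalent of sparsity.decorate
    asp.prune_model(model)               # assumed equivalent of sparsity.prune_model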

    What did you expect to happen?

    Eliminate the impact of removing paddle.static.sparsity.

    The specification of the pull request

    PR Specification from OSCS

    opened by GGBond8488 0