NVIDIA Deep Learning Examples for Tensor Cores

Introduction

This repository provides state-of-the-art deep learning examples that are easy to train and deploy, achieving the best reproducible accuracy and performance with the NVIDIA CUDA-X software stack running on NVIDIA Volta, Turing, and Ampere GPUs.

NVIDIA GPU Cloud (NGC) Container Registry

These examples, along with our NVIDIA deep learning software stack, are provided in a monthly updated Docker container on the NGC container registry (https://ngc.nvidia.com). These containers include:

  • The latest NVIDIA examples from this repository
  • The latest NVIDIA contributions shared upstream to the respective framework
  • The latest NVIDIA Deep Learning software libraries, such as cuDNN, NCCL, and cuBLAS, which have all been through a rigorous monthly quality-assurance process to ensure that they provide the best possible performance
  • Monthly release notes for each of the NVIDIA optimized containers

Computer Vision

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
ResNet-50 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
SE-ResNeXt-101 | PyTorch | Yes | Yes | Yes | - | Yes | - | Yes | Yes | -
EfficientNet-B0 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-B4 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-WideSE-B0 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet-WideSE-B4 | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes
nnUNet | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
SSD | PyTorch | Yes | Yes | Yes | - | - | - | - | - | Yes
ResNet-50 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
ResNeXt101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
SE-ResNeXt-101 | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
SSD | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
U-Net Ind | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
U-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
U-Net 3D | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
V-Net Med | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
U-Net Med | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
Mask R-CNN | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
EfficientNet | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
ResNet-50 | MXNet | - | Yes | Yes | - | - | - | - | - | -

Natural Language Processing

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
BERT | PyTorch | Yes | Yes | Yes | Yes | - | - | Yes | Yes | -
TransformerXL | PyTorch | Yes | Yes | Yes | Yes | - | - | - | Yes | -
GNMT | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
Transformer | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
ELECTRA | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
BERT | TensorFlow | Yes | Yes | Yes | Yes | Yes | - | Yes | Yes | Yes
BERT | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
BioBert | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | Yes
TransformerXL | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -
GNMT | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -
Faster Transformer | TensorFlow | - | - | - | - | Yes | - | - | - | -

Recommender Systems

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
DLRM | PyTorch | Yes | Yes | Yes | - | - | Yes | Yes | Yes | Yes
DLRM | TensorFlow2 | Yes | Yes | Yes | Yes | - | - | - | Yes | -
NCF | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -
Wide&Deep | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
Wide&Deep | TensorFlow2 | Yes | Yes | Yes | - | - | - | - | Yes | -
NCF | TensorFlow | Yes | Yes | Yes | - | - | - | - | Yes | -
VAE-CF | TensorFlow | Yes | Yes | Yes | - | - | - | - | - | -

Speech to Text

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
Jasper | PyTorch | Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | Yes
Hidden Markov Model | Kaldi | - | - | Yes | - | - | - | Yes | - | -

Text to Speech

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
FastPitch | PyTorch | Yes | Yes | Yes | - | - | - | - | Yes | -
FastSpeech | PyTorch | - | Yes | Yes | - | Yes | - | - | - | -
Tacotron 2 and WaveGlow | PyTorch | Yes | Yes | Yes | - | Yes | Yes | Yes | Yes | -

Graph Neural Networks

Models | Framework | A100 | AMP | Multi-GPU | Multi-Node | TRT | ONNX | Triton | DLC | NB
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
SE(3)-Transformer | PyTorch | Yes | Yes | Yes | - | - | - | - | - | -

NVIDIA support

In each of the network READMEs, we indicate the level of support that will be provided. The range is from ongoing updates and improvements to a point-in-time release for thought leadership.

Glossary

Multinode Training
Supported on a pyxis/enroot Slurm cluster.

Deep Learning Compiler (DLC)
TensorFlow XLA and PyTorch JIT and/or TorchScript

Accelerated Linear Algebra (XLA)
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage.
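
For example, here is a minimal sketch of opting a single function into XLA compilation, assuming a recent TensorFlow 2.x release (older releases used experimental_compile instead of jit_compile); the function and tensor shapes are illustrative only, not taken from any model in this repository:

    import tensorflow as tf

    # Request XLA compilation for this function.
    @tf.function(jit_compile=True)
    def dense_layer(x, w, b):
        return tf.nn.relu(tf.matmul(x, w) + b)

    x = tf.random.normal([8, 128])
    w = tf.random.normal([128, 64])
    b = tf.zeros([64])
    y = dense_layer(x, w, b)  # compiled by XLA on the first call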

PyTorch JIT and/or TorchScript
TorchScript is a way to create serializable and optimizable models from PyTorch code. It is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can then be run in a high-performance environment such as C++.
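
As a minimal sketch (the tiny module below is an illustrative stand-in, not one of the models in this repository), a PyTorch model can be scripted and saved for later deployment:

    import torch
    import torch.nn as nn

    class TinyNet(nn.Module):  # illustrative stand-in for a real model
        def __init__(self):
            super().__init__()
            self.fc = nn.Linear(128, 10)

        def forward(self, x):
            return torch.relu(self.fc(x))

    model = TinyNet().eval()
    scripted = torch.jit.script(model)   # or torch.jit.trace(model, example_input)
    scripted.save("tiny_net.pt")         # can later be loaded from Python or C++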

Automatic Mixed Precision (AMP)
Automatic Mixed Precision (AMP) enables mixed precision training on Volta, Turing, and NVIDIA Ampere GPU architectures automatically.
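
A minimal sketch of one PyTorch training step with AMP (the model, optimizer, and data below are illustrative, and a CUDA-capable GPU is assumed):

    import torch
    import torch.nn as nn

    model = nn.Linear(128, 10).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scaler = torch.cuda.amp.GradScaler()     # scales the loss to avoid FP16 underflow
    loss_fn = nn.CrossEntropyLoss()

    data = torch.randn(32, 128, device="cuda")
    target = torch.randint(0, 10, (32,), device="cuda")

    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in mixed precision
        loss = loss_fn(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()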

TensorFloat-32 (TF32)
TensorFloat-32 (TF32) is a math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
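
In PyTorch, for example, TF32 use can be inspected and toggled explicitly; the defaults differ across framework releases, so treat this only as a sketch:

    import torch

    # TF32 applies to matmuls and cuDNN convolutions on Ampere (and newer) GPUs.
    print(torch.backends.cuda.matmul.allow_tf32)
    print(torch.backends.cudnn.allow_tf32)

    # Opt out to force full FP32 math, e.g. when validating numerics.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False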

Jupyter Notebooks (NB)
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Feedback / Contributions

We're posting these examples on GitHub to better support the community, facilitate feedback, and collect and implement contributions using GitHub Issues and pull requests. We welcome all contributions!

Known issues

In each of the network READMEs, we indicate any known issues and encourage the community to provide feedback.

Comments
  • Do you have pre-trained models to continue training?

    I'm working on Tacotron 2

    I've tried to continue training from the provided checkpoints JoC_Tacotron2_FP32_PyT_20190306 and JoC_WaveGlow_FP32_PyT_20190306, but it didn't work out.

    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.828569651 (/workspace/tacotron2/dllogger/logger.py:279) run_start
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.837591887 (/workspace/tacotron2/dllogger/logger.py:251) cpu_info: {"num": 16, "name": "Intel(R) Xeon(R) CPU @ 2.00GHz"}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.845489264 (/workspace/tacotron2/dllogger/logger.py:251) mem_info: {"ram": "102G"}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.917979240 (/workspace/tacotron2/dllogger/logger.py:251) gpu_info: {"driver_version": "418.87.00", "num": 1, "name": ["Tesla P100-PCIE-16GB"], "mem": ["16280 MiB"]}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.921807289 (/workspace/tacotron2/dllogger/logger.py:251) args: {"output_directory": "./output/", "dataset_path": "./", "model_name": "Tacotron2", "log_file": "./output/nvlog.json", "anneal_steps": ["500", "1000", "1500"], "anneal_factor": 0.1, "epochs": 1501, "epochs_per_checkpoint": 50, "checkpoint_path": "./JoC_Tacotron2_FP32_PyT_20190306", "seed": 1234, "dynamic_loss_scaling": true, "amp_run": true, "cudnn_enabled": true, "cudnn_benchmark": false, "disable_uniform_initialize_bn_weight": false, "use_saved_learning_rate": false, "learning_rate": 0.001, "weight_decay": 1e-06, "grad_clip_thresh": 1.0, "batch_size": 128, "grad_clip": 5.0, "load_mel_from_disk": false, "training_files": "filelists/ljs_audio_text_train_filelist.txt", "validation_files": "filelists/ljs_audio_text_val_filelist.txt", "text_cleaners": ["english_cleaners"], "max_wav_value": 32768.0, "sampling_rate": 22050, "filter_length": 1024, "hop_length": 256, "win_length": 1024, "mel_fmin": 0.0, "mel_fmax": 8000.0, "rank": 0, "world_size": 1, "dist_url": "tcp://localhost:23456", "group_name": "group_name", "dist_backend": "nccl", "mask_padding": false, "n_mel_channels": 80, "n_symbols": 148, "symbols_embedding_dim": 512, "encoder_kernel_size": 5, "encoder_n_convolutions": 3, "encoder_embedding_dim": 512, "n_frames_per_step": 1, "decoder_rnn_dim": 1024, "prenet_dim": 256, "max_decoder_steps": 2000, "gate_threshold": 0.5, "p_attention_dropout": 0.1, "p_decoder_dropout": 0.1, "decoder_no_early_stopping": false, "attention_rnn_dim": 1024, "attention_dim": 128, "attention_location_n_filters": 32, "attention_location_kernel_size": 31, "postnet_embedding_dim": 512, "postnet_kernel_size": 5, "postnet_n_convolutions": 5}
    :::NVLOGv0.2.2 Tacotron2_PyT 1574863255.922522545 (/workspace/tacotron2/dllogger/logger.py:251) run_start
    Selected optimization level O1:  Insert automatic casts around Pytorch functions and Tensor methods.
    
    Defaults for this optimization level are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Processing user overrides (additional kwargs that are not None)...
    After processing overrides, optimization options are:
    enabled                : True
    opt_level              : O1
    cast_model_type        : None
    patch_torch_functions  : True
    keep_batchnorm_fp32    : None
    master_weights         : None
    loss_scale             : dynamic
    Traceback (most recent call last):
      File "train.py", line 501, in <module>
        main()
      File "train.py", line 350, in main
        args.amp_run, args.checkpoint_path)
      File "train.py", line 202, in load_checkpoint
        torch.cuda.set_rng_state_all(checkpoint['cuda_rng_state_all'])
    KeyError: 'cuda_rng_state_all'
    

    I guess these checkpoints were not made for continuing training.

    Do you have pre-trained models that can be used to continue training?

    opened by maloyan 34
  • segmentation fault when running tensorflow op

    ➜  build git:(tf_multihead_attention) ✗ python ../sample/tensorflow/transformer_fp32.py 1 12 32 12 64
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint8 = np.dtype([("qint8", np.int8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:517: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:518: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint16 = np.dtype([("qint16", np.int16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:519: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:520: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:525: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      np_resource = np.dtype([("resource", np.ubyte, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:541: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint8 = np.dtype([("qint8", np.int8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:542: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:543: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint16 = np.dtype([("qint16", np.int16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:544: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:545: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      _np_qint32 = np.dtype([("qint32", np.int32, 1)])
    /home/yongxian.zyx/.pyenv/versions/3.6.2/lib/python3.6/site-packages/tensorboard/compat/tensorflow_stub/dtypes.py:550: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
      np_resource = np.dtype([("resource", np.ubyte, 1)])
    Argumentlist: batch_size 1 num_layers 12 seq_len 32
    WARNING: Logging before flag parsing goes to stderr.
    W0819 10:26:09.628401 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:201: The name tf.variable_scope is deprecated. Please use tf.compat.v1.variable_scope instead.
    
    W0819 10:26:09.628654 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:201: The name tf.AUTO_REUSE is deprecated. Please use tf.compat.v1.AUTO_REUSE instead.
    
    W0819 10:26:09.629366 140044507706432 deprecation.py:323] From ../sample/tensorflow/transformer_fp32.py:109: dense (from tensorflow.python.layers.core) is deprecated and will be removed in a future version.
    Instructions for updating:
    Use keras.layers.dense instead.
    W0819 10:26:10.424225 140044507706432 lazy_loader.py:50]
    The TensorFlow contrib module will not be included in TensorFlow 2.0.
    For more information, please see:
      * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
      * https://github.com/tensorflow/addons
      * https://github.com/tensorflow/io (for I/O related ops)
    If you depend on functionality not listed there, please file an issue.
    
    W0819 10:26:12.424194 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:341: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.
    
    W0819 10:26:12.424426 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:342: The name tf.OptimizerOptions is deprecated. Please use tf.compat.v1.OptimizerOptions instead.
    
    W0819 10:26:12.424533 140044507706432 deprecation_wrapper.py:119] From ../sample/tensorflow/transformer_fp32.py:343: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
    
    2019-08-19 10:26:12.424749: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
    2019-08-19 10:26:12.437915: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
    2019-08-19 10:26:12.639802: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-08-19 10:26:12.641082: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5461d20 executing computations on platform CUDA. Devices:
    2019-08-19 10:26:12.641107: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
    2019-08-19 10:26:12.641115: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (1): Tesla T4, Compute Capability 7.5
    2019-08-19 10:26:12.644194: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2500000000 Hz
    2019-08-19 10:26:12.650989: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5677000 executing computations on platform Host. Devices:
    2019-08-19 10:26:12.651016: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
    2019-08-19 10:26:12.653295: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
    name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
    pciBusID: 0000:5e:00.0
    2019-08-19 10:26:12.653364: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
    2019-08-19 10:26:12.654268: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 1 with properties:
    name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
    pciBusID: 0000:d8:00.0
    2019-08-19 10:26:12.654320: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.0
    2019-08-19 10:26:12.654350: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
    2019-08-19 10:26:12.654457: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654504: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654544: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.654584: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
    2019-08-19 10:26:12.658600: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
    2019-08-19 10:26:12.658624: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
    2019-08-19 10:26:12.658705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-08-19 10:26:12.658715: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 1
    2019-08-19 10:26:12.658723: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N Y
    2019-08-19 10:26:12.658729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 1:   Y N
    2019-08-19 10:26:13.230271: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
    0 layer_0/attention/self/query/kernel:0 (768, 768)
    1 layer_0/attention/self/query/bias:0 (768,)
    2 layer_0/attention/self/key/kernel:0 (768, 768)
    3 layer_0/attention/self/key/bias:0 (768,)
    4 layer_0/attention/self/value/kernel:0 (768, 768)
    5 layer_0/attention/self/value/bias:0 (768,)
    6 layer_0/attention/output/dense/kernel:0 (768, 768)
    7 layer_0/attention/output/dense/bias:0 (768,)
    8 layer_0/attention/output/LayerNorm/beta:0 (768,)
    9 layer_0/attention/output/LayerNorm/gamma:0 (768,)
    10 layer_0/intermediate/dense/kernel:0 (768, 3072)
    ...
    180 layer_11/attention/self/value/kernel:0 (768, 768)
    181 layer_11/attention/self/value/bias:0 (768,)
    182 layer_11/attention/output/dense/kernel:0 (768, 768)
    183 layer_11/attention/output/dense/bias:0 (768,)
    184 layer_11/attention/output/LayerNorm/beta:0 (768,)
    185 layer_11/attention/output/LayerNorm/gamma:0 (768,)
    186 layer_11/intermediate/dense/kernel:0 (768, 3072)
    187 layer_11/intermediate/dense/bias:0 (3072,)
    188 layer_11/output/dense/kernel:0 (3072, 768)
    189 layer_11/output/dense/bias:0 (768,)
    190 layer_11/output/LayerNorm/beta:0 (768,)
    191 layer_11/output/LayerNorm/gamma:0 (768,)
    [1]    63119 segmentation fault (core dumped)  python ../sample/tensorflow/transformer_fp32.py 1 12 32 12 64
    
    opened by duduscript 21
  • Is it possible to train voices and models with fastpitch using CPU only and/or without NVIDIA Container Toolkit/docker? (doubt)

    I am new to these things, and from everything I've read in the repository instructions and on Google over the past few hours, I'm unsure whether it's worth all the trouble of setting up nvidia-docker, Ubuntu, WSL2, and even joining the Microsoft and NVIDIA insider programs, which I have never done before and which seems like a massive hassle to me. I want to be sure it's worth joining all those programs.

    I say this because apparently my new graphics card is not suited for FastPitch/deep learning training; perhaps my CPU would even be faster for FastPitch training?

    NVIDIA GeForce GTX 1060 Super 6GB, Intel Core i9-9900KF

    From what I've seen online, my card unfortunately doesn't have Tensor Cores or enough VRAM for deep learning. So I ask: is there a way to train FastPitch models using only the CPU, without the GPU and all those requirements such as the NVIDIA Container Toolkit, drivers, WSL, etc.? Note that I don't mind if it's slow; I really want to be able to use text to speech with FastPitch, so I wonder whether the CPU could be faster than this card, or whether the card is still usable despite being weak for deep learning.

    opened by cesm1980 20
  • CUDA error

    Encountering a CUDA runtime error:

      File "/workspace/tacotron2/tacotron2/data_function.py", line 148, in batch_to_gpu
        max_len = torch.max(input_lengths.data).item()
    RuntimeError: cuda runtime error (209) : no kernel image is available for execution on the device at /tmp/pip-req-build-akjifb_7/aten/src/THC/generic/THCTensorMathReduce.cu:94

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Sun_Jul_28_19:07:16_PDT_2019
    Cuda compilation tools, release 10.1, V10.1.243

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Tesla K80           Off  | 0000E5F5:00:00.0 Off |                    0 |
    | N/A   50C    P0    59W / 149W |      0MiB / 11441MiB |     97%      Default |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    opened by shradda8 20
  • NanLossDuringTrainingError when training BERT large model

    I have been using the BERT FP16 + XLA implementation for several weeks. It works great for BERT Base model training. Recently I started using it to train the Large model with FP16 + XLA. The training went well until around step 344k, when it hit NanLossDuringTrainingError with the message "Model diverged with loss = NaN.". The error stack with TF 1.13.1 is below. Can you provide some insight into what's wrong? Thanks.

    Model diverged with loss = NaN.
    Error recorded from training_loop: NaN loss during training.
    training_loop marked as finished
    WARNING: Reraising captured error
    Traceback (most recent call last):
      File "run_pretraining.py", line 610, in <module>
      File "/usr/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
        _sys.exit(main(argv))
      File "run_pretraining.py", line 582, in main
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2457, in train
        rendezvous.raise_errors()
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/error_handling.py", line 128, in raise_errors
        six.reraise(typ, value, traceback)
      File "/usr/lib/python2.7/site-packages/tensorflow/contrib/tpu/python/tpu/tpu_estimator.py", line 2452, in train
        saving_listeners=saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
        loss = self._train_model(input_fn, hooks, saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
        return self._train_model_default(input_fn, hooks, saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
        saving_listeners)
      File "/usr/lib/python2.7/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
        _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 676, in run
        run_metadata=run_metadata)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
        run_metadata=run_metadata)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
        raise six.reraise(*original_exc_info)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
        return self._sess.run(*args, **kwargs)
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/monitored_session.py", line 1335, in run
        run_metadata=run_metadata))
      File "/usr/lib/python2.7/site-packages/tensorflow/python/training/basic_session_run_hooks.py", line 753, in after_run
        raise NanLossDuringTrainingError

    opened by LiweiPeng 17
  • When XLA is used, bert nvprof shows no compute and communication overlap

    When I trained the BERT model with Horovod and XLA, XLA significantly improved throughput, but it also significantly decreased scalability.

    Our cluster has 16 nodes with 4 V100 32GB GPUs per node. The nodes are linked with 100 Gb Mellanox RoCE RDMA. No NVLink/NVSwitch technology is used.

    When XLA is not used, the scalability from 1 to 16 GPUs is 1.92x. When XLA is used, the scalability drops to 1.8x from 1 to 16 GPUs.

    Environment:
    Framework: TensorFlow
    Framework version: 1.14
    Horovod version: 0.16.4
    MPI version: 3.1.1
    CUDA version: 10.0
    NCCL version: 2.4.7
    Python version: 2.7.5
    OS and version: CentOS 7.4
    GCC version: 4.8.5

    When XLA is not used, nvprof shows that there is good compute and communication overlap (communication is shown as 'mem' in the attached nvprof screenshot), as expected.

    However, when XLA is used, nvprof shows that there is little compute and communication overlap.

    The question is: what causes this lack of compute/communication overlap when XLA is used? Is this a bug or expected XLA behavior?

    opened by LiweiPeng 16
  • [Tacotron2-Waveglow/PyTorch] tacotron & waveglow trt engine inference ERROR

    Related to Model/Framework(s) [Tacotron2-Waveglow/PyTorch]

    Describe the bug

    Inference with the pre-trained models works fine: python inference.py --tacotron2 nvidia_tacotron2pyt_fp16_20190427 --waveglow nvidia_waveglowpyt_fp16_20190427 -o output/ --include-warmup -i phrases/phrase_1_64.txt --fp16 --log-file=output/nvlog_fp16.json

    Then I tried to run inference with the TRT engines.

    The following steps cover Tacotron 2 + WaveGlow TensorRT inference; the models are from the NGC models repository.

    1. tacotron exports to onnx : python export_tacotron2_onnx.py --tacotron2 nvidia_tacotron2pyt_fp16_20190427 -o exports/ --fp16

    2. waveglow 256 channel exports to onnx : python export_waveglow_onnx.py --waveglow audio/nvidia_waveglow256pyt_fp16 --wn-channels 256 -o exports/ --fp16

    3. tacotron & waveglow exports onnx to trt engine : python export_onnx2trt.py -o audio/ --encoder exports/encoder.onnx --decoder exports/decoder_iter.onnx --postnet exports/postnet.onnx --waveglow exports/waveglow.onnx --fp16

    4. tacotron & waveglow trt engine inference : python inference_trt.py -i test.txt -o audio/ --encoder audio/encoder_fp16.engine --decoder audio/decoder_iter_fp16.engine --postnet audio/postnet_fp16.engine --waveglow audio/waveglow_fp16.engine --fp16

    Then I get the following errors:

    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    [TensorRT] WARNING: TensorRT was linked against cuBLAS 10.2.2 but loaded cuBLAS 10.2.1
    Running Tacotron2 Encoder
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)
    Running Tacotron2 Decoder
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::setBindingDimensions::948, condition: profileMaxDims.d[i] >= dimensions.d[i]
    [TensorRT] ERROR: Parameter check failed at: engine.cpp::resolveSlots::1092, condition: allInputDimensionsSpecified(routine)

    Steps 1-3 export the models successfully, but inference fails.

    Any suggestions about this problem? I tried searching for this error but found very few related issues.

    I would be thankful for any answer.

    bug 
    opened by RaymondTsao 14
  • [BERT/TF] Multi-node SQUAD fine tuning hangs

    Environment

    1. TensorFlow 1.15
    2. Horovod 0.18.1
    3. OpenMPI 4.0
    4. CUDA 10.0
    5. NCCL 2.4.8
    6. Python 3.5

    Issue

    I tried to run multi-node SQUAD fine tuning on two VMs (each with 8 * V100) using the following command:

    mpirun -np 16 -hostfile HOSTFILE -mca plm_rsh_no_tree_spawn 1 -bind-to socket -map-by slot -x NCCL_MIN_NRINGS=4 -x TF_CPP_MIN_LOG_LEVEL=0 -x NCCL_DEBUG=INFO -x -x NCCL_ALGO=Ring -x NCCL_BUFFSIZE=8388608 -mca pml ob1 -mca btl ^openib -mca btl_tcp_if_exclude lo,docker0 python3 DeepLearningExamples/TensorFlow/LanguageModeling/BERT/run_squad.py --vocab_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/vocab.txt --bert_config_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_config.json --init_checkpoint=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/google_pretrained_weights/uncased_L-24_H-1024_A-16/bert_model.ckpt --do_train=True --train_file=DeepLearningExamples/TensorFlow/LanguageModeling/BERT/data/download/squad/v1.1/train-v1.1.json --train_batch_size=2 --learning_rate=5e-6 --num_train_epochs=1 --max_seq_length=128 --doc_stride=128 --output_dir=/tmp/pkb --horovod --use_fp16
    

    However, the master rank (rank 0) hangs with the following message (actually, the iterations were not completely stalled; they just progressed very slowly):

    2020-01-23 23:25:03.863322: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:03.928604: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:03.940879: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.633817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.872561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:04.924935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:05.165986: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    2020-01-23 23:25:05.313782: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    [2020-01-23 23:26:23. 26353: W horovod/common/stall_inspector.cc:105] One or more tensors were submitted to be reduced, gathered or broadcasted by
     subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit diff
    erent tensors or that only subset of ranks is submitting tensors, which will cause deadlock. 
    Stalled ranks:
    8: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeigh
    tDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allreduc
    e/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Distr
    ibutedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOpt
    imizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_101
    2_0 ...]
    9: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeigh
    tDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allreduc
    e/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Distr
    ibutedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOpt
    imizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_101
    2_0 ...]
    10: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    11: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    12: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    13: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    14: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    15: [DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1000_0, DistributedAdamWeig
    htDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1003_0, DistributedAdamWeightDecayOptimizer_Allredu
    ce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_1006_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_Dist
    ributedAdamWeightDecayOptimizer_Allreduce_Cast_1009_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOp
    timizer_Allreduce_Cast_100_0, DistributedAdamWeightDecayOptimizer_Allreduce/HorovodAllreduce_DistributedAdamWeightDecayOptimizer_Allreduce_Cast_10
    12_0 ...]
    

    I also tested on a single node with 8 GPUs; it worked without any issues.

    bug 
    opened by changlan 13
  • [FastPitch 1.1/PyTorch] Advice/best practices for good alignment when fine-tuning

    Related to FastPitch 1.1/PyTorch

    Hi @alancucki @rafaelvalle. I've been experimenting with the FastPitch 1.1 update that incorporates the RAD-TTS aligner into FastPitch, since the commit a while back. The alignment mechanism is amazing and enables some great things (e.g. s2s).

    However, I'm unfortunately having some issues with the convergence on some datasets with this approach, compared to 1.0. I've successfully converged an LJSpeech model quite well, but when fine-tuning a pre-trained model (like the one provided), it seems that the alignment is having some real trouble converging.

    I have tried it on 4 datasets so far - 2 male, 2 female. I did get something ALMOST good with one of the female datasets (~9h), but they all converge to a fairly high KL loss (>=0.85 at 40k-120k its, compared to less than 0.35 where I stopped LJ at 25k its). I added soft and hard alignment plots to the logs, and they resemble plots c) and d) from Fig 2 in the RAD-TTS paper. I noticed also in Figure 2 from "One TTS Alignment To Rule Them All", that the convergence speed was lower using RAD-TTS with FastPitch (compared to Tacotron2 durs), before they arrived at a similar point - could this be exacerbated by smaller datasets?

    I have tried experimenting with resuming the LJ optimizer (from the model I trained myself) as well as one newly initialized (from the provided LJ model), with and without including the KL weight warm-up stage. I also tried with and without arpabet, with and without energy conditioning, and also several tweaks to lr scheduling, and other such things, but I can never get anything as good as LJ (in the KL loss at least).

    When running inference, the sentence composition quality varies between datasets, ranging from missing letters to missing words, and for the smaller datasets, quite difficult to understand speech, spoken very fast.

    The same datasets worked very well in the previous Tacotron2+FastPitch set-up, so I'm confident that the data quality is high. Have you by any chance had any successes yourselves with something other than LJ? And would you have any tips/advice for how to better converge the alignment on smaller datasets (with transfer learning)?

    Thank you for all your great work!

    bug 
    opened by DanRuta 12
  • I got "killed" when creating instances from sharded files

    Related to Pytorch/LanguageModeling

    I followed the instructions and preprocessed the downloaded 'bookscorpus' dataset. The sharding step went well, but in the 'create_hdf5_files' step I got "killed" when creating instances from the sharded files, for both the sequence-length-128 and sequence-length-512 cases. Since there is no more information about the error, I have no idea how to fix it. Could you help me out? Thanks a lot.

    opened by TonyTangYu 12
  • Why The Test Result of Transformer NMT Task with 4 GPUs Is Worse Than What Is Reported in Readme

    According to the README, 4 GPUs can achieve a BLEU of 28.35, and even 28.67 when training for more epochs.

    GPU count | Mixed precision BLEU | fp32 BLEU | Mixed precision training time | fp32 training time
    -- | -- | -- | -- | --
    8 | 28.69 | 28.43 | 446 min | 1896 min
    4 | 28.35 | 28.31 | 834 min | 3733 min

    GPU count | Precision | BLEU score | Epochs to train | Training time
    -- | -- | -- | -- | --
    4 | fp16 | 28.67 | 74 | 1925 min
    4 | fp32 | 28.40 | 47 | 5478 min

    However, I ran the code with 4 GPUs without modifying it at all, and the best result I got is 27.63 on my "checkpoint_best.pt", which corresponds to epoch 19 in my case. I ran 80 epochs in total, and the best BLEU over all those epochs is 28.13, from a checkpoint that was not selected as "checkpoint_best.pt" during validation.

    I used the following command line to train the model:

    nohup python -m torch.distributed.launch --nproc_per_node 4 /workspace/translation/train.py /workspace/data-bin/wmt14_en_de_joined_dict \
    --arch transformer_wmt_en_de_big_t2t \
    --share-all-embeddings \
    --optimizer adam \
    --adam-betas '(0.9, 0.997)' \
    --adam-eps "1e-9" \
    --clip-norm 0.0 \
    --lr-scheduler inverse_sqrt \
    --warmup-init-lr 0.0 \
    --update-freq 2 \
    --warmup-updates 8000 \
    --lr 0.0006 \
    --min-lr 0.0 \
    --dropout 0.1 \
    --weight-decay 0.0 \
    --criterion label_smoothed_cross_entropy \
    --label-smoothing 0.1 \
    --max-tokens 5120 \
    --seed 1 \
    --max-epoch 80 \
    --ignore-case \
    --fp16 \
    --save-dir /workspace/checkpoints \
    --distributed-init-method env:// > train.nohup.out &

    I also tried different warmup-updates and learning-rate values, and the results are similar. The results I got are:

    Test Checkpoint1 | Translated 3003 sentences (84994 tokens) in 25.2s (119.35 sentences/s, 3377.84 tokens/s) | Generate test with beam=4: BLEU4 = 18.11, 50.2/23.5/12.7/7.2 (BP=1.000, ratio=1.041, syslen=67147, reflen=64512) Test Checkpoint2 | Translated 3003 sentences (87704 tokens) in 27.5s (109.17 sentences/s, 3188.43 tokens/s) | Generate test with beam=4: BLEU4 = 21.26, 52.5/26.7/15.5/9.4 (BP=1.000, ratio=1.061, syslen=68450, reflen=64512) Test Checkpoint3 | Translated 3003 sentences (86611 tokens) in 25.8s (116.61 sentences/s, 3363.17 tokens/s) | Generate test with beam=4: BLEU4 = 23.91, 55.5/29.5/17.8/11.2 (BP=1.000, ratio=1.040, syslen=67079, reflen=64512) Test Checkpoint4 | Translated 3003 sentences (86518 tokens) in 25.8s (116.61 sentences/s, 3359.54 tokens/s) | Generate test with beam=4: BLEU4 = 25.26, 56.7/30.9/19.0/12.3 (BP=1.000, ratio=1.035, syslen=66758, reflen=64512) Test Checkpoint5 | Translated 3003 sentences (86768 tokens) in 25.7s (116.96 sentences/s, 3379.47 tokens/s) | Generate test with beam=4: BLEU4 = 25.63, 56.8/31.2/19.4/12.5 (BP=1.000, ratio=1.034, syslen=66698, reflen=64512) Test Checkpoint6 | Translated 3003 sentences (87220 tokens) in 25.8s (116.21 sentences/s, 3375.30 tokens/s) | Generate test with beam=4: BLEU4 = 25.98, 56.9/31.5/19.8/12.9 (BP=1.000, ratio=1.042, syslen=67205, reflen=64512) Test Checkpoint7 | Translated 3003 sentences (87715 tokens) in 25.9s (115.80 sentences/s, 3382.54 tokens/s) | Generate test with beam=4: BLEU4 = 26.24, 57.2/31.8/20.0/13.0 (BP=1.000, ratio=1.045, syslen=67413, reflen=64512) Test Checkpoint8 | Translated 3003 sentences (87808 tokens) in 26.8s (111.88 sentences/s, 3271.39 tokens/s) | Generate test with beam=4: BLEU4 = 26.82, 57.6/32.3/20.5/13.6 (BP=1.000, ratio=1.045, syslen=67444, reflen=64512) Test Checkpoint9 | Translated 3003 sentences (87394 tokens) in 25.6s (117.26 sentences/s, 3412.38 tokens/s) | Generate test with beam=4: BLEU4 = 26.63, 57.8/32.2/20.3/13.3 (BP=1.000, ratio=1.039, syslen=67033, reflen=64512) Test Checkpoint10 | Translated 3003 sentences (86825 tokens) in 25.8s (116.31 sentences/s, 3362.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.10, 58.1/32.7/20.7/13.7 (BP=1.000, ratio=1.031, syslen=66541, reflen=64512) Test Checkpoint11 | Translated 3003 sentences (86850 tokens) in 25.9s (116.11 sentences/s, 3358.03 tokens/s) | Generate test with beam=4: BLEU4 = 27.29, 58.1/32.8/20.9/13.9 (BP=1.000, ratio=1.032, syslen=66563, reflen=64512) Test Checkpoint12 | Translated 3003 sentences (87137 tokens) in 26.2s (114.74 sentences/s, 3329.31 tokens/s) | Generate test with beam=4: BLEU4 = 27.28, 58.2/32.9/20.9/13.8 (BP=1.000, ratio=1.035, syslen=66787, reflen=64512) Test Checkpoint13 | Translated 3003 sentences (86810 tokens) in 25.6s (117.41 sentences/s, 3393.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.26, 58.3/32.9/20.9/13.8 (BP=1.000, ratio=1.031, syslen=66500, reflen=64512) Test Checkpoint14 | Translated 3003 sentences (87359 tokens) in 25.8s (116.30 sentences/s, 3383.15 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.3/33.2/21.3/14.3 (BP=1.000, ratio=1.036, syslen=66830, reflen=64512) Test Checkpoint15 | Translated 3003 sentences (87415 tokens) in 26.3s (114.33 sentences/s, 3327.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.37, 58.1/32.9/21.0/14.0 (BP=1.000, ratio=1.038, syslen=66951, reflen=64512) Test Checkpoint16 | Translated 3003 sentences (87332 tokens) in 26.7s (112.51 sentences/s, 3272.10 tokens/s) | Generate test with beam=4: BLEU4 = 27.33, 58.1/32.9/21.0/13.9 (BP=1.000, 
ratio=1.039, syslen=66998, reflen=64512) Test Checkpoint17 | Translated 3003 sentences (86721 tokens) in 25.9s (116.06 sentences/s, 3351.62 tokens/s) | Generate test with beam=4: BLEU4 = 27.32, 58.4/33.0/20.9/13.8 (BP=1.000, ratio=1.029, syslen=66385, reflen=64512) Test Checkpoint18 | Translated 3003 sentences (87388 tokens) in 26.2s (114.71 sentences/s, 3338.08 tokens/s) | Generate test with beam=4: BLEU4 = 27.57, 58.3/33.1/21.2/14.2 (BP=1.000, ratio=1.038, syslen=66956, reflen=64512) Test Checkpoint19 | Translated 3003 sentences (86919 tokens) in 25.8s (116.28 sentences/s, 3365.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.3/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66642, reflen=64512) Test Checkpoint20 | Translated 3003 sentences (87485 tokens) in 26.1s (115.24 sentences/s, 3357.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.48, 58.1/33.0/21.1/14.1 (BP=1.000, ratio=1.037, syslen=66924, reflen=64512) Test Checkpoint21 | Translated 3003 sentences (86993 tokens) in 26.3s (114.07 sentences/s, 3304.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.5/33.3/21.4/14.3 (BP=1.000, ratio=1.032, syslen=66564, reflen=64512) Test Checkpoint22 | Translated 3003 sentences (87084 tokens) in 25.4s (118.07 sentences/s, 3424.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.87, 58.6/33.3/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66595, reflen=64512) Test Checkpoint23 | Translated 3003 sentences (87013 tokens) in 26.4s (113.92 sentences/s, 3300.98 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.4/33.2/21.2/14.1 (BP=1.000, ratio=1.033, syslen=66626, reflen=64512) Test Checkpoint24 | Translated 3003 sentences (86741 tokens) in 26.0s (115.49 sentences/s, 3335.84 tokens/s) | Generate test with beam=4: BLEU4 = 27.98, 58.7/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66379, reflen=64512) Test Checkpoint25 | Translated 3003 sentences (86884 tokens) in 25.4s (118.05 sentences/s, 3415.42 tokens/s) | Generate test with beam=4: BLEU4 = 27.94, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66392, reflen=64512) Test Checkpoint26 | Translated 3003 sentences (86840 tokens) in 26.4s (113.68 sentences/s, 3287.46 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.7/33.5/21.5/14.4 (BP=1.000, ratio=1.028, syslen=66344, reflen=64512) Test Checkpoint27 | Translated 3003 sentences (87050 tokens) in 26.2s (114.45 sentences/s, 3317.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.030, syslen=66451, reflen=64512) Test Checkpoint28 | Translated 3003 sentences (86981 tokens) in 25.8s (116.40 sentences/s, 3371.53 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66488, reflen=64512) Test Checkpoint29 | Translated 3003 sentences (86219 tokens) in 25.6s (117.33 sentences/s, 3368.59 tokens/s) | Generate test with beam=4: BLEU4 = 27.82, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.022, syslen=65941, reflen=64512) Test Checkpoint30 | Translated 3003 sentences (86879 tokens) in 26.9s (111.61 sentences/s, 3229.04 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint31 | Translated 3003 sentences (87082 tokens) in 26.6s (112.83 sentences/s, 3271.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.6/21.6/14.4 (BP=1.000, ratio=1.032, syslen=66570, reflen=64512) Test Checkpoint32 | Translated 3003 sentences (86677 tokens) in 26.6s (112.93 sentences/s, 3259.43 tokens/s) | Generate test with beam=4: 
BLEU4 = 27.98, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.028, syslen=66289, reflen=64512) Test Checkpoint33 | Translated 3003 sentences (87034 tokens) in 26.2s (114.54 sentences/s, 3319.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.032, syslen=66553, reflen=64512) Test Checkpoint34 | Translated 3003 sentences (87064 tokens) in 26.3s (114.28 sentences/s, 3313.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.4/33.3/21.6/14.4 (BP=1.000, ratio=1.031, syslen=66534, reflen=64512) Test Checkpoint35 | Translated 3003 sentences (86818 tokens) in 26.6s (112.86 sentences/s, 3262.78 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 58.9/33.7/21.7/14.5 (BP=1.000, ratio=1.028, syslen=66336, reflen=64512) Test Checkpoint36 | Translated 3003 sentences (87037 tokens) in 25.9s (115.89 sentences/s, 3358.98 tokens/s) | Generate test with beam=4: BLEU4 = 28.18, 58.8/33.6/21.8/14.6 (BP=1.000, ratio=1.031, syslen=66483, reflen=64512) Test Checkpoint37 | Translated 3003 sentences (86740 tokens) in 25.7s (116.91 sentences/s, 3376.92 tokens/s) | Generate test with beam=4: BLEU4 = 28.19, 58.9/33.7/21.8/14.6 (BP=1.000, ratio=1.026, syslen=66197, reflen=64512) Test Checkpoint38 | Translated 3003 sentences (87084 tokens) in 26.1s (115.05 sentences/s, 3336.24 tokens/s) | Generate test with beam=4: BLEU4 = 28.01, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.032, syslen=66551, reflen=64512) Test Checkpoint39 | Translated 3003 sentences (86972 tokens) in 27.7s (108.47 sentences/s, 3141.58 tokens/s) | Generate test with beam=4: BLEU4 = 28.10, 58.7/33.5/21.7/14.6 (BP=1.000, ratio=1.030, syslen=66456, reflen=64512) Test Checkpoint40 | Translated 3003 sentences (86717 tokens) in 25.7s (116.94 sentences/s, 3376.78 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.7/33.4/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint41 | Translated 3003 sentences (86542 tokens) in 26.0s (115.52 sentences/s, 3329.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.9/33.3/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66127, reflen=64512) Test Checkpoint42 | Translated 3003 sentences (86841 tokens) in 27.1s (110.96 sentences/s, 3208.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.99, 58.7/33.5/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66329, reflen=64512) Test Checkpoint43 | Translated 3003 sentences (86986 tokens) in 26.8s (111.92 sentences/s, 3241.95 tokens/s) | Generate test with beam=4: BLEU4 = 27.81, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.031, syslen=66501, reflen=64512) Test Checkpoint44 | Translated 3003 sentences (86691 tokens) in 25.6s (117.24 sentences/s, 3384.53 tokens/s) | Generate test with beam=4: BLEU4 = 28.09, 58.8/33.6/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66162, reflen=64512) Test Checkpoint45 | Translated 3003 sentences (86845 tokens) in 26.5s (113.44 sentences/s, 3280.52 tokens/s) | Generate test with beam=4: BLEU4 = 28.00, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint46 | Translated 3003 sentences (86280 tokens) in 25.7s (116.75 sentences/s, 3354.46 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 59.0/33.6/21.7/14.6 (BP=1.000, ratio=1.021, syslen=65860, reflen=64512) Test Checkpoint47 | Translated 3003 sentences (86857 tokens) in 26.4s (113.64 sentences/s, 3286.92 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.6/33.3/21.4/14.3 (BP=1.000, ratio=1.029, syslen=66402, reflen=64512) Test Checkpoint48 | Translated 3003 sentences (87087 tokens) in 26.0s (115.65 sentences/s, 
3353.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.68, 58.4/33.2/21.3/14.2 (BP=1.000, ratio=1.032, syslen=66576, reflen=64512) Test Checkpoint49 | Translated 3003 sentences (86627 tokens) in 25.5s (117.97 sentences/s, 3402.95 tokens/s) | Generate test with beam=4: BLEU4 = 28.02, 59.0/33.6/21.6/14.4 (BP=1.000, ratio=1.026, syslen=66208, reflen=64512) Test Checkpoint50 | Translated 3003 sentences (86529 tokens) in 25.9s (116.09 sentences/s, 3345.07 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66049, reflen=64512) Test Checkpoint51 | Translated 3003 sentences (87095 tokens) in 26.2s (114.50 sentences/s, 3320.73 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.6/33.4/21.4/14.3 (BP=1.000, ratio=1.030, syslen=66471, reflen=64512) Test Checkpoint52 | Translated 3003 sentences (87160 tokens) in 27.2s (110.54 sentences/s, 3208.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.89, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.032, syslen=66559, reflen=64512) Test Checkpoint53 | Translated 3003 sentences (86909 tokens) in 26.1s (114.96 sentences/s, 3326.93 tokens/s) | Generate test with beam=4: BLEU4 = 27.90, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.029, syslen=66353, reflen=64512) Test Checkpoint54 | Translated 3003 sentences (86785 tokens) in 26.1s (114.94 sentences/s, 3321.61 tokens/s) | Generate test with beam=4: BLEU4 = 28.05, 58.8/33.6/21.6/14.5 (BP=1.000, ratio=1.028, syslen=66308, reflen=64512) Test Checkpoint55 | Translated 3003 sentences (86914 tokens) in 25.9s (115.95 sentences/s, 3355.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.76, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.029, syslen=66376, reflen=64512) Test Checkpoint56 | Translated 3003 sentences (86775 tokens) in 26.5s (113.27 sentences/s, 3273.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.75, 58.5/33.2/21.4/14.3 (BP=1.000, ratio=1.028, syslen=66314, reflen=64512) Test Checkpoint57 | Translated 3003 sentences (86522 tokens) in 26.3s (114.39 sentences/s, 3295.88 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.9/33.4/21.5/14.3 (BP=1.000, ratio=1.024, syslen=66052, reflen=64512) Test Checkpoint58 | Translated 3003 sentences (86269 tokens) in 26.1s (114.94 sentences/s, 3301.85 tokens/s) | Generate test with beam=4: BLEU4 = 27.77, 58.7/33.3/21.4/14.2 (BP=1.000, ratio=1.021, syslen=65893, reflen=64512) Test Checkpoint59 | Translated 3003 sentences (86738 tokens) in 25.9s (115.78 sentences/s, 3344.27 tokens/s) | Generate test with beam=4: BLEU4 = 27.96, 58.5/33.4/21.6/14.5 (BP=1.000, ratio=1.029, syslen=66378, reflen=64512) Test Checkpoint60 | Translated 3003 sentences (86566 tokens) in 25.7s (116.92 sentences/s, 3370.48 tokens/s) | Generate test with beam=4: BLEU4 = 27.85, 58.7/33.4/21.5/14.3 (BP=1.000, ratio=1.025, syslen=66151, reflen=64512) Test Checkpoint61 | Translated 3003 sentences (86785 tokens) in 25.3s (118.91 sentences/s, 3436.47 tokens/s) | Generate test with beam=4: BLEU4 = 27.74, 58.7/33.3/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66291, reflen=64512) Test Checkpoint62 | Translated 3003 sentences (86261 tokens) in 25.7s (116.79 sentences/s, 3354.79 tokens/s) | Generate test with beam=4: BLEU4 = 27.86, 58.8/33.4/21.5/14.3 (BP=1.000, ratio=1.021, syslen=65898, reflen=64512) Test Checkpoint63 | Translated 3003 sentences (86569 tokens) in 25.1s (119.58 sentences/s, 3447.32 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.025, syslen=66155, reflen=64512) Test Checkpoint64 | Translated 3003 sentences 
(86583 tokens) in 25.8s (116.47 sentences/s, 3357.96 tokens/s) | Generate test with beam=4: BLEU4 = 27.59, 58.5/33.2/21.2/14.1 (BP=1.000, ratio=1.025, syslen=66146, reflen=64512) Test Checkpoint65 | Translated 3003 sentences (86707 tokens) in 26.2s (114.76 sentences/s, 3313.64 tokens/s) | Generate test with beam=4: BLEU4 = 27.78, 58.5/33.3/21.4/14.2 (BP=1.000, ratio=1.028, syslen=66294, reflen=64512) Test Checkpoint66 | Translated 3003 sentences (86478 tokens) in 26.0s (115.55 sentences/s, 3327.54 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.025, syslen=66114, reflen=64512) Test Checkpoint67 | Translated 3003 sentences (86564 tokens) in 25.8s (116.40 sentences/s, 3355.20 tokens/s) | Generate test with beam=4: BLEU4 = 27.92, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.026, syslen=66200, reflen=64512) Test Checkpoint68 | Translated 3003 sentences (86548 tokens) in 26.2s (114.58 sentences/s, 3302.20 tokens/s) | Generate test with beam=4: BLEU4 = 28.08, 58.8/33.6/21.7/14.5 (BP=1.000, ratio=1.024, syslen=66041, reflen=64512) Test Checkpoint69 | Translated 3003 sentences (86580 tokens) in 25.9s (116.08 sentences/s, 3346.72 tokens/s) | Generate test with beam=4: BLEU4 = 28.13, 58.8/33.7/21.7/14.6 (BP=1.000, ratio=1.026, syslen=66178, reflen=64512) Test Checkpoint70 | Translated 3003 sentences (86448 tokens) in 26.1s (115.01 sentences/s, 3310.94 tokens/s) | Generate test with beam=4: BLEU4 = 27.88, 58.8/33.5/21.5/14.3 (BP=1.000, ratio=1.023, syslen=65998, reflen=64512) Test Checkpoint71 | Translated 3003 sentences (86832 tokens) in 26.0s (115.69 sentences/s, 3345.26 tokens/s) | Generate test with beam=4: BLEU4 = 27.91, 58.6/33.4/21.5/14.4 (BP=1.000, ratio=1.029, syslen=66355, reflen=64512) Test Checkpoint72 | Translated 3003 sentences (86550 tokens) in 25.6s (117.18 sentences/s, 3377.25 tokens/s) | Generate test with beam=4: BLEU4 = 27.95, 58.8/33.5/21.5/14.4 (BP=1.000, ratio=1.024, syslen=66092, reflen=64512) Test Checkpoint73 | Translated 3003 sentences (86415 tokens) in 25.4s (118.17 sentences/s, 3400.41 tokens/s) | Generate test with beam=4: BLEU4 = 27.84, 58.8/33.4/21.4/14.3 (BP=1.000, ratio=1.023, syslen=65990, reflen=64512) Test Checkpoint74 | Translated 3003 sentences (86251 tokens) in 26.2s (114.65 sentences/s, 3292.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.97, 58.8/33.5/21.6/14.4 (BP=1.000, ratio=1.021, syslen=65889, reflen=64512) Test Checkpoint75 | Translated 3003 sentences (86418 tokens) in 26.1s (115.03 sentences/s, 3310.16 tokens/s) | Generate test with beam=4: BLEU4 = 27.72, 58.6/33.2/21.3/14.2 (BP=1.000, ratio=1.023, syslen=65971, reflen=64512) Test Checkpoint76 | Translated 3003 sentences (86474 tokens) in 25.9s (116.04 sentences/s, 3341.50 tokens/s) | Generate test with beam=4: BLEU4 = 27.63, 58.6/33.2/21.2/14.1 (BP=1.000, ratio=1.023, syslen=66025, reflen=64512) Test Checkpoint77 | Translated 3003 sentences (86100 tokens) in 25.6s (117.20 sentences/s, 3360.35 tokens/s) | Generate test with beam=4: BLEU4 = 28.11, 59.1/33.7/21.7/14.5 (BP=1.000, ratio=1.018, syslen=65695, reflen=64512) Test Checkpoint78 | Translated 3003 sentences (86497 tokens) in 26.2s (114.53 sentences/s, 3298.82 tokens/s) | Generate test with beam=4: BLEU4 = 27.80, 58.7/33.4/21.4/14.3 (BP=1.000, ratio=1.024, syslen=66073, reflen=64512) Test Checkpoint79 | Translated 3003 sentences (86905 tokens) in 26.3s (114.22 sentences/s, 3305.35 tokens/s) | Generate test with beam=4: BLEU4 = 27.69, 58.5/33.2/21.3/14.2 (BP=1.000, ratio=1.028, syslen=66327, reflen=64512) 
Test Checkpoint80 | Translated 3003 sentences (86654 tokens) in 26.3s (114.36 sentences/s, 3300.06 tokens/s) | Generate test with beam=4: BLEU4 = 27.65, 58.5/33.2/21.3/14.1 (BP=1.000, ratio=1.026, syslen=66219, reflen=64512)

    So why am I not able to achieve the results reported in the README? Could you tell me the command line you use to run the Transformer on 4 GPUs?

    Another question: the "Attention Is All You Need" paper uses 0.1 as the initial learning rate, whereas 0.0006 is used here. Why is there such a large difference in learning rate?
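
    For context, the learning rate in "Attention Is All You Need" is set by an inverse-square-root warmup schedule (Section 5.3) rather than a single constant, and its peak value is of the same order as the 0.0006 used here. A minimal sketch of that schedule, assuming the paper's base-model settings (d_model=512, warmup_steps=4000):

    # Inverse-square-root schedule from "Attention Is All You Need" (Section 5.3).
    # Values below assume the base model: d_model=512, warmup_steps=4000.
    def transformer_lr(step, d_model=512, warmup_steps=4000):
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

    for step in (100, 1000, 4000, 20000, 100000):
        print(f"step {step:>6d}: lr = {transformer_lr(step):.6f}")
    # The peak value (at step 4000) is about 0.0007, close to the 0.0006 used in this repository.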

    opened by yaoyiran 12
  • DeepLearningExamples/MxNet/Classification/RN50v1.5 --> prepare_imagenet.sh

    Hi Team,

    When I try to run prepare_imagenet.sh, nothing happens: it keeps running with no output and no I/O to the disk. I downloaded the 150 GB ImageNet dataset and extracted it with tar -xvzf imagenet.tar.

    Below is the folder structure I got:

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/
    total 162373888
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ./
    drwxr-xr-x 3 root root           44 Jan  3 07:44 ../
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ILSVRC/
    -rw-r--r-- 1 root root 166022728827 Jan  2 03:09 ILSVRC2017_CLS-LOC.tar.gz
    drwxrwxr-x 5 root root         4096 Feb  9  2015 tiny-imagenet-200/
    -rw-r--r-- 1 root root    248100043 Jan  2 10:00 tiny-imagenet-200.zip

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/
    total 20
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ./
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ../
    drwxr-xr-x 3 root root 4096 Jan  2 04:49 Annotations/
    drwxr-xr-x 3 root root 4096 Jan  2 06:14 Data/
    drwxr-xr-x 3 root root 4096 Jan  2 10:15 ImageSets/

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/
    total 12
    drwxr-xr-x 3 root   root 4096 Jan  2 06:14 ./
    drwxr-xr-x 5 root   root 4096 Jan  2 10:15 ../
    drwxr-xr-x 6 200031 1003 4096 Jan  3 06:21 CLS-LOC/

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/
    total 11808
    drwxr-xr-x    6 200031 1003    4096 Jan  3 06:21 ./
    drwxr-xr-x    3 root   root    4096 Jan  2 06:14 ../
    drwxr-xr-x    2 root   root    4096 Jan  3 06:58 out/
    drwxr-xr-x    2 200031 1003 7979008 May 17  2015 test/
    drwxr-xr-x 1002 200031 1003   65536 Sep 29  2014 train/
    drwxr-xr-x    2 200031 1003 4014080 May 17  2015 val/
    

    Below is the command for data pre-processing

    root@ddb5e2b7ceaa:/workspace/rn50# ./scripts/prepare_imagenet.sh /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/ /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/

    ^CTraceback (most recent call last):
      File "/opt/mxnet/tools/im2rec.py", line 329, in <module>
        make_list(args)
      File "/opt/mxnet/tools/im2rec.py", line 100, in make_list
        image_list = list(image_list)
      File "/opt/mxnet/tools/im2rec.py", line 60, in list_image
        if os.path.isfile(fpath) and (suffix in exts):
      File "/usr/lib/python3.8/genericpath.py", line 30, in isfile
        st = os.stat(path)
    KeyboardInterrupt
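
    For what it's worth, the interrupted frames above are still inside im2rec's file-listing pass. A rough sketch of what that pass does (an approximation for illustration, not the actual /opt/mxnet/tools/im2rec.py): it walks every entry under the given root before writing anything, so pointing it at a directory that also contains the 166 GB ILSVRC2017_CLS-LOC.tar.gz, the test set, and tiny-imagenet can make it look hung even while it is working.

    import os

    # Rough approximation of im2rec's listing step: walk everything under the
    # root and keep only files with recognised image extensions. Nothing is
    # written to disk until this listing finishes.
    def count_images(root, exts=(".jpg", ".jpeg", ".png")):
        count = 0
        for _dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                if os.path.splitext(name)[1].lower() in exts:
                    count += 1
        return count

    print(count_images("/data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC"))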
    
    opened by karanveersingh5623 0
  • Failed to use --pretrained-from-file due to KeyError: 'layer1.block0.se' error

    It seems like the change in https://github.com/NVIDIA/DeepLearningExamples/commit/5843f4e5a1220dd98936e477d8597d8a77320666 didn't consider the case of --pretrained_from_file in this line: https://github.com/NVIDIA/DeepLearningExamples/blob/ca5ae20e3d1af3464159754f758768052c41c607/PyTorch/Classification/ConvNets/image_classification/models/model.py#L123, which results in a failure to load the model file.

    The loading problem can be resolved by changing that line to

    if (pretrained or pretrained_from_file) and hasattr(model, "ngc_checkpoint_remap"):

    But this suggested addition causes inference to fail (e.g., when it is executed with a model I trained from scratch), so I'm probably missing something.
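
    For reference, a minimal sketch of the guarded-remap idea (hypothetical names and loader, not the repository's actual model.py); the point is that the remap should only run for checkpoints that use the old NGC key layout, which may be why applying it to a locally trained checkpoint breaks inference:

    import torch

    def load_weights(model, path, pretrained=False, pretrained_from_file=None):
        """Hypothetical loader illustrating the hasattr-guarded key remap."""
        state_dict = torch.load(path, map_location="cpu")
        if (pretrained or pretrained_from_file) and hasattr(model, "ngc_checkpoint_remap"):
            # Remap only checkpoints saved with the old (NGC) key names; a
            # checkpoint trained from the current code already matches the
            # module names and must be loaded as-is.
            remap = model.ngc_checkpoint_remap  # hypothetical: maps old key -> new key
            state_dict = {remap(k): v for k, v in state_dict.items()}
        model.load_state_dict(state_dict)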

    Reproducing the issue -

    E.g., run - python ./launch.py --model efficientnet-widese-b4 --precision AMP --mode convergence --platform T4 ./imagenet --workspace ./workspace --raport-file raport.json --pretrained-from-file ./nvidia_efficientnet-widese-b4_210412.pth

    Output -

    ...
    => loading pretrained weights from './nvidia_efficientnet-widese-b4_210412.pth'
    Traceback (most recent call last):
      File "./launch.py", line 53, in <module>
        main(args, model_args, model_arch)
      File "./main.py", line 623, in main
        ) = prepare_for_training(args, model_args, model_arch)
      File "./main.py", line 462, in prepare_for_training
        model = model_arch(
      File "./image_classification/models/model.py", line 138, in __call__
        state_dict = {
      File "./image_classification/models/model.py", line 142, in <dictcomp>
        dict(model.named_modules())[".".join(k.split(".")[:-2])]
    KeyError: 'layer1.block0.se'

    opened by kfirlevari 0
  • [GNMT, NCF, TransformerXL/PyTorch] Run failed

    Describe the bug

    I ran these deep learning examples using the PyTorch NGC Docker image (PyTorch NGC release 'pytorch:21.07-py3'); the device is an NVIDIA RTX 3090 24 GB. All models ran successfully except GNMT, NCF, and Transformer-XL, which fail with the weird errors below (for the two affinity failures, see the sketch after the error logs). I need some help.

    BUG

    • gnmt

    Run the following command:

    python3 -m torch.distributed.launch --nproc_per_node=1 train.py --dataset-dir "/data/gnmt/wmt16_de_en" --train-batch-size "288" --math "fp32" --epochs "2" --seed "2"

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6my8it11/none_hekk6ojr/attempt_1/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 667, in <module>
        main()
      File "train.py", line 388, in main
        affinity = gpu_affinity.set_affinity(
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47637) of binary: /opt/conda/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

    • ncf

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 ncf.py --data "/data/ncf/cache/ml-20m" --epochs "2" --batch_size "2516582" --opt_level "O0"

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_a2r3m8o_/none_h135axnw/attempt_0/0/error.json
    :::NVLOGv0.1.0 ncf 1671438905.572438002 (ncf.py:171) cpu_info: {"num": 24, "name": "AMD EPYC 7773X 64-Core Processor"}
    :::NVLOGv0.1.0 ncf 1671438905.578721285 (ncf.py:171) mem_info: {"ram": "29Gi"}
    :::NVLOGv0.1.0 ncf 1671438905.720378399 (ncf.py:171) gpu_info: {"driver_version": "515.65.01", "num": 1, "name": ["NVIDIA GeForce RTX 3090"], "mem": ["24576 MiB"]}
    :::NVLOGv0.1.0 ncf 1671438905.721916914 (ncf.py:174) args: {"data": "/data/ncf/cache/ml-20m", "epochs": 2, "batch_size": 2516582, "valid_batch_size": 1048576, "factors": 64, "layers": [256, 256, 128, 64], "negative_samples": 4, "learning_rate": 0.0045, "topk": 10, "seed": 1, "threshold": 1.0, "beta1": 0.25, "beta2": 0.5, "eps": 1e-08, "dropout": 0.5, "checkpoint_dir": "/data/checkpoints/", "load_checkpoint_path": null, "mode": "train", "grads_accumulated": 1, "opt_level": "O0", "local_rank": 0, "distributed": false, "world_size": 1}
    Saving results to /data/checkpoints/
    :::NVLOGv0.1.0 ncf 1671438905.722423792 (ncf.py:184) preproc_hp_sample_eval_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722621918 (ncf.py:185) input_hp_sample_train_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722801685 (ncf.py:186) input_step_eval_neg_gen
    :::NVLOGv0.1.0 ncf 1671438906.979937315 (ncf.py:194) run_start
    :::NVLOGv0.1.0 ncf 1671438907.882619858 (ncf.py:201) preproc_hp_num_eval: 100
    :::NVLOGv0.1.0 ncf 1671438907.883869886 (ncf.py:207) input_size: 19861770
    :::NVLOGv0.1.0 ncf 1671438907.905972481 (ncf.py:216) input_batch_size: 2516582
    :::NVLOGv0.1.0 ncf 1671438907.906189203 (ncf.py:217) input_order
    :::NVLOGv0.1.0 ncf 1671438907.906588554 (/workspace/examples/ncf/neumf.py:54) model_hp_mf_dim: 64
    :::NVLOGv0.1.0 ncf 1671438908.116574049 (/workspace/examples/ncf/neumf.py:62) model_hp_mlp_layer_sizes: [256, 256, 128, 64]
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -8) local_rank: 0 (pid: 41723) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group

    • transformer-xl

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 train.py --data "/data/transformer-xl/wikitext-103" --max_step "400" --batch_size "14" --dataset "wt103" --n_layer "16" --d_model "512" --n_head "8" --d_head "64" --d_inner "2048" --dropout "0.1" --dropatt "0.0" --optim "jitlamb" --lr "0.0" --eta_min "0.001" --warmup_step "1000" --tgt_len "192" --mem_len "192" --eval_tgt_len "192" --log_interval "10" --eval_interval "5000" --roll --cuda

    Error:

    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4w5qnet_/none_wit_ptbo/attempt_0/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 1102, in <module>
        main()
      File "train.py", line 690, in main
        affinity = utils.gpu_affinity.set_affinity(
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49668) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
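
    Both the GNMT and Transformer-XL failures come from os.sched_setaffinity raising OSError [Errno 22]; on Linux that error is typically returned when the requested CPU mask contains no core that is present and permitted, which can happen when an affinity list built for a larger (DGX-like) CPU topology is applied on this 24-core host. A minimal defensive sketch (not the repository's gpu_affinity code) that intersects the requested mask with the CPUs actually available:

    import os

    def set_affinity_safely(requested_cpus):
        # os.sched_setaffinity raises OSError [Errno 22] when the mask names no
        # CPU that is present and permitted to the process.
        available = os.sched_getaffinity(0)
        usable = set(requested_cpus) & available
        if not usable:
            # Leave the affinity untouched rather than crashing the worker.
            return available
        os.sched_setaffinity(0, usable)
        return usable

    # Example: a mask meant for a 64-core socket, applied on a smaller machine.
    print(set_affinity_safely(range(32, 64)))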

    Environment

    • Container version: pytorch:21.07-py3
    • GPUs in the system: 1 x NVIDIA RTX 3090 24 GB
    • CUDA driver version: 515.65.01
    bug 
    opened by zengxunli 0
  • [Kaldi/SpeechRecognition] Update Included Notebooks

    Related to Kaldi/SpeechRecognition outdated Jupyter Notebooks

    Examples:

    • Kaldi/SpeechRecognition
    • Jupyter notebooks

    Is your feature request related to a problem? Please describe. The Jupyter notebooks included in Kaldi/SpeechRecognition are outdated and don't work with the new Triton server, because these notebooks use the older tensorrtserver APIs.

    Describe the solution you'd like Updated Jupyter notebooks that are compatible with the current version of Triton Server and use the new tritonclient APIs.

    Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

    Additional context I have tried to use the new tritonclient client by following the examples here, but every time I set up the inputs and send a request for inference, I encounter the following error:

    ---------------------------------------------------------------------------
    _InactiveRpcError                         Traceback (most recent call last)
    <ipython-input-32-2617521bb391> in <module>
    ----> 1 response = grpc_stub.ModelInfer(request)
    
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
        944         state, call, = self._blocking(request, timeout, metadata, credentials,
        945                                       wait_for_ready, compression)
    --> 946         return _end_unary_response_blocking(state, call, False, None)
        947 
        948     def with_call(self,
    
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
        847             return state.response
        848     else:
    --> 849         raise _InactiveRpcError(state)
        850 
        851 
    
    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    	status = StatusCode.UNIMPLEMENTED
    	details = "ModelInfer RPC doesn't support models with decoupled transaction policy"
    	debug_error_string = "{"created":"@1671696095.420704856","description":"Error received from peer ipv4:127.0.0.1:8001","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"ModelInfer RPC doesn't support models with decoupled transaction policy","grpc_status":12}"
    >
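
    The UNIMPLEMENTED status above is consistent with the model being configured with a decoupled transaction policy, which the plain ModelInfer RPC cannot serve; decoupled models have to be driven through the streaming API. A rough sketch of the streaming tritonclient pattern (the model name and tensor names below are placeholders, not the actual Kaldi backend names):

    import queue
    import numpy as np
    import tritonclient.grpc as grpcclient

    responses = queue.Queue()

    def callback(result, error):
        # A decoupled model may send zero, one, or many responses per request.
        responses.put(error if error is not None else result)

    client = grpcclient.InferenceServerClient(url="localhost:8001")

    # Placeholder tensors -- replace with the names the Kaldi model actually exposes.
    audio = np.zeros((1, 16000), dtype=np.float32)
    inp = grpcclient.InferInput("INPUT__0", list(audio.shape), "FP32")
    inp.set_data_from_numpy(audio)
    out = grpcclient.InferRequestedOutput("OUTPUT__0")

    client.start_stream(callback=callback)
    client.async_stream_infer(model_name="kaldi_online", inputs=[inp], outputs=[out], request_id="0")
    client.stop_stream()  # flush the stream and wait for outstanding callbacks

    print(responses.get())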
    
    enhancement 
    opened by InzamamAnwar 0
  • replace_static.sparsity_with_incubate.asp

    What happened?

    paddle.static.sparsity has been removed and replaced by paddle.incubate.asp; see the PR for details: https://github.com/PaddlePaddle/Paddle/pull/48450

    What did I do?

    Replaced paddle.static.sparsity with paddle.incubate.asp (a generic before/after sketch follows).
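
    A minimal before/after sketch of the substitution (a generic dygraph example for illustration, not one of the files touched by this PR); the sparsity entry points are assumed to keep the same names under the new module path, so check the linked PR for any signature differences:

    import paddle
    from paddle.incubate import asp  # replaces the removed paddle.static.sparsity

    # Before (removed API):
    #   from paddle.static import sparsity
    #   optimizer = sparsity.decorate(optimizer)
    #   sparsity.prune_model(...)
    model = paddle.nn.Linear(64, 32)
    optimizer = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())
    optimizer = asp.decorate(optimizer)  # assumed equivalent of sparsity.decorate
    asp.prune_model(model)               # assumed equivalent of sparsity.prune_model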

    What did you expect to happen?

    Eliminate the impact of removing paddle.static.sparsity.

    The specification of the pull request

    PR Specification from OSCS

    opened by GGBond8488 0