NVIDIA Deep Learning Examples for Tensor Cores


This repository provides state-of-the-art deep learning examples that are easy to train and deploy, achieving the best reproducible accuracy and performance with the NVIDIA CUDA-X software stack running on NVIDIA Volta, Turing and Ampere GPUs.

NVIDIA GPU Cloud (NGC) Container Registry

These examples, along with our NVIDIA deep learning software stack, are provided in a monthly updated Docker container on the NGC container registry (https://ngc.nvidia.com). These containers include:

  • The latest NVIDIA examples from this repository
  • The latest NVIDIA contributions shared upstream to the respective framework
  • The latest NVIDIA Deep Learning software libraries, such as cuDNN, NCCL, and cuBLAS, all of which have been through a rigorous monthly quality assurance process to ensure that they provide the best possible performance
  • Monthly release notes for each of the NVIDIA optimized containers

Computer Vision

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
ResNet-50                PyTorch      Yes   Yes  Yes        -           Yes  -     Yes     Yes  -
ResNeXt-101              PyTorch      Yes   Yes  Yes        -           Yes  -     Yes     Yes  -
SE-ResNeXt-101           PyTorch      Yes   Yes  Yes        -           Yes  -     Yes     Yes  -
EfficientNet-B0          PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
EfficientNet-B4          PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
EfficientNet-WideSE-B0   PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
EfficientNet-WideSE-B4   PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
Mask R-CNN               PyTorch      Yes   Yes  Yes        -           -    -     -       -    Yes
nnUNet                   PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
SSD                      PyTorch      Yes   Yes  Yes        -           -    -     -       -    Yes
ResNet-50                TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
ResNeXt101               TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
SE-ResNeXt-101           TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
Mask R-CNN               TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
SSD                      TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  Yes
U-Net Ind                TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  Yes
U-Net Med                TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
U-Net 3D                 TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
V-Net Med                TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
U-Net Med                TensorFlow2  Yes   Yes  Yes        -           -    -     -       Yes  -
Mask R-CNN               TensorFlow2  Yes   Yes  Yes        -           -    -     -       Yes  -
EfficientNet             TensorFlow2  Yes   Yes  Yes        Yes         -    -     -       Yes  -
ResNet-50                MXNet        -     Yes  Yes        -           -    -     -       -    -

Natural Language Processing

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
BERT                     PyTorch      Yes   Yes  Yes        Yes         -    -     Yes     Yes  -
TransformerXL            PyTorch      Yes   Yes  Yes        Yes         -    -     -       Yes  -
GNMT                     PyTorch      Yes   Yes  Yes        -           -    -     -       -    -
Transformer              PyTorch      Yes   Yes  Yes        -           -    -     -       -    -
ELECTRA                  TensorFlow2  Yes   Yes  Yes        Yes         -    -     -       Yes  -
BERT                     TensorFlow   Yes   Yes  Yes        Yes         Yes  -     Yes     Yes  Yes
BERT                     TensorFlow2  Yes   Yes  Yes        Yes         -    -     -       Yes  -
BioBert                  TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  Yes
TransformerXL            TensorFlow   Yes   Yes  Yes        -           -    -     -       -    -
GNMT                     TensorFlow   Yes   Yes  Yes        -           -    -     -       -    -
Faster Transformer       TensorFlow   -     -    -          -           Yes  -     -       -    -

Recommender Systems

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
DLRM                     PyTorch      Yes   Yes  Yes        -           -    Yes   Yes     Yes  Yes
DLRM                     TensorFlow2  Yes   Yes  Yes        Yes         -    -     -       Yes  -
NCF                      PyTorch      Yes   Yes  Yes        -           -    -     -       -    -
Wide&Deep                TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
Wide&Deep                TensorFlow2  Yes   Yes  Yes        -           -    -     -       Yes  -
NCF                      TensorFlow   Yes   Yes  Yes        -           -    -     -       Yes  -
VAE-CF                   TensorFlow   Yes   Yes  Yes        -           -    -     -       -    -

Speech to Text

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
Jasper                   PyTorch      Yes   Yes  Yes        -           Yes  Yes   Yes     Yes  Yes
Hidden Markov Model      Kaldi        -     -    Yes        -           -    -     Yes     -    -

Text to Speech

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
FastPitch                PyTorch      Yes   Yes  Yes        -           -    -     -       Yes  -
FastSpeech               PyTorch      -     Yes  Yes        -           Yes  -     -       -    -
Tacotron 2 and WaveGlow  PyTorch      Yes   Yes  Yes        -           Yes  Yes   Yes     Yes  -

Graph Neural Networks

Models                   Framework    A100  AMP  Multi-GPU  Multi-Node  TRT  ONNX  Triton  DLC  NB
SE(3)-Transformer        PyTorch      Yes   Yes  Yes        -           -    -     -       -    -

NVIDIA support

In each of the network READMEs, we indicate the level of support that will be provided. The range is from ongoing updates and improvements to a point-in-time release for thought leadership.


Multinode Training
Supported on a pyxis/enroot Slurm cluster.

Deep Learning Compiler (DLC)
DLC support refers to TensorFlow XLA and PyTorch JIT and/or TorchScript.

Accelerated Linear Algebra (XLA)
XLA is a domain-specific compiler for linear algebra that can accelerate TensorFlow models with potentially no source code changes. The results are improvements in speed and memory usage.
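
As a hedged illustration of the "no source code changes" route: the TensorFlow XLA documentation describes enabling auto-clustering through the TF_XLA_FLAGS environment variable. The sketch below only prepares the environment for a child process and does not import TensorFlow. The helper name with_xla_autojit is hypothetical and not part of this repository.

```python
import os


def with_xla_autojit(base_env=None):
    """Return a copy of the environment with XLA auto-clustering requested."""
    # Copy so the caller's mapping (or os.environ) is never modified in place.
    env = dict(os.environ if base_env is None else base_env)
    flag = "--tf_xla_auto_jit=2"
    # Preserve any flags that are already set and avoid adding a duplicate.
    flags = env.get("TF_XLA_FLAGS", "").split()
    if flag not in flags:
        flags.append(flag)
    env["TF_XLA_FLAGS"] = " ".join(flags)
    return env


if __name__ == "__main__":
    print(with_xla_autojit({})["TF_XLA_FLAGS"])
```

The returned mapping can be passed as the env argument of subprocess.run when launching a training script, so the script itself stays unchanged.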

PyTorch JIT and/or TorchScript
TorchScript is a way to create serializable and optimizable models from PyTorch code. TorchScript is an intermediate representation of a PyTorch model (a subclass of nn.Module) that can then be run in a high-performance environment such as C++.

Automatic Mixed Precision (AMP)
Automatic Mixed Precision (AMP) enables mixed precision training on Volta, Turing, and NVIDIA Ampere GPU architectures automatically.
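
To show why mixed precision needs some care, here is a minimal sketch of loss scaling, a technique commonly used alongside mixed precision training. It uses only the Python standard library to round values to half precision (FP16). This illustrates FP16 rounding behavior and is not the AMP implementation; the gradient value and scale factor are made-up numbers.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


tiny_grad = 1e-8     # hypothetical gradient, smaller than any nonzero FP16 value
loss_scale = 1024.0  # hypothetical scale factor (a power of two)

unscaled = to_fp16(tiny_grad)             # underflows to zero in FP16
scaled = to_fp16(tiny_grad * loss_scale)  # representable once scaled up
recovered = scaled / loss_scale           # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling, the value is lost entirely. With scaling, it is recovered to within FP16 rounding error.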

TensorFloat-32 (TF32)
TensorFloat-32 (TF32) is the new math mode in NVIDIA A100 GPUs for handling matrix math, also called tensor operations. TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. TF32 is supported in the NVIDIA Ampere GPU architecture and is enabled by default.
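
TF32 keeps the 8-bit exponent of FP32 but only 10 explicit mantissa bits. The sketch below emulates that reduced precision on the CPU with the Python standard library, to give a feel for the rounding involved. The function name is hypothetical, round-to-nearest-even is an assumption made for illustration, and NaN and infinity are not handled.

```python
import struct


def round_to_tf32(x):
    """Emulate TF32 precision: keep 10 of FP32's 23 mantissa bits."""
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest, ties to even, on the 13 low mantissa bits being dropped.
    bits += 0x0FFF + ((bits >> 13) & 1)
    # Clear the dropped bits (this also keeps the value within 32 bits).
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


for value in (1.0, 1.0 + 2**-10, 1.0 + 2**-12):
    print(value, round_to_tf32(value))
```

Increments of 2**-10 relative to 1.0 survive, while the smaller increment 2**-12 is rounded away.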

Jupyter Notebooks (NB)
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text.

Feedback / Contributions

We're posting these examples on GitHub to better support the community, facilitate feedback, as well as collect and implement contributions using GitHub Issues and pull requests. We welcome all contributions!

Known issues

In each of the network READMEs, we indicate any known issues and encourage the community to provide feedback.

  • DeepLearningExamples/MxNet/Classification/RN50v1.5 --> prepare_imagenet.sh

    Hi team,

    When I try to run prepare_imagenet.sh, nothing happens. It keeps running with no output and no I/O to the disk. I downloaded the 150 GB ImageNet dataset and ran tar -xvzf imagenet.tar.

    Below is the folder structure I got

    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/
    total 162373888
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ./
    drwxr-xr-x 3 root root           44 Jan  3 07:44 ../
    drwxr-xr-x 5 root root         4096 Jan  2 10:15 ILSVRC/
    -rw-r--r-- 1 root root 166022728827 Jan  2 03:09 ILSVRC2017_CLS-LOC.tar.gz
    drwxrwxr-x 5 root root         4096 Feb  9  2015 tiny-imagenet-200/
    -rw-r--r-- 1 root root    248100043 Jan  2 10:00 tiny-imagenet-200.zip
    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/
    total 20
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ./
    drwxr-xr-x 5 root root 4096 Jan  2 10:15 ../
    drwxr-xr-x 3 root root 4096 Jan  2 04:49 Annotations/
    drwxr-xr-x 3 root root 4096 Jan  2 06:14 Data/
    drwxr-xr-x 3 root root 4096 Jan  2 10:15 ImageSets/
    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/
    total 12
    drwxr-xr-x 3 root   root 4096 Jan  2 06:14 ./
    drwxr-xr-x 5 root   root 4096 Jan  2 10:15 ../
    drwxr-xr-x 6 200031 1003 4096 Jan  3 06:21 CLS-LOC/
    root@ddb5e2b7ceaa:/workspace/rn50# ll /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/
    total 11808
    drwxr-xr-x    6 200031 1003    4096 Jan  3 06:21 ./
    drwxr-xr-x    3 root   root    4096 Jan  2 06:14 ../
    drwxr-xr-x    2 root   root    4096 Jan  3 06:58 out/
    drwxr-xr-x    2 200031 1003 7979008 May 17  2015 test/
    drwxr-xr-x 1002 200031 1003   65536 Sep 29  2014 train/
    drwxr-xr-x    2 200031 1003 4014080 May 17  2015 val/

    Below is the command for data pre-processing

    root@ddb5e2b7ceaa:/workspace/rn50# ./scripts/prepare_imagenet.sh /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/ /data/imagenet/train-val-recordio-passthrough/ILSVRC/Data/CLS-LOC/
    ^CTraceback (most recent call last):
     File "/opt/mxnet/tools/im2rec.py", line 329, in <module>
     File "/opt/mxnet/tools/im2rec.py", line 100, in make_list
       image_list = list(image_list)
     File "/opt/mxnet/tools/im2rec.py", line 60, in list_image
       if os.path.isfile(fpath) and (suffix in exts):
     File "/usr/lib/python3.8/genericpath.py", line 30, in isfile
       st = os.stat(path)
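
The traceback above was produced by interrupting the script while /opt/mxnet/tools/im2rec.py was inside list_image, checking paths with os.path.isfile. One possible reading, which the report does not confirm, is that the script was still enumerating a very large directory tree rather than hanging. The sketch below is a hypothetical stand-alone helper, not part of im2rec.py, that counts image files under each top-level subdirectory so the size of that enumeration can be gauged. The extension set is an assumption.

```python
import os

IMAGE_EXTS = {".jpeg", ".jpg", ".png"}  # assumed image extensions


def count_images(root):
    """Return {top-level subdirectory name: number of image files beneath it}."""
    counts = {}
    for entry in sorted(os.scandir(root), key=lambda e: e.name):
        if not entry.is_dir():
            continue
        total = 0
        # Walk the whole subtree, since train/ holds one directory per class.
        for _dirpath, _dirnames, filenames in os.walk(entry.path):
            total += sum(
                1 for name in filenames
                if os.path.splitext(name)[1].lower() in IMAGE_EXTS
            )
        counts[entry.name] = total
    return counts
```

Calling count_images on the CLS-LOC directory would report totals for out, test, train and val separately.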
    opened by karanveersingh5623 0
  • Failed to use --pretrained-from-file due to KeyError: 'layer1.block0.se' error

    It seems that the change in https://github.com/NVIDIA/DeepLearningExamples/commit/5843f4e5a1220dd98936e477d8597d8a77320666 didn't consider the case of --pretrained-from-file in this line https://github.com/NVIDIA/DeepLearningExamples/blob/ca5ae20e3d1af3464159754f758768052c41c607/PyTorch/Classification/ConvNets/image_classification/models/model.py#L123, which results in a failure to load the model file.

    The loading problem can be resolved by changing that line to

    if (pretrained or pretrained_from_file) and hasattr(model, "ngc_checkpoint_remap"):

    But this suggested change causes inference to fail (e.g., when it is executed with a model I trained from scratch), so I'm probably missing something.

    Reproducing the issue -

    E.g., run - python ./launch.py --model efficientnet-widese-b4 --precision AMP --mode convergence --platform T4 ./imagenet --workspace ./workspace --raport-file raport.json --pretrained-from-file ./nvidia_efficientnet-widese-b4_210412.pth

    Output -

    ...
    => loading pretrained weights from './nvidia_efficientnet-widese-b4_210412.pth'
    Traceback (most recent call last):
      File "./launch.py", line 53, in
        main(args, model_args, model_arch)
      File "./main.py", line 623, in main
        ) = prepare_for_training(args, model_args, model_arch)
      File "./main.py", line 462, in prepare_for_training
        model = model_arch(
      File "./image_classification/models/model.py", line 138, in call
        state_dict = {
      File "./image_classification/models/model.py", line 142, in
        dict(model.named_modules())[".".join(k.split(".")[:-2])]
    KeyError: 'layer1.block0.se'
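
The failing expression in the traceback derives a module path from each checkpoint key by dropping its last two dot-separated components, then looks that path up among the model's named modules. The sketch below reproduces that lookup with plain dictionaries to show how a KeyError of this kind can arise. The key and module names are hypothetical stand-ins, not taken from the actual checkpoint or model.

```python
def parent_module_path(key):
    """Drop the last two dot-separated components of a state-dict key."""
    return ".".join(key.split(".")[:-2])


# Hypothetical module names standing in for dict(model.named_modules()).
named_modules = {"layer1.block0.depsep": object()}

# Hypothetical checkpoint keys.
checkpoint_keys = [
    "layer1.block0.depsep.conv.weight",
    "layer1.block0.se.squeeze.weight",
]

# Keys whose derived module path is absent would raise KeyError on lookup.
missing = [
    key for key in checkpoint_keys
    if parent_module_path(key) not in named_modules
]
print(missing)
```

Listing the missing paths this way, before indexing, shows which checkpoint entries do not line up with the instantiated model.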

    opened by kfirlevari 0
  • [GNMT, NCF, TransformerXL/PyTorch] Run failed

    Describe the bug

    I ran these deep learning examples using the PyTorch NGC Docker image (PyTorch NGC release 'pytorch:21.07-py3') on an NVIDIA 3090 24G. All models worked successfully except gnmt, ncf, and transformer-xl, which fail with the errors below. I need some help.


    • gnmt

    Run the following command:

    python3 -m torch.distributed.launch --nproc_per_node=1 train.py --dataset-dir "/data/gnmt/wmt16_de_en" --train-batch-size "288" --math "fp32" --epochs "2" --seed "2"


    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_6my8it11/none_hekk6ojr/attempt_1/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 667, in <module>
        main()
      File "train.py", line 388, in main
        affinity = gpu_affinity.set_affinity(
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/gnmt/seq2seq/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47637) of binary: /opt/conda/bin/python3
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed

    • ncf

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 ncf.py --data "/data/ncf/cache/ml-20m" --epochs "2" --batch_size "2516582" --opt_level "O0"


    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_a2r3m8o_/none_h135axnw/attempt_0/0/error.json
    :::NVLOGv0.1.0 ncf 1671438905.572438002 (ncf.py:171) cpu_info: {"num": 24, "name": "AMD EPYC 7773X 64-Core Processor"}
    :::NVLOGv0.1.0 ncf 1671438905.578721285 (ncf.py:171) mem_info: {"ram": "29Gi"}
    :::NVLOGv0.1.0 ncf 1671438905.720378399 (ncf.py:171) gpu_info: {"driver_version": "515.65.01", "num": 1, "name": ["NVIDIA GeForce RTX 3090"], "mem": ["24576 MiB"]}
    :::NVLOGv0.1.0 ncf 1671438905.721916914 (ncf.py:174) args: {"data": "/data/ncf/cache/ml-20m", "epochs": 2, "batch_size": 2516582, "valid_batch_size": 1048576, "factors": 64, "layers": [256, 256, 128, 64], "negative_samples": 4, "learning_rate": 0.0045, "topk": 10, "seed": 1, "threshold": 1.0, "beta1": 0.25, "beta2": 0.5, "eps": 1e-08, "dropout": 0.5, "checkpoint_dir": "/data/checkpoints/", "load_checkpoint_path": null, "mode": "train", "grads_accumulated": 1, "opt_level": "O0", "local_rank": 0, "distributed": false, "world_size": 1}
    Saving results to /data/checkpoints/
    :::NVLOGv0.1.0 ncf 1671438905.722423792 (ncf.py:184) preproc_hp_sample_eval_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722621918 (ncf.py:185) input_hp_sample_train_replacement: true
    :::NVLOGv0.1.0 ncf 1671438905.722801685 (ncf.py:186) input_step_eval_neg_gen
    :::NVLOGv0.1.0 ncf 1671438906.979937315 (ncf.py:194) run_start
    :::NVLOGv0.1.0 ncf 1671438907.882619858 (ncf.py:201) preproc_hp_num_eval: 100
    :::NVLOGv0.1.0 ncf 1671438907.883869886 (ncf.py:207) input_size: 19861770
    :::NVLOGv0.1.0 ncf 1671438907.905972481 (ncf.py:216) input_batch_size: 2516582
    :::NVLOGv0.1.0 ncf 1671438907.906189203 (ncf.py:217) input_order
    :::NVLOGv0.1.0 ncf 1671438907.906588554 (/workspace/examples/ncf/neumf.py:54) model_hp_mf_dim: 64
    :::NVLOGv0.1.0 ncf 1671438908.116574049 (/workspace/examples/ncf/neumf.py:62) model_hp_mlp_layer_sizes: [256, 256, 128, 64]
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -8) local_rank: 0 (pid: 41723) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
    INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
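
The ncf log reports exitcode: -8 without a Python traceback. By the usual Python convention for child processes, a negative exit code -N means the process was terminated by signal N. The sketch below decodes such codes. The helper name describe_exit is hypothetical.

```python
import signal


def describe_exit(code):
    """Translate a child-process exit code into a readable description."""
    if code < 0:
        try:
            # Negative codes encode the number of the terminating signal.
            return "terminated by " + signal.Signals(-code).name
        except ValueError:
            return "terminated by unknown signal %d" % -code
    return "exited with status %d" % code


print(describe_exit(-8))
print(describe_exit(1))
```

On Linux, signal 8 is SIGFPE, which typically points to an arithmetic fault in native code rather than a Python exception.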

    • transformer-xl

    Run the following command:

    python -m torch.distributed.launch --nproc_per_node=1 train.py --data "/data/transformer-xl/wikitext-103" --max_step "400" --batch_size "14" --dataset "wt103" --n_layer "16" --d_model "512" --n_head "8" --d_head "64" --d_inner "2048" --dropout "0.1" --dropatt "0.0" --optim "jitlamb" --lr "0.0" --eta_min "0.001" --warmup_step "1000" --tgt_len "192" --mem_len "192" --eval_tgt_len "192" --log_interval "10" --eval_interval "5000" --roll --cuda


    INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
    INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_4w5qnet_/none_wit_ptbo/attempt_0/0/error.json
    train.py:41: UserWarning: PyProf is unavailable
      warnings.warn('PyProf is unavailable')
    Traceback (most recent call last):
      File "train.py", line 1102, in <module>
        main()
      File "train.py", line 690, in main
        affinity = utils.gpu_affinity.set_affinity(
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 135, in set_affinity
        set_socket_unique_affinity(gpu_id, nproc_per_node, 'interleaved')
      File "/workspace/examples/transformer-xl/pytorch/utils/gpu_affinity.py", line 110, in set_socket_unique_affinity
        os.sched_setaffinity(0, affinity)
    OSError: [Errno 22] Invalid argument
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 49668) of binary: /opt/conda/bin/python
    ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
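
Both the gnmt and transformer-xl tracebacks end with os.sched_setaffinity(0, affinity) raising "Invalid argument". One plausible cause, which is an assumption and not confirmed by the report, is that the computed CPU set contains no cores the container is permitted to use (the ncf log shows 24 CPUs on a 64-core processor). The sketch below restricts a requested set to the permitted one before applying it. The function name is hypothetical and this is not the repository's gpu_affinity.py.

```python
import os


def clamp_affinity(requested, allowed):
    """Intersect requested CPU ids with allowed ones; fall back to allowed."""
    usable = set(requested) & set(allowed)
    return usable if usable else set(allowed)


if hasattr(os, "sched_getaffinity"):  # these calls are not available everywhere
    allowed = os.sched_getaffinity(0)
    # Hypothetical request that includes one CPU id the process cannot use.
    requested = set(allowed) | {10**6}
    os.sched_setaffinity(0, clamp_affinity(requested, allowed))
```

With the fallback, a request that shares no CPUs with the permitted set leaves the process on its current CPUs instead of raising an error.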


    • Container version (pytorch:21.07-py3):
    • GPUs in the system: (1 x Nvidia 3090 24G):
    • CUDA driver version (515.65.01):
    opened by zengxunli 0
  • [Kaldi/SpeechRecognition] Update Included Notebooks

    Related to Kaldi/SpeechRecognition outdated Jupyter Notebooks


    • Kaldi/SpeechRecognition
    • Jupyter notebooks

    Is your feature request related to a problem? Please describe.
    The Jupyter notebooks included in Kaldi/SpeechRecognition are outdated and don't work with the new Triton server because they use the older tensorrtserver APIs.

    Describe the solution you'd like
    Updated Jupyter notebooks that are compatible with the current version of Triton Server and use the new tritonclient APIs.

    Describe alternatives you've considered
    A clear and concise description of any alternative solutions or features you've considered.

    Additional context
    I have tried to use the new tritonclient by following the examples here. But every time I set up the inputs and send a request for inference, I encounter the following error:

    _InactiveRpcError                         Traceback (most recent call last)
    <ipython-input-32-2617521bb391> in <module>
    ----> 1 response = grpc_stub.ModelInfer(request)
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in __call__(self, request, timeout, metadata, credentials, wait_for_ready, compression)
        944         state, call, = self._blocking(request, timeout, metadata, credentials,
        945                                       wait_for_ready, compression)
    --> 946         return _end_unary_response_blocking(state, call, False, None)
        948     def with_call(self,
    /usr/local/lib/python3.8/dist-packages/grpc/_channel.py in _end_unary_response_blocking(state, call, with_call, deadline)
        847             return state.response
        848     else:
    --> 849         raise _InactiveRpcError(state)
    _InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    	status = StatusCode.UNIMPLEMENTED
    	details = "ModelInfer RPC doesn't support models with decoupled transaction policy"
    	debug_error_string = "{"created":"@1671696095.420704856","description":"Error received from peer ipv4:","file":"src/core/lib/surface/call.cc","file_line":1067,"grpc_message":"ModelInfer RPC doesn't support models with decoupled transaction policy","grpc_status":12}"
    opened by InzamamAnwar 0
  • replace_static.sparsity_with_incubate.asp


    What happened?

    paddle.static.sparsity has been removed and replaced by paddle.incubate.asp. See the PR for details: https://github.com/PaddlePaddle/Paddle/pull/48450

    What did I do?

    Replaced paddle.static.sparsity with paddle.incubate.asp.

    What did you expect to happen?

    Eliminate the impact of removing paddle.static.sparsity.

    The specification of the pull request

    PR Specification from OSCS

    opened by GGBond8488 0
NVIDIA Corporation
