DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning.

Overview

DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

When used standalone, the DirectML API is a low-level DirectX 12 library suitable for high-performance, low-latency applications such as frameworks, games, and other real-time applications. Its seamless interoperability with Direct3D 12, low overhead, and conformance across hardware make DirectML ideal for accelerating machine learning when high performance is desired and reliable, predictable results across hardware are critical.

More information about DirectML can be found in Introduction to DirectML.

Visit the DirectX Landing Page for more resources for DirectX developers.

Getting Started with DirectML

DirectML is distributed as a system component of Windows 10, and is available as part of the operating system in Windows 10, version 1903 (10.0; Build 18362), and newer.

Starting with DirectML version 1.4.0, DirectML is also available as a standalone redistributable package (see Microsoft.AI.DirectML), which is useful for applications that wish to use a fixed version of DirectML, or when running on older versions of Windows 10.

Hardware requirements

DirectML requires a DirectX 12-capable device. Almost all commercially available graphics cards released in the last several years support DirectX 12. Examples of compatible hardware include:

  • AMD GCN 1st Gen (Radeon HD 7000 series) and above
  • Intel Haswell (4th-gen Core) HD Integrated Graphics and above
  • NVIDIA Kepler (GTX 600 series) and above
  • Qualcomm Adreno 600 and above

For application developers

DirectML exposes a native C++ DirectX 12 API. The header and library (DirectML.h/DirectML.lib) are available as part of the redistributable NuGet package, and are also included in the Windows 10 SDK version 10.0.18362 or newer.

For users, data scientists, and researchers

DirectML is built in as a backend to several frameworks such as Windows ML, ONNX Runtime, and TensorFlow.

See the following sections for more information:

DirectML Samples

DirectML C++ sample code is available under Samples.

  • HelloDirectML: A minimal "hello world" application that executes a single DirectML operator.
  • DirectMLSuperResolution: A sample that uses DirectML to execute a basic super-resolution model to upscale video from 540p to 1080p in real time.
  • yolov4: YOLOv4 is an object detection model capable of recognizing up to 80 different classes of objects in an image. This sample contains a complete end-to-end implementation of the model using DirectML, and is able to run in real time on a user-provided video stream.

DirectML Python sample code is available under Python/samples. The samples require PyDirectML, an open source Python projection library for DirectML, which can be built from Python/src and installed into a Python environment. Refer to the Python/README.md file for more details.

Windows ML on DirectML

Windows ML (WinML) is a high-performance, reliable API for deploying hardware-accelerated ML inferences on Windows devices. DirectML provides the GPU backend for Windows ML.

DirectML acceleration can be enabled in Windows ML using the LearningModelDevice with any one of the DirectX DeviceKinds.

For more information, see Get Started with Windows ML.

ONNX Runtime on DirectML

ONNX Runtime is a cross-platform inferencing and training accelerator compatible with many popular ML/DNN frameworks, including PyTorch, TensorFlow/Keras, scikit-learn, and more.

DirectML is available as an optional execution provider for ONNX Runtime that provides hardware acceleration when running on Windows 10.

For more information about getting started, see Using the DirectML execution provider.
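
As a rough illustration of selecting the execution provider from Python, the sketch below prefers DirectML when the installed ONNX Runtime build offers it and falls back to the CPU otherwise. It assumes a DirectML-enabled build (distributed on PyPI as onnxruntime-directml); the helper names pick_providers and create_session are hypothetical, as is any model path passed to them.

```python
# Sketch only: choose ONNX Runtime execution providers, preferring DirectML.
# "DmlExecutionProvider" is the name ONNX Runtime uses for the DirectML provider.
# pick_providers and create_session are hypothetical helper names.


def pick_providers(available):
    """Return a priority list: DirectML first if present, CPU as fallback."""
    preferred = ["DmlExecutionProvider", "CPUExecutionProvider"]
    return [name for name in preferred if name in available]


def create_session(model_path):
    """Create an InferenceSession, or return None if onnxruntime is not installed."""
    try:
        import onnxruntime as ort
    except ImportError:
        return None
    providers = pick_providers(ort.get_available_providers())
    return ort.InferenceSession(model_path, providers=providers)


if __name__ == "__main__":
    # With a DirectML-enabled build, both providers are typically reported.
    print(pick_providers(["DmlExecutionProvider", "CPUExecutionProvider"]))
    # With a CPU-only build, only the CPU provider remains.
    print(pick_providers(["CPUExecutionProvider"]))
```

Listing the CPU provider last keeps the session usable on machines where the DirectML provider is not available.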

TensorFlow with DirectML

TensorFlow is a popular open source platform for machine learning and is a leading framework for training of machine learning models.

DirectML acceleration for TensorFlow 1.15 is currently available for Public Preview. TensorFlow on DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

TensorFlow on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com).
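
After installing the tensorflow-directml package, one quick sanity check is to list the devices TensorFlow can see. The sketch below is a minimal example under one assumption: that DirectML devices are reported with the device type "DML" (as in device names of the form /device:DML:0). The helper name list_dml_devices is hypothetical.

```python
# Sketch only: list DirectML devices visible to TensorFlow.
# Assumption: tensorflow-directml reports its devices with device_type "DML".
# list_dml_devices is a hypothetical helper name.


def list_dml_devices():
    """Return the names of DML devices, or [] if TensorFlow is not installed."""
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return []
    return [
        device.name
        for device in device_lib.list_local_devices()
        if device.device_type == "DML"
    ]


if __name__ == "__main__":
    print(list_dml_devices())
```

An empty list means either that TensorFlow is not installed or that the installed build did not find a DirectML device.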

PyTorch with DirectML

DirectML acceleration for PyTorch 1.8.0 is currently available for Public Preview. PyTorch with DirectML enables training and inference of complex machine learning models on a wide range of DirectX 12-compatible hardware.

PyTorch on DirectML is supported on both the latest versions of Windows 10 and the Windows Subsystem for Linux, and is available for download as a PyPI package. For more information about getting started, see GPU accelerated ML training (docs.microsoft.com).
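
The sketch below shows one way to pick a device and run a small tensor operation on it. It assumes the later torch-directml plugin package, whose torch_directml.device() call appears in issue reports further down this page; the 1.8.0 preview build instead addressed the device with the string "dml". The helper names pick_device and add_on_device are hypothetical.

```python
# Sketch only: run a small addition on a DirectML device when one is available.
# Assumption: the torch-directml plugin (import name torch_directml) is installed;
# otherwise the code falls back to the CPU.
# pick_device and add_on_device are hypothetical helper names.


def pick_device():
    """Return a DirectML device if possible, the CPU device otherwise,
    or None if PyTorch itself is not installed."""
    try:
        import torch
    except ImportError:
        return None
    try:
        import torch_directml
    except ImportError:
        return torch.device("cpu")
    return torch_directml.device()


def add_on_device(device):
    """Add two small tensors on `device` and return the result as a Python list."""
    import torch

    a = torch.tensor([1.0, 2.0]).to(device)
    b = torch.tensor([3.0, 4.0]).to(device)
    # Copy back to the CPU before converting to a list.
    return (a + b).to("cpu").tolist()


if __name__ == "__main__":
    device = pick_device()
    if device is not None:
        print(device, add_on_device(device))
```

Moving the result back to the CPU before converting or printing it avoids relying on operators that a given backend build may not implement.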

Feedback

We look forward to hearing from you!

External Links

Documentation

DirectML programming guide
DirectML API reference

More information

Introducing DirectML (Game Developers Conference '19)
Accelerating GPU Inferencing with DirectML and DirectX 12 (SIGGRAPH '18)
Windows AI: hardware-accelerated ML on Windows devices (Microsoft Build '20)
Gaming with Windows ML (DirectX Developer Blog)
DirectML at GDC 2019 (DirectX Developer Blog)
DirectX Linux (DirectX Developer Blog)

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Comments
  • DirectML is x2.8 slower than CUDA

    I tested training the same deepfake model on the same hardware using tensorflow-cuda and tensorflow-directml. (my project https://github.com/iperov/DeepFaceLab)

    DirectML: avg iter time 626ms [Screenshot: DMLvsCUDA1]

    CUDA: avg iter time 222ms [Screenshot: DMLvsCUDA2]

    DirectML is x2.8 slower :-(
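
The x2.8 figure follows directly from the two average iteration times quoted above. A small sketch of the arithmetic (the function name slowdown is hypothetical):

```python
# Sketch only: derive the slowdown factor from the two reported iteration times.
# slowdown is a hypothetical helper name.


def slowdown(candidate_ms, baseline_ms):
    """How many times longer one iteration takes on the candidate than on the baseline."""
    return candidate_ms / baseline_ms


directml_ms = 626  # reported average iteration time with tensorflow-directml
cuda_ms = 222      # reported average iteration time with tensorflow-cuda

ratio = slowdown(directml_ms, cuda_ms)
print(f"{ratio:.2f}x")  # prints 2.82x
```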

    I think that's what I was talking about here https://github.com/microsoft/DirectML/issues/104

    So what is the point of using DirectML if every millisecond of training acceleration is important in today's world?

    x2.8 slower is serious performance degradation. I reached the same speed in my weekend OpenCL NN library in pure python (https://github.com/iperov/litenn)

    But you are from Microsoft. Don't you think there is little point in developing DirectML further until it reaches CUDA-level performance?

    opened by iperov 36
  • Could not load dynamic library 'libcuda.so.1'

    Followed the instructions here

    ~ » cat /proc/version
    Linux version 4.4.0-20150-Microsoft ([email protected]) (gcc version 5.4.0 (GCC) ) #1000-Microsoft Thu Jun 12 17:34:00 PST 2020
    

    I'm running build 20150, but am getting this error:

    Python 3.6.10 |Anaconda, Inc.| (default, May  8 2020, 02:54:21)
    [GCC 7.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import tensorflow.compat.v1 as tf
    >>>
    >>> tf.enable_eager_execution(tf.ConfigProto(log_device_placement=True))
    >>>
    >>> print(tf.add([1.0, 2.0], [3.0, 4.0]))
    2020-06-17 16:36:05.469811: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
    2020-06-17 16:36:05.469926: E tensorflow/stream_executor/cuda/cuda_driver.cc:313] failed call to cuInit: UNKNOWN ERROR (303)
    2020-06-17 16:36:05.470029: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (MAKERPC): /proc/driver/nvidia/version does not exist
    2020-06-17 16:36:05.470532: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
    2020-06-17 16:36:05.483133: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 3400000000 Hz
    2020-06-17 16:36:05.487879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fffe52ac420 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
    2020-06-17 16:36:05.488038: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
    tf.Tensor([4. 6.], shape=(2,), dtype=float32)
    
    opened by jflam 23
  • [installation] Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)

    Hi,

    After following the steps described in https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-wsl till pip install tensorflow-directml,

    the error appeared as

    ERROR: Could not find a version that satisfies the requirement tensorflow-directml (from versions: none)
    ERROR: No matching distribution found for tensorflow-directml

    BTW, I am using python 3.8

    and I did python list tensorflow*, which outputted

        Package    Version
        ---------- -------------------
        certifi    2020.6.20
        pip        20.1.1
        setuptools 49.2.0.post20200714
        wheel      0.34.2

    opened by shuwang1 19
  • How to get available devices and set a specific device in Pytorch-DML?

    Hi. To access the available devices in PyTorch, we'd normally do:

        print(f'available devices: {torch.cuda.device_count()}')
        print(f'current device: { torch.cuda.current_device()}')
    

    However, I noticed this fails (AssertionError: Torch not compiled with CUDA enabled).
    I thought the transition would be minimal and that calls like this would work out of the box, especially after noting that we can't write:

        print(f'available devices: {torch.dml.device_count()}')
        print(f'current device: { torch.dml.current_device()}')
    

    as it fails with the error :

    AttributeError: module 'torch.dml' has no attribute 'device_count'
    

    Apart from this, trying to specify a device using the form "dml:number" fails if number >= 1; that is, this fails for "dml:1":

    import torch 
    import time
    def bench(device ='cpu'):
        print(f'running on {device}:')
        a = torch.randn(size=(2000,2000)).to(device=device)
        b = torch.randn(size=(2000,2000)).to(device=device)
       
        start = time.time()
        c = a+b
        end = time.time()
        
        # print(f'available devices: {torch.dml.device_count()}')
        # print(f'current device: { torch.dml.current_device()}')
        print(f'--took {end-start:.2f} seconds')
    
    bench('cpu')
    bench('dml')
    bench('dml:0')
    bench('dml:1')    
    

    it outputs :

    running on cpu:
    --took 0.00 seconds
    running on dml:
    --took 0.01 seconds
    running on dml:0:
    --took 0.00 seconds
    running on dml:1:
    

    and that's it; it doesn't execute when it comes to "dml:1".

    Also, trying to do:

    import torch 
    import time
    def bench(device ='cpu'):
        print(f'running on {device}:')
        a = torch.randn(size=(2000,2000)).to(device=device)
        b = torch.randn_like(a).to(device=device)
        
        start = time.time()
        c = a+b
        end = time.time()
        
        # print(f'available devices: {torch.dml.device_count()}')
        # print(f'current device: { torch.dml.current_device()}')
        print(f'--took {end-start:.2f} seconds')
    
    bench('cpu')
    bench('dml')
    bench('dml:0')
    bench('dml:1')    
    

    Fails with the following error :

    running on cpu:
    --took 0.00 seconds
    running on dml:
    Traceback (most recent call last):
      File "g:\tests.py", line 1246, in <module>
        bench('dml')
      File "g:\tests.py", line 1235, in bench
        b = torch.randn_like(a).to(device=device)
    RuntimeError: Could not run 'aten::normal_' with arguments from the 'UNKNOWN_TENSOR_TYPE_ID' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom 
    build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::normal_' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
    
    CPU: registered at D:\a\_work\1\s\build\aten\src\ATen\RegisterCPU.cpp:5926 [kernel]
    BackendSelect: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
    Named: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\core\NamedRegistrations.cpp:11 [kernel]
    AutogradOther: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCPU: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCUDA: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradXLA: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradNestedTensor: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    UNKNOWN_TENSOR_TYPE_ID: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse1: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse2: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse3: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    Tracer: registered at D:\a\_work\1\s\torch\csrc\autograd\generated\TraceType_4.cpp:10612 [kernel]
    Autocast: fallthrough registered at D:\a\_work\1\s\aten\src\ATen\autocast_mode.cpp:250 [backend fallback]
    Batched: registered at D:\a\_work\1\s\aten\src\ATen\BatchingRegistrations.cpp:1016 [backend fallback]
    VmapMode: registered at D:\a\_work\1\s\aten\src\ATen\VmapModeRegistrations.cpp:37 [kernel]
    
    
    pytorch-directml 
    opened by Coderx7 11
  • Conv2D-Fail: internal compiler error, abnormal program termination

    I ran across DirectML a few hours ago and am currently playing around with it on a Surface Pro 6 with an Intel HD Graphics 620. To set it all up, I followed this article to the letter: https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-windows

    For testing purposes, I used a slightly modified version of my small go-to script:

    import tensorflow.compat.v1 as tf 
    
    tf.enable_eager_execution(tf.ConfigProto(log_device_placement=False)) 
    
    fashion_mnist = tf.keras.datasets.fashion_mnist
    (train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
    
    
    class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',
                   'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']
    
    train_images = train_images.reshape(60000, 28, 28, 1)
    train_images = train_images / 255.0
    
    test_images = test_images.reshape(10000, 28, 28, 1)
    test_images = test_images / 255.0
    
    #model = tf.keras.Sequential([
    #    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    #    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    #    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    #])
    
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(64, (3,3), activation=tf.nn.relu, input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2,2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation=tf.nn.relu),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
    ])
    
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    
    model.fit(train_images, train_labels, epochs=5)
    
    test_loss, test_acc = model.evaluate(test_images,  test_labels, verbose=2)
    
    print('Test accuracy:', test_acc)
    

    The version of the model without convolutions runs absolutely fine. But as soon as I add the Conv2D layer, nothing works anymore.

    The entire output I get is:

    2021-04-23 21:23:05.241248: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library C:\Users\cyphus309\.conda\envs\directml\lib\site-packages\tensorflow_core\python/directml.b6e3bc69b89cfca5486e178bb9d51724d0c4a94a.dll
    2021-04-23 21:23:05.298554: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:249] DirectML device enumeration: found 1 compatible adapters.
    2021-04-23 21:23:05.299189: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
    2021-04-23 21:23:05.331743: I tensorflow/core/common_runtime/dml/dml_device_cache.cc:185] DirectML: creating device on adapter 0 (Intel(R) HD Graphics 620)
    2021-04-23 21:23:05.363568: I tensorflow/stream_executor/platform/default/dso_loader.cc:99] Successfully opened dynamic library Kernel32.dll
    Train on 60000 samples
    Epoch 1/5
    
    internal compiler error, abnormal program termination
    
    

    Any ideas?

    bug 
    opened by kampfhamster309 11
  • Tensorflow directml crashes my python session

    Hi,

    I've recently purchased a 6900 xt GPU which I would like to use with tensorflow. I followed the installation guide on https://docs.microsoft.com/en-us/windows/win32/direct3d12/gpu-tensorflow-windows which worked but the issue I have now is that whenever I try to use tensorflow it closes my python environment.

    I've attached an image to show what I mean. I can import tensorflow fine and it shows me that I have version 1.15.5 available. The problem is when I want to check if my GPU is available I get two messages and then it crashes me out of my python environment.

    Does anybody know how to solve this issue and what is going on?

    Thank you in advance!

    [Screenshot: amd_tf_problem]

    bug 
    opened by bwintertkb 9
  • C++ DirectML.dll causes crash in debug x64 mode when using NuGet package Microsoft.AI.MachineLearning 1.5.2

    Hello,

    I'm experiencing a runtime crash with the C++ DirectML API in Debug x64 mode after upgrading my NuGet package Microsoft.AI.MachineLearning from version 1.4.0 to 1.5.2. There is no error in Release x64 mode.

    The reason why I'm using this package is because the included DirectML.dll improves DirectML performance greatly. There seems to be an issue when creating a DirectMLOperator. The operator type is DML_OPERATOR_JOIN.

    Can you please help me identify the issue? Also how can I find the latest DirectML.dll file without downloading the package?

    [Screenshot: DirectML dll error]

    opened by momower1 9
  • Performance will be improved by setting input strides=output strides for Clip in DirectMLX

    I am investigating the performance of MobileNet V2 from TFLite models with the "nhwc" layout and MobileNet V2 from ONNX models with the "nchw" layout, implemented with the DirectML and DirectMLX APIs.

    I find that the nhwc MobileNetV2 model has many Clip operators after Conv2d, and these Clips take a lot of time during inference. I guess that Clip does a memory copy and hasn't been optimized in the compilation stage.

    I have a workaround for this problem: set Clip's input strides to be the same as its output strides by changing this line to TensorDesc outputTensor = inputTensor in DirectMLX.h. The Clip is then optimized as if it were fused into Conv2d, and the inference time drops significantly, to about the same as nchw MobileNetV2.

    When building the nhwc MobileNetV2 model, we need to append an Identity after each Conv2d to transpose the output tensor from the default nchw to nhwc, then transpose this output tensor from nhwc back to nchw as the next Conv2d's input tensor. I suppose that DML can optimize the Identity and Reinterpret in this model, as in Conv0->Identity(nchw->nhwc)->Reinterpret strides(nhwc->nchw)->Conv1, similar to transpose sinking in the OpenVINO backend.

    I guess that the Identity and Reinterpret sinking may be blocked when there is a Clip, as in Conv0->Identity(nchw->nhwc)->Clip->Reinterpret strides(nhwc->nchw)->Conv1. I verified that if I remove the Identity and run Conv0->Reinterpret strides(nchw->nhwc)->Clip(input strides = output strides)->Reinterpret strides(nhwc->nchw)->Conv1, the inference time is much lower than before.

    So in conclusion, I suggest setting Clip's input strides to be the same as its output strides by changing this line to TensorDesc outputTensor = inputTensor in DirectMLX.h.
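
To make the stride discussion concrete, the sketch below computes element strides for the same logical N, C, H, W sizes under a packed nchw layout and under an nhwc layout. It is an illustration of the layout arithmetic only, not DirectMLX code; it assumes the tensor description keeps its sizes in N, C, H, W order and expresses the memory layout purely through strides. The function name packed_strides and the example sizes are hypothetical.

```python
# Sketch only: strides (in elements) for one logical N, C, H, W shape
# stored in two different memory layouts.
# packed_strides and the example sizes are hypothetical.


def packed_strides(sizes, order):
    """Return strides for `sizes`, where `order` lists the dimension indices
    from outermost to innermost in memory."""
    strides = [0] * len(sizes)
    step = 1
    # Walk from the innermost dimension outwards, accumulating the step size.
    for dim in reversed(order):
        strides[dim] = step
        step *= sizes[dim]
    return strides


sizes = [1, 3, 4, 5]  # logical N, C, H, W

nchw_strides = packed_strides(sizes, order=[0, 1, 2, 3])  # W is innermost
nhwc_strides = packed_strides(sizes, order=[0, 2, 3, 1])  # C is innermost

print(nchw_strides)  # [60, 20, 5, 1]
print(nhwc_strides)  # [60, 1, 15, 3]
```

When an operator's input and output use the same strides, each output element sits at the same offset as its input element, so no reordering of memory is implied by the operator itself.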

    opened by mingmingtasd 8
  • TensorFlow & DirectML & ROCm  performance and roadmap

    The current DirectML library for GPU is more than 2x slower than the TensorFlow CPU library. When will the DirectML team improve the library's performance? Could you share a DirectML roadmap? Will the DirectML team cooperate with the ROCm team (https://github.com/RadeonOpenCompute/ROCm), Intel, and NVIDIA to improve performance?

    opened by YuriyTigiev 8
  • pytorch-directml simple command error

    just trying simple command with pytorch-directml 1.8.0a0.dev220224 and getting error

    >>> torch.tensor([1], dtype=torch.float32, device='dml')
    
    Traceback (most recent call last):
      File "<console>", line 1, in <module>
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\tensor.py", line 193, in __repr__
        return torch._tensor_str._str(self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 383, in _str
        return _str_intern(self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 358, in _str_intern
        tensor_str = _tensor_str(self, indent)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 242, in _tensor_str
        formatter = _Formatter(get_summarized_data(self) if summarize else self)
      File "D:\DevelopPPP\projects\DeepFakeBox\_internal\python\lib\site-packages\torch\_tensor_str.py", line 90, in __init__
        nonzero_finite_vals = torch.masked_select(tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0))
    RuntimeError: Could not run 'aten::masked_select' with arguments from the 'UNKNOWN_TENSOR_TYPE_ID' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build). If you are a Facebook employee using PyTorch on mobile, please visit https://fburl.com/ptmfixes for possible resolutions. 'aten::masked_select' is only available for these backends: [CPU, BackendSelect, Named, AutogradOther, AutogradCPU, AutogradCUDA, AutogradXLA, AutogradNestedTensor, UNKNOWN_TENSOR_TYPE_ID, AutogradPrivateUse1, AutogradPrivateUse2, AutogradPrivateUse3, Tracer, Autocast, Batched, VmapMode].
    
    CPU: registered at D:\a\_work\1\s\pytorch-directml\build\aten\src\ATen\RegisterCPU.cpp:5926 [kernel]
    BackendSelect: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\core\BackendSelectFallbackKernel.cpp:3 [backend fallback]
    Named: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\core\NamedRegistrations.cpp:11 [kernel]
    AutogradOther: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCPU: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradCUDA: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradXLA: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradNestedTensor: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    UNKNOWN_TENSOR_TYPE_ID: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse1: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse2: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    AutogradPrivateUse3: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\VariableType_4.cpp:8893 [autograd kernel]
    Tracer: registered at D:\a\_work\1\s\pytorch-directml\torch\csrc\autograd\generated\TraceType_4.cpp:10612 [kernel]
    Autocast: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\autocast_mode.cpp:250 [backend fallback]
    Batched: registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\BatchingRegistrations.cpp:1016 [backend fallback]
    VmapMode: fallthrough registered at D:\a\_work\1\s\pytorch-directml\aten\src\ATen\VmapModeRegistrations.cpp:33 [backend fallback]
    

    cpu is fine

    >>> torch.tensor([1], dtype=torch.float32, device='cpu')
    tensor([1.])
    
    pytorch-directml 
    opened by iperov 7
  • Is there any low power mode for DirectML

    Is there any low power mode for DirectML

    Hi, I now have a fast enough model (120 fps) that will run at 20 fps, and what I need is to use as little GPU power as possible. But I find that the GPU frequency jumps to 1150 MHz too often. By comparison, with Tencent Meeting (https://voovmeeting.com/download-center.html?from=1001), when I enable human segmentation on an 8xxx laptop, the GPU frequency stays below 400 MHz while GPU load is over 75%, which is a strange frequency policy.
    So I guess that maybe DirectX 12 or DX11 has some low-power mode? Or is there some other way, for example adding a wait in each op (such as the convolution op)?
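
The numbers in this question imply an idle budget per frame, which can be sketched as application-side frame pacing. This is only an illustration of the arithmetic and of waiting between frames (not inside each op); whether it keeps GPU clocks low depends on the driver's power policy, which the sketch does not address. The function names and the run_inference callback are hypothetical.

```python
# Sketch only: compute the idle time per frame and pace an inference loop.
# idle_time_per_frame, run_paced, and run_inference are hypothetical names.
import time


def idle_time_per_frame(target_fps, inference_fps):
    """Seconds left over in each frame after inference, never negative."""
    frame_budget = 1.0 / target_fps
    inference_time = 1.0 / inference_fps
    return max(0.0, frame_budget - inference_time)


def run_paced(run_inference, target_fps, num_frames):
    """Call run_inference once per frame and sleep for the rest of the frame budget."""
    frame_budget = 1.0 / target_fps
    for _ in range(num_frames):
        start = time.perf_counter()
        run_inference()
        elapsed = time.perf_counter() - start
        time.sleep(max(0.0, frame_budget - elapsed))


# A model capable of 120 fps, run at 20 fps, leaves roughly 41.7 ms idle per frame.
print(f"{idle_time_per_frame(20, 120) * 1000:.1f} ms")
```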

    opened by liyuming1978 7
  • pytorch-directml produces "[W dml_heap_allocator.cc:97] DML allocator out of memory!"

    I was trying to run the simple code below:

        import torch
        import torch_directml
        dml = torch_directml.device()

        print(f"dml={dml}")

        tensor1 = torch.tensor([1])
        print(tensor1)
        tensor1=tensor1.to(dml)

    When running tensor1.to(dml), I got the following error:

        [W dml_heap_allocator.cc:97] DML allocator out of memory!
        Traceback (most recent call last):
          File "/home/fnz/workspace/direct-ml/main.py", line 9, in <module>
            tensor1=tensor1.to(dml)
        RuntimeError: Unknown error -2147024882

    It seems that my pytorch-directml doesn't work at all.

    Below are the packages in my conda environment:

        (direct_ml) fnz@fnz-lenovo:~/workspace/direct-ml$ conda list | grep torch
        torch           1.13.1            pypi_0  pypi
        torch-directml  0.1.13.dev221216  pypi_0  pypi

    BTW, my environment is wsl2 on top of windows 11 pro .

    The tensorflow directml seems working well.

    any idea ?

    thanks

    Feng

    opened by virtual-feng 1
  • torch-directml : torch.div with trunc rounding on int64 fails with RuntimeError

    Hi. Because 'aten::fmod.Tensor_out' is not implemented, I tried to implement it myself. I encountered a new error when using the rounding mode trunc with an int64 tensor.

    Code:

    import torch
    import torch_directml
    dml = torch_directml.device()
    
    a = torch.tensor([1,2,3]).to(dml)
    b = 2
    a = a - torch.div(a, b, rounding_mode="trunc") * b
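
The last line of the snippet relies on the identity fmod(a, b) = a - trunc(a / b) * b. As a sanity check of that identity alone, independent of any GPU backend, the plain-Python sketch below compares it against math.fmod; the function name fmod_via_trunc_div is hypothetical.

```python
# Sketch only: check the truncated-division identity behind fmod using
# the Python standard library. fmod_via_trunc_div is a hypothetical name.
import math


def fmod_via_trunc_div(a, b):
    """Remainder whose sign follows the dividend: a - trunc(a / b) * b."""
    return a - math.trunc(a / b) * b


for a in [1, 2, 3, -7]:
    assert fmod_via_trunc_div(a, 2) == math.fmod(a, 2)

# Truncation differs from Python's floor-based % for negative dividends.
print(fmod_via_trunc_div(-7, 2))  # prints -1
print(-7 % 2)                     # prints 1
```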
    
    opened by Theucalyptus 0
  • Very low validation and testing accuracy on CNN

    Hello everyone. I am facing an issue. I have a traffic and road sign dataset that contains 43 classes, and I am trying to classify the images using the pre-trained resnet34 model. I run the model on my AMD RX6600 GPU using PyTorch DirectML.

    Up to this point everything works fine. Training speed is fast enough, GPU utilization is near 100%, and training loss decreases every epoch. But when I check the model on validation data after one training phase, validation loss increases and validation accuracy is very low, even though training looks fine.

    When I run the same code on my friend's PC, which has an NVIDIA GPU, everything is fine: validation loss decreases, the model converges, and I get an accuracy of 98%. I cannot figure out what the problem is. I also tuned the hyperparameters but had no luck.

    One strange thing is that this problem only arises when I use a CNN-based model. I have run the pre-trained NLP model BERT on my AMD GPU without any issue: validation loss decreases and it converges. Can anyone help me with this issue? I am giving the code below. Thanks in advance.

    [Screenshot 2023-01-03 221733]

    opened by AtiqurRahmanAni 0
  • Spacy seems outdated + problems running attention...

    Disclaimer: NOT a coder. Generally curious individual with just enough copy-paste and google skills. I may not know what I'm talking about.

    Just playing around with the repo. The install failed for me because of the spaCy version in requirements.txt. I am using Python 3.10 on Ubuntu 22.10. Changing spaCy to 3.4.4 let it install (I had that version cached, so I just ran pip install spacy to see which version worked).

    It installed, but gave further warnings like:

        ⚠ As of spaCy v3.0, shortcuts like 'en' are deprecated. Please use the full pipeline package name 'en_core_web_sm' instead. Collecting en-core-web-sm==3.4.1...

    and

        ⚠ As of spaCy v3.0, shortcuts like 'de' are deprecated. Please use the full pipeline package name 'de_core_news_sm' instead. Collecting de-core-news-sm==3.4.0

    opened by Vidyut 0
  • Operator 'aten::amax.out' is not currently supported on the DML backend.

    C:\ProgramData\Anaconda3\envs\torchdml\lib\site-packages\torch\optim\adamax.py:231: UserWarning: The operator 'aten::amax.out' is not currently supported on the DML backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:16.) torch.amax(norm_buf, 0, keepdim=False, out=exp_inf)

    opened by rmskmr05 0
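The CPU-fallback message in the last issue above is reported as a Python `UserWarning`. One way to find out which call triggers such a fallback is to escalate that warning to an exception while a single step runs. The following is a minimal sketch using only the standard `warnings` module. The helper `run_with_fallback_as_error` and the stand-in `fake_optimizer_step` are hypothetical names introduced for illustration, and the sketch assumes the real fallback message reaches Python's warning machinery the way the log suggests.

```python
import warnings

# Message fragment copied from the fallback warning quoted in the issue above.
DML_FALLBACK_PATTERN = r".*is not currently supported on the DML backend.*"


def run_with_fallback_as_error(step):
    """Run step() and raise if it emits a DML CPU-fallback UserWarning."""
    with warnings.catch_warnings():
        # Escalate only the fallback warning; other warnings keep their default handling.
        warnings.filterwarnings(
            "error", message=DML_FALLBACK_PATTERN, category=UserWarning
        )
        return step()


def fake_optimizer_step():
    # Hypothetical stand-in for optimizer.step(); it emits the same text as the real warning.
    warnings.warn(
        "The operator 'aten::amax.out' is not currently supported on the DML backend "
        "and will fall back to run on the CPU.",
        UserWarning,
    )
    return "stepped"


try:
    run_with_fallback_as_error(fake_optimizer_step)
    result = "no fallback detected"
except UserWarning as exc:
    result = f"fallback detected: {exc}"
print(result)
```

In a real training loop, the callable passed in would wrap the actual optimizer or forward step, so that the resulting traceback points at the operator that is falling back to the CPU.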
Releases(tensorflow-directml-1.15.3.dev200626)
Owner
Microsoft
Open source projects and samples from Microsoft
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 7, 2023
High performance implementation of Extreme Learning Machines (fast randomized neural networks).

High Performance toolbox for Extreme Learning Machines. Extreme learning machines (ELM) are a particular kind of Artificial Neural Networks, which sol

Anton Akusok 174 Dec 7, 2022
PyTorch extensions for high performance and large scale training.

Description FairScale is a PyTorch extension library for high performance and large scale training on one or multiple machines/nodes. This library ext

Facebook Research 2k Dec 28, 2022
A high performance and generic framework for distributed DNN training

BytePS BytePS is a high performance and general distributed training framework. It supports TensorFlow, Keras, PyTorch, and MXNet, and can run on eith

Bytedance Inc. 3.3k Dec 28, 2022
Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices. It bridges the gap between any machine learning models you just trained and the efficient online service API.

null 164 Jan 4, 2023
High performance Python GLMs with all the features!

QuantCo 200 Dec 14, 2022
SIMD-accelerated bitwise hamming distance Python module for hexadecimal strings

hexhamming What does it do? This module performs a fast bitwise hamming distance of two hexadecimal strings. This looks like: DEADBEEF = 1101111010101

Michael Recachinas 12 Oct 14, 2022
AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just a few lines of code, you can train and deploy high-accuracy machine learning and deep learning models on tabular data.

Robin 55 Dec 27, 2022
AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications.

AutoTabular AutoTabular automates machine learning tasks enabling you to easily achieve strong predictive performance in your applications. With just

wenqi 2 Jun 26, 2022
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Vowpal Wabbit 8.1k Dec 30, 2022
Implementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
cuML - RAPIDS Machine Learning Library

cuML - GPU Machine Learning Algorithms cuML is a suite of libraries that implement machine learning algorithms and mathematical primitives functions t

RAPIDS 3.1k Dec 28, 2022
mlpack: a scalable C++ machine learning library

a fast, flexible machine learning library. Download: current stable version (3.4.2) mlpack

mlpack 4.2k Jan 1, 2023
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 29, 2022
MLBox is a powerful Automated Machine Learning python library.

MLBox is a powerful Automated Machine Learning python library. It provides the following features: Fast reading and distributed data preprocessing/cle

Axel 1.4k Jan 6, 2023
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technique[1]* as a handy library in Python. Feature weighted linear stacking is also availab

null 114 Jul 19, 2022
Uber Open Source 1.6k Dec 31, 2022