CUDA integration for Python, plus shiny features

Overview

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

  • Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
  • Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia's C-based runtime.
  • Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish. It also includes code for interoperability with OpenGL.
  • Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
  • Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.
  • Helpful Documentation and a Wiki.
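The first bullet (cleanup tied to object lifetime, with dependency tracking) can be illustrated without a GPU. The following is a minimal sketch in plain Python, assuming hypothetical `Context` and `Allocation` stand-in classes; it is not PyCUDA's implementation, only the general idiom of an allocation keeping its context alive until the allocation itself is released.

```python
import weakref


class Context:
    """Hypothetical stand-in for a GPU context (not PyCUDA's class)."""

    def __init__(self, log):
        # The finalizer runs when this object is garbage collected.
        weakref.finalize(self, log.append, "context detached")


class Allocation:
    """Hypothetical device allocation that keeps its context alive."""

    def __init__(self, ctx, nbytes, log):
        self._ctx = ctx  # strong reference: the dependency on the context
        weakref.finalize(self, log.append, f"freed {nbytes} bytes")


log = []
ctx = Context(log)
buf = Allocation(ctx, 1024, log)

del ctx  # buf still references the context, so it must stay alive
detached_early = "context detached" in log

del buf  # last reference gone: the memory goes first, then the context
print(detached_early, log)
```

With CPython's reference counting, the context is only torn down after the allocation that depends on it has been released, which is the ordering the bullet describes.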

Relatedly, like-minded computing goodness for OpenCL is provided by PyCUDA's sister project PyOpenCL.

Comments
  • Shipped Boost.Python is incompatible with Python 3.11

    Hello,

    To test Python 3.11 with PyCUDA, I have just installed the latest PyCUDA version, 2022.1.

    I use CUDA 11.6.2 on a Windows 11 laptop.

    When I run my program (which uses PyCUDA), I get this error message:

    Traceback (most recent call last):
      File "D:\Alain\Astro\Soft\PC\PC_Traitement_Video_V7_51g.py", line 19, in <module>
        import pycuda.driver as drv
      File "C:\Users\apail\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycuda\driver.py", line 65, in <module>
        from pycuda._driver import *  # noqa
    SystemError: type Boost.Python.enum has the Py_TPFLAGS_HAVE_GC flag but has no traverse function
    

    Any help will be appreciated.

    Alain

    bug 
    opened by easybob95 22
  • DeviceMemoryPool limited to 2**32 bytes (4GB)

    Working with arrays larger than 4 GB leads to a memory error when using a DeviceMemoryPool (cuMemcpyHtoD failed: invalid argument). The allocation itself (without a pool) is not a problem above 4GB. Tested under debian9/x86_64 (P6000) and ubuntu20.04/ppc64le with a V100.

    A simple test:

    import pycuda.autoinit
    import pycuda.gpuarray as cua
    import pycuda.tools as cut
    import numpy as np
    
    m = cut.DeviceMemoryPool()
    
    a = np.ones(2**30 - 1, dtype=np.float32)  # 2**32 - 4 bytes
    b = cua.to_gpu(a, allocator=m.allocate)  # Passes
    
    a = np.ones(2**30, dtype=np.float32)  # exactly 2**32 bytes
    b = cua.to_gpu(a, allocator=m.allocate)  # Error: cuMemcpyHtoD failed: invalid argument
    

    This is going to become a more frequent issue as data sizes grow along with the cards' memory.

    The reason seems to be the use of Boost's Allocator::size_type, which apparently maps to uint32. However, given how intricate the Boost headers are, I failed to find exactly where that is defined, so I am not sure how to tackle this issue.
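The arithmetic behind the two test cases can be checked on the host. This is a small sketch, assuming (as the report hypothesizes, without confirmation) that the byte count is truncated to an unsigned 32-bit integer somewhere along the way; `nbytes_as_uint32` is a hypothetical helper that only mimics that truncation.

```python
import ctypes

ITEMSIZE = 4  # bytes per float32 element


def nbytes_as_uint32(n_elements):
    """Byte count as it would look after truncation to 32 bits."""
    return ctypes.c_uint32(n_elements * ITEMSIZE).value


passing = nbytes_as_uint32(2**30 - 1)  # the size that works in the report
failing = nbytes_as_uint32(2**30)      # the size that fails in the report

print(passing, failing)
```

The passing case is the largest float32 array whose byte count still fits in 32 bits; one more element pushes the count to exactly 2**32, which a 32-bit unsigned integer cannot represent.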

    @inducer I'd be happy to work on a PR, but I'd really need some pointers on where to search for the origin of size_type. Also, is the Boost subset up to date?

    opened by vincefn 14
  • Allocating arrays greater than 4GB

    Setup:

    • pycuda: 2018.1.1+cuda100,
    • Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
    • Windows Server 2012 R2 Standard

    When I try to move an np.ndarray to the GPU as follows:

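    import numpy as np
    from pycuda import autoinit, gpuarray  # autoinit creates the CUDA context
    x = np.random.rand(900, 700, 2000)  # float64, roughly 10 GB
    gpuarray.to_gpu(x)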
    

    I get:

    C:\ProgramData\Anaconda3\lib\site-packages\pycuda\gpuarray.py in __init__(self,
    shape, dtype, allocator, base, gpudata, strides, order)
        208         if gpudata is None:
        209             if self.size:
    --> 210                 self.gpudata = self.allocator(self.size * self.dtype.itemsize)
        211             else:
        212                 self.gpudata = None
    
    OverflowError: Python int too large to convert to C unsigned long
    

    It appears I can't allocate any arrays larger than 4 GB. Is there any way to fix this issue?
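The numbers in this report can be reproduced without a GPU. With MSVC on 64-bit Windows, the C type `unsigned long` is 32 bits wide, so a byte count above 2**32 - 1 cannot be converted to it. The sketch below uses the standard-size `"<L"` format of Python's `struct` module as a stand-in for such a 32-bit unsigned long; `fits_in_32bit_ulong` is a hypothetical helper, and where exactly the conversion happens inside PyCUDA is not shown in the report.

```python
import struct

shape = (900, 700, 2000)
itemsize = 8  # np.random.rand returns float64

nbytes = itemsize
for dim in shape:
    nbytes *= dim


def fits_in_32bit_ulong(value):
    """True if value can be stored in a 4-byte unsigned long."""
    try:
        struct.pack("<L", value)  # "<L" is always exactly 4 bytes
        return True
    except struct.error:
        return False


print(nbytes, fits_in_32bit_ulong(nbytes))
```

The requested allocation is 10,080,000,000 bytes, more than twice what a 32-bit unsigned value can hold.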

    opened by rhacking 14
  • Fix operations issues on large arrays

    About

    This MR fixes a "pycuda hanging forever" issue when array sizes exceed 2**34 bytes. It's done by replacing some occurrences of unsigned (int) with size_t in template kernels (element-wise, reduction, scan).

    Close #375

    The tests had to be done on arrays of double to avoid numerical issues.

    ElementWise

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.elementwise import ElementwiseKernel
    
    eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
    d_arr = garray.empty((512, 2048, 2048), np.float64)
    eltwise(d_arr)
    result = d_arr.get()[()]
    reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
    assert np.allclose(result, reference)
    

    Reduction

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.reduction import ReductionKernel
    
    reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b", map_expr="x[i]", arguments="double* x")
    d_arr = garray.zeros((512, 2048, 2048), np.float64)
    d_arr.fill(1) # elementwise
    result = reduction(d_arr.ravel()).get()[()]
    assert result == d_arr.size
    

    Scan

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as garray
    from pycuda.scan import InclusiveScanKernel
    
    cumsum = InclusiveScanKernel(np.float64, "a+b")
    d_arr = garray.zeros((512, 2048, 2048), np.float64)
    d_arr.fill(1)
    result = cumsum(d_arr.ravel()).get()[()]
    assert result[-1] == d_arr.size
    
    opened by pierrepaleo 13
  • pycuda._driver.Error: cuInit failed: unknown error

    Environment: WSL, Ubuntu 18. I have made sure that CUDA is connected (screenshot omitted). nvidia-smi output: (screenshot omitted).

    I get an error when I use PyCUDA. There is no error when I enter this code (screenshot omitted). I then ran sudo nvidia-modprobe -u and rebooted my device, but it still doesn't work. What should I do?

    opened by Mulbetty 13
  • copy() for some discontiguous arrays; __setitem__; get2() provisional…

    Adds a private function _copy() that copies a GPUArray or ndarray to another GPUArray or ndarray. The two arrays must have the same shape and dtype and must have at most three dimensions. They must have the same order and must be contiguous along the minor axis, but otherwise don't have to have the same strides. Sorry that it's verbose; I can compact it later if it's decided to keep it.

    This function is used in copy() and __setitem__(), and in a dumbly-named get2() method which doesn't automatically reshape arrays with the same size but different shape. I wasn't sure what the right thing to do here was.

    There isn't an asynchronous version because I'm not familiar yet with how that works.

    opened by davidweichiang 13
  • pycuda with multiple Python host threads

    I want to use multiple threads with Python on the host. I tried the following methods:

    1. Use autoinit in the main thread and run TensorRT inference in a Python ThreadPoolExecutor. This gives a "no activity context" error when calling the CUDA API cuda.mem_alloc.
    2. Create the ThreadPoolExecutor with an initializer like this:
    import pycuda.driver as cuda

    def cuda_ctx_init():
        _device = cuda.Device(0)
        ctx = _device.make_context()  # runs once in each worker thread
    

    After doing this, I can successfully allocate GPU memory with cuda.mem_alloc in the thread, but I get another error when running TensorRT inference:

    Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
    Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
     ../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 77 (an illegal memory access was encountered)
    terminate called after throwing an instance of 'nvinfer1::CudaError'
    

    I found a C++ solution ("DLA and GPU cores at the same time") and a Python solution ("How to use TensorRT by the multi-threading package of python").

    I noticed that, unlike the C++ version, the Python version requires additional operations:

    
    ctx.push()
    
    ctx.pop()
    

    In my tests, these operations take too much time on each call. Is there any other way to use PyCUDA with multiple Python threads?
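The push/pop bracketing mentioned above can be wrapped in a context manager so that every push is matched by a pop even when the work raises. This is only a sketch of the pattern: `FakeContext` is a hypothetical stand-in that exposes `push()` and `pop()` so the code runs without a GPU, and `activated` and `work` are illustrative names. It does not address the cost of push/pop, and whether one context may be shared between threads depends on the driver and on TensorRT.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager


class FakeContext:
    """Hypothetical stand-in exposing push()/pop() like a CUDA context."""

    def __init__(self):
        self._local = threading.local()  # per-thread "is current" depth

    def push(self):
        self._local.depth = getattr(self._local, "depth", 0) + 1

    def pop(self):
        self._local.depth -= 1

    def is_current(self):
        return getattr(self._local, "depth", 0) > 0


@contextmanager
def activated(ctx):
    """Make ctx current for the calling thread, and always undo it."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()


def work(ctx, item):
    with activated(ctx):
        assert ctx.is_current()  # GPU calls would go here
        return item * 2


shared_ctx = FakeContext()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda x: work(shared_ctx, x), range(8)))

print(results)
```

With a real context object in place of `FakeContext`, the body of `work` is where calls such as cuda.mem_alloc would run.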

    Thanks!

    opened by trobr 10
  • Add test for reverse slicing

    Hi,

    I am currently trying to use reverse slicing with a gpuarray (e.g. [end:start:-step]), but I am getting an error. I have quickly written a new test to reproduce the bug (feel free to discard my merge request if you are able to work on it).

    >       copy.src_pitch = src_strides[1]
    E       OverflowError: can't convert negative value to unsigned int
    
    /usr/local/lib/python3.5/dist-packages/pycuda-2017.1.1-py3.5-linux-x86_64.egg/pycuda/gpuarray.py:1300: OverflowError
    

    I have been through the code looking for the definition of src_pitch and I suppose that it comes from cudaMemcpy defined by Nvidia, right? Therefore, it would not be possible to change the unsigned int to a signed one.
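Why a reversed slice runs into an unsigned field can be seen on the host with the standard library alone. The sketch below uses a `memoryview` over an `array` of floats as a stand-in for a GPU array: reversing it yields a negative stride, and a negative value cannot be packed into an unsigned 32-bit integer. `as_unsigned_int` is a hypothetical helper, and the `"<I"` format is only an analogy for the unsigned pitch field in the traceback.

```python
import struct
from array import array

data = array("f", [0.0, 1.0, 2.0, 3.0])
forward = memoryview(data)
backward = forward[::-1]  # same buffer, walked from the end


def as_unsigned_int(value):
    """Return value if it fits an unsigned 32-bit int, else None."""
    try:
        struct.pack("<I", value)
        return value
    except struct.error:
        return None


print(forward.strides, backward.strides)
print(as_unsigned_int(forward.strides[0]), as_unsigned_int(backward.strides[0]))
```

A copy routine that describes the source by a non-negative pitch therefore has no direct way to express the reversed view; one option is to copy the forward view and reverse afterwards.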

    opened by loikki 10
  • Windows+Py3.8: A dynamic link library (DLL) initialization routine failed.

    @cgohlke reported in #213, moved here:

    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed

    > py -3.8 -c"import pycuda._driver"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    

    It seems that this error is not due to CUDA, boost, or pybind11 since the simple _pvt_struct extension and pyopencl-2019.1 also fail:

    > py -3.8 -c"import pycuda._pvt_struct"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    
    > py -3.8 -c"import pyopencl"
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "X:\Python38\lib\site-packages\pyopencl\__init__.py", line 39, in <module>
        import pyopencl._cl as _cl
    ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.
    
    opened by inducer 9
  • Kernel cache not working (hexdigest broken?)

    On my systems the kernel caching mechanism is not working. I have tested this on both Linux and Windows.

    The checksum.hexdigest() seems to return a different value when passed the same kernel. I have no idea why... does anyone else see this behaviour?
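As a sanity check, hashing the same bytes always yields the same hex digest, so a changing digest means that something in the hashed input differs between runs. The sketch below is a generic illustration with a hypothetical `cache_key` function; it is not how PyCUDA builds its cache key and not a diagnosis of this report.

```python
import hashlib


def cache_key(source, options=()):
    """Hypothetical cache key: digest of kernel source plus build options."""
    digest = hashlib.sha256()
    digest.update(source.encode("utf-8"))
    for option in options:
        digest.update(option.encode("utf-8"))
    return digest.hexdigest()


KERNEL = "__global__ void scale(float *x) { x[threadIdx.x] *= 2.0f; }"

key_a = cache_key(KERNEL, ["-O3"])
key_b = cache_key(KERNEL, ["-O3"])                       # identical input
key_c = cache_key(KERNEL, ["-O3", "-I/tmp/build-1234"])  # run-specific path

print(key_a == key_b, key_a == key_c)
```

Any input that varies from run to run (a temporary path, a timestamp, an object address in a repr) is enough to change the key and defeat the cache.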

    opened by pwalsh0 9
  • Add in-place poisson random-number generation

    With this version, the input array can be used to supply the per-element lambda value. This is similar to what numpy.random.poisson() allows: either supplying a shape and one lambda value, or an array of lambda values.

    This is very useful when simulating detector data for imaging, where each point has a different expected value.
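The in-place, per-element semantics can be sketched on the CPU in plain Python. This example swaps in Knuth's multiplication method as the sampler, which is an assumption made for illustration and not necessarily what the GPU generator in this PR uses; `poisson_knuth` and `poisson_inplace` are hypothetical names. The method is only suitable for small lambda values.

```python
import math
import random


def poisson_knuth(lam, rng):
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def poisson_inplace(values, rng):
    """Read each entry as lambda and overwrite it with a Poisson draw."""
    for index, lam in enumerate(values):
        values[index] = poisson_knuth(lam, rng)
    return values


# Expected counts per detector pixel, replaced in place by simulated counts.
pixels = [0.0, 0.5, 4.0, 20.0]
poisson_inplace(pixels, random.Random(0))
print(pixels)
```

Each element ends up holding an integer count drawn from a Poisson distribution whose mean was the element's previous value, which matches the detector simulation use case described above.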

    Let me know if you want examples; there are currently none for the random-number generators.

    opened by vincefn 8
  • Update `_pvt_struct` based on Python 3.8

    This was a heavy-handed attempt at #395 until the simpler solution there emerged. This PR serves to preserve that effort, although there is not currently a need for this. Worse, the new struct module causes test failures.

    opened by inducer 0
  • rebase demo_cdpSimplePrint.py on the BSD licensed cdpSimplePrint.cu

    examples/demo_cdpSimplePrint.py comes with a problematic license:

     * Adapted from NVIDIA's "cdpSimplePrint - Simple Print (CUDA Dynamic Parallelism)" sample
     * http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-print--cuda-dynamic-parallelism-
     * http://ecee.colorado.edu/~siewerts/extra/code/example_code_archive/a490dmis_code/CUDA/cuda_work/samples/0_Simple/cdpSimplePrint/cdpSimplePrint.cu
     *
     * From cdpSimplePrint.cu (not sure if this is Ok with NVIDIA's 38-page EULA though...):
     * ---------------------------------------------------------------------------
     * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
     *
     * Please refer to the NVIDIA end user license agreement (EULA) associated
     * with this source code for terms and conditions that govern your use of
     * this software. Any use, reproduction, disclosure, or distribution of
     * this software and related documentation outside the terms of the EULA
     * is strictly prohibited.
     * ---------------------------------------------------------------------------
    

    Fortunately NVIDIA recently released their code samples under the 3-clause BSD license: https://github.com/NVIDIA/cuda-samples

    Please rebase that example on the updated free source code https://github.com/NVIDIA/cuda-samples/blob/master/Samples/3_CUDA_Features/cdpSimplePrint/cdpSimplePrint.cu

    bug 
    opened by anbe42 0
  • `__add__` fails for empty array inputs

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> empty_array = np.array([])
    >>> empty_array_gpu = gpuarray.to_gpu(empty_array)
    >>> result_array = empty_array + 0 # array([], dtype=float64)
    >>> result_array_gpu = empty_array_gpu + 0 # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    ArgumentError                             Traceback (most recent call last)
    Input In [29], in <cell line: 1>()
    ----> 1 empty_array_gpu + 0
    
    File ~/pycuda/pycuda/gpuarray.py:593, in GPUArray.__add__(self, other)
        590 elif np.isscalar(other):
        591     # add a scalar
        592     if other == 0:
    --> 593         return self.copy()
        594     else:
        595         result = self._new_like_me(_get_common_dtype(self, other))
    
    File ~/pycuda/pycuda/gpuarray.py:393, in GPUArray.copy(self)
        391 def copy(self):
        392     new = GPUArray(self.shape, self.dtype, self.allocator)
    --> 393     _memcpy_discontig(new, self)
        394     return new
    
    File ~/pycuda/pycuda/gpuarray.py:1570, in _memcpy_discontig(dst, src, async_, stream)
       1566         drv.memcpy_dtod_async(
       1567             dst.gpudata, src.gpudata, src.nbytes, stream=stream
       1568         )
       1569     else:
    -> 1570         drv.memcpy_dtod(dst.gpudata, src.gpudata, src.nbytes)
       1571 else:
       1572     # The arrays might be contiguous in the sense of
       1573     # having no gaps, but the axes could be transposed
       1574     # so that the order is neither Fortran or C.
       1575     # So, we attempt to get a contiguous view of dst.
       1576     dst = _as_strided(dst, shape=(dst.size,), strides=(dst.dtype.itemsize,))
    
    ArgumentError: Python argument types in
        pycuda._driver.memcpy_dtod(NoneType, NoneType, int)
    did not match C++ signature:
        memcpy_dtod(unsigned long long dest, unsigned long long src, unsigned long size)
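The trace shows the copy routine being called with `None` for both device pointers, which is what an empty array carries instead of an allocation. One possible guard is to skip the copy when there is nothing to copy. The sketch below uses hypothetical stand-ins (`FakeArray`, `device_copy`, `copy_array`) so that it runs without a GPU; it illustrates the idea and is not the fix adopted by the project.

```python
class FakeArray:
    """Hypothetical stand-in: an empty array holds gpudata=None."""

    def __init__(self, data):
        self.size = len(data)
        self.gpudata = list(data) if data else None


def device_copy(dst, src):
    """Stand-in for a device-to-device copy that rejects None pointers."""
    if dst.gpudata is None or src.gpudata is None:
        raise TypeError("copy needs real device pointers")
    dst.gpudata[:] = src.gpudata


def copy_array(src):
    """Copy src, skipping the device copy for zero-size arrays."""
    new = FakeArray([0] * src.size)
    if src.size:  # nothing to transfer for an empty array
        device_copy(new, src)
    return new


empty_copy = copy_array(FakeArray([]))
full_copy = copy_array(FakeArray([1, 2, 3]))
print(empty_copy.size, full_copy.gpudata)
```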
    
    bug 
    opened by mitkotak 0
  • `__neg__` failing for empty arrays

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> empty_array = np.array([])
    >>> empty_array_gpu = gpuarray.to_gpu(empty_array)
    >>> neg_empty_array = -empty_array # array([], dtype=float64)
    >>> neg_empty_array_gpu = -empty_array_gpu # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    error                                     Traceback (most recent call last)
    Input In [17], in <cell line: 1>()
    ----> 1 -gpuarray.to_gpu(a)
    
    File ~/pycuda/pycuda/gpuarray.py:643, in GPUArray.__neg__(self)
        641 def __neg__(self):
        642     result = self._new_like_me()
    --> 643     return self._axpbz(-1, 0, result)
    
    File ~/pycuda/pycuda/gpuarray.py:468, in GPUArray._axpbz(self, selffac, other, out, stream)
        463     raise RuntimeError(
        464         "only contiguous arrays may " "be used as arguments to this operation"
        465     )
        467 func = elementwise.get_axpbz_kernel(self.dtype, out.dtype)
    --> 468 func.prepared_async_call(
        469     self._grid,
        470     self._block,
        471     stream,
        472     selffac,
        473     self.gpudata,
        474     other,
        475     out.gpudata,
        476     self.mem_size,
        477 )
        479 return out
    
    File ~/pycuda/pycuda/driver.py:626, in _add_functionality.<locals>.function_prepared_async_call(func, grid, block, stream, *args, **kwargs)
        620     raise TypeError(
        621         "unknown keyword arguments: " + ", ".join(kwargs.keys())
        622     )
        624 from pycuda._pvt_struct import pack
    --> 626 arg_buf = pack(func.arg_format, *args)
        628 for texref in func.texrefs:
        629     func.param_set_texref(texref)
    
    error: required argument is not an integer
    
    bug 
    opened by mitkotak 2
  • `GPUArray.zeros_like | ones_like` failing for scalar inputs

    Here's the MWE

    >>> import pycuda.autoinit
    >>> import pycuda.gpuarray as gpuarray
    >>> import numpy as np
    >>> c = 42.0
    >>> zero_array = np.zeros_like(c) # array(0.)
    >>> zero_array_gpu = gpuarray.zeros_like(c) # Fails
    >>> one_array = np.ones_like(c) # array(1.)
    >>> one_array_gpu = gpuarray.ones_like(c) # Fails
    

    Here's the error trace

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Input In [15], in <cell line: 1>()
    ----> 1 gpuarray.zeros_like(c)
    
    File ~/pycuda/pycuda/gpuarray.py:1422, in zeros_like(other_ary, dtype, order)
       1421 def zeros_like(other_ary, dtype=None, order="K"):
    -> 1422     dtype, order, strides = _array_like_helper(other_ary, dtype, order)
       1423     result = GPUArray(
       1424         other_ary.shape, dtype, other_ary.allocator, order=order, strides=strides
       1425     )
       1426     zero = np.zeros((), result.dtype)
    
    File ~/pycuda/pycuda/gpuarray.py:1394, in _array_like_helper(other_ary, dtype, order)
       1392         order = "C"
       1393 elif order == "K":
    -> 1394     if other_ary.flags.c_contiguous or (other_ary.ndim <= 1):
       1395         order = "C"
       1396     elif other_ary.flags.f_contiguous:
    
    AttributeError: 'float' object has no attribute 'flags'
    
    bug 
    opened by mitkotak 0
Releases(v2022.2)