CUDA integration for Python, plus shiny features

Andreas Klöckner

Last update: Jan 2, 2023

Related tags

GPU Utilities python gpu array cuda scientific-computing gpu-computing multidimensional-arrays pycuda

Overview

PyCUDA lets you access Nvidia's CUDA parallel computation API from Python. Several wrappers of the CUDA API already exist, so what's so special about PyCUDA?

Object cleanup tied to lifetime of objects. This idiom, often called RAII in C++, makes it much easier to write correct, leak- and crash-free code. PyCUDA knows about dependencies, too, so (for example) it won't detach from a context before all memory allocated in it is also freed.
Convenience. Abstractions like pycuda.driver.SourceModule and pycuda.gpuarray.GPUArray make CUDA programming even more convenient than with Nvidia's C-based runtime.
Completeness. PyCUDA puts the full power of CUDA's driver API at your disposal, if you wish. It also includes code for interoperability with OpenGL.
Automatic Error Checking. All CUDA errors are automatically translated into Python exceptions.
Speed. PyCUDA's base layer is written in C++, so all the niceties above are virtually free.
Helpful Documentation and a Wiki.

Relatedly, like-minded computing goodness for OpenCL is provided by PyCUDA's sister project PyOpenCL.

Comments

Shipped Boost.Python is incompatible with Python 3.11

Hello,

In order to test Python 3.11 with pycuda, I have just installed the latest pycuda version: 2022.1

I use CUDA 11.6.2 on a Windows 11 laptop.

When I run my program (using pycuda), I get this error message:

Traceback (most recent call last):
  File "D:\Alain\Astro\Soft\PC\PC_Traitement_Video_V7_51g.py", line 19, in <module>
    import pycuda.driver as drv
  File "C:\Users\apail\AppData\Local\Programs\Python\Python311\Lib\site-packages\pycuda\driver.py", line 65, in <module>
    from pycuda._driver import *  # noqa
SystemError: type Boost.Python.enum has the Py_TPFLAGS_HAVE_GC flag but has no traverse function

Any help will be appreciated.

Alain

bug

opened by easybob95 22

DeviceMemoryPool limited to 2**32 bytes (4GB)
Working with arrays larger than 4 GB leads to a memory error when using a DeviceMemoryPool (cuMemcpyHtoD failed: invalid argument). The allocation itself (without a pool) is not a problem above 4GB. Tested under debian9/x86_64 (P6000) and ubuntu20.04/ppc64le with a V100.

A simple test:

import pycuda.autoinit
import pycuda.gpuarray as cua
import pycuda.tools as cut
import numpy as np

m = cut.DeviceMemoryPool()
a = np.ones(2**30 - 1, dtype=np.float32)
b = cua.to_gpu(a, allocator=m.allocate)  # Passes
a = np.ones(2**30, dtype=np.float32)
b = cua.to_gpu(a, allocator=m.allocate)  # Error: cuMemcpyHtoD failed: invalid argument

This is going to become a more frequent issue as data sizes grow along with the cards' memory.

The reason seems to be the use of Boost's Allocator::size_type, which apparently maps to uint32. However, given how intricate the Boost headers are, I failed to find exactly where that is defined, so I am not sure how to tackle this issue.
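The two sizes in the test above sit exactly on either side of the 32-bit boundary, which fits the suspicion about a uint32 size type. A minimal sketch of the arithmetic, using only the standard library (the helper name as_uint32 is made up for illustration and is not part of PyCUDA or Boost):

```python
import ctypes

ITEMSIZE = 4  # bytes per np.float32 element


def as_uint32(nbytes):
    """Return nbytes as it would look after truncation to a 32-bit unsigned int."""
    # ctypes integer types do no overflow checking; they keep the low 32 bits.
    return ctypes.c_uint32(nbytes).value


passing = (2**30 - 1) * ITEMSIZE  # 4294967292 bytes, just below 2**32
failing = 2**30 * ITEMSIZE        # 4294967296 bytes, exactly 2**32

print(passing, "->", as_uint32(passing))  # value survives unchanged
print(failing, "->", as_uint32(failing))  # value wraps around to 0
```

If the pool really does store sizes in 32 bits, the second request would no longer describe the buffer that the copy expects, which could explain the invalid-argument failure. This is only an illustration of the wrap-around, not a confirmed diagnosis.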

@inducer I'd be happy to work on a PR, but I'd really need some pointers on where to search for the origin of size_type. Also, is the Boost subset up to date?
opened by vincefn 14

Allocating arrays greater than 4GB

Setup:

pycuda: 2018.1.1+cuda100,
Python 3.6.5 |Anaconda custom (64-bit)| (default, Mar 29 2018, 13:32:41) [MSC v.1900 64 bit (AMD64)]
Windows Server 2012 R2 Standard

When I try to move an np.ndarray to the GPU as follows:

x = np.random.rand(900, 700, 2000)
gpuarray.to_gpu(x)

I get:

C:\ProgramData\Anaconda3\lib\site-packages\pycuda\gpuarray.py in __init__(self,
shape, dtype, allocator, base, gpudata, strides, order)
    208         if gpudata is None:
    209             if self.size:
--> 210                 self.gpudata = self.allocator(self.size * self.dtype.itemsize)
    211             else:
    212                 self.gpudata = None

OverflowError: Python int too large to convert to C unsigned long

It appears I can't allocate any arrays larger than 4 GB. Is there any way to fix this issue?
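For reference, the size of this request can be worked out without a GPU. The sketch below assumes float64 data (what np.random.rand returns) and relies on the fact that a C unsigned long is 32 bits wide on 64-bit Windows builds such as the MSC one listed in the setup. The helper fits_in_32bit_unsigned_long is hypothetical and only mimics the range check:

```python
import struct

shape = (900, 700, 2000)
itemsize = 8  # bytes per float64 element

# Same computation as self.size * self.dtype.itemsize in the traceback.
nbytes = itemsize
for extent in shape:
    nbytes *= extent

UINT32_MAX = 2**32 - 1


def fits_in_32bit_unsigned_long(n):
    """True if n can be packed into a fixed 4-byte unsigned long."""
    try:
        struct.pack("<L", n)  # "<L" is always 4 bytes, regardless of platform
        return True
    except struct.error:
        return False


print(nbytes)                               # total bytes requested
print(fits_in_32bit_unsigned_long(nbytes))  # False: exceeds 4294967295
```

The request is about 10 GB, so it cannot be expressed in a 32-bit unsigned long, which is consistent with the OverflowError shown above.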

opened by rhacking 14

Fix operations issues on large arrays

About

This MR fixes a "pycuda hanging forever" issue when array sizes exceed 2**34 bytes. It's done by replacing some occurrences of unsigned (int) with size_t in template kernels (element-wise, reduction, scan).

Close #375

The tests had to be done on arrays of double to avoid numerical issues.
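The reason for using double is presumably that single precision cannot represent every integer in the index range of these arrays. A small standard-library sketch of that limit (the helper to_float32 is made up for illustration):

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


n_elements = 512 * 2048 * 2048  # 2**31 elements in the test arrays
nbytes = n_elements * 8         # 2**34 bytes as float64

# float32 has a 24-bit significand: 2**24 is exact, 2**24 + 1 is not.
print(to_float32(2.0**24) == 2.0**24)      # True
print(to_float32(2.0**24 + 1) == 2.0**24)  # True: the +1 is rounded away

# The largest index written by "d_arr[i] = i" is n_elements - 1.
last_index = float(n_elements - 1)
print(to_float32(last_index) == last_index)  # False in float32
print(int(last_index) == n_elements - 1)     # True in float64
```

With float32, a check such as assert result == d_arr.size could fail for rounding reasons alone, independent of the indexing fix being tested.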

ElementWise

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.elementwise import ElementwiseKernel

eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
d_arr = garray.empty((512, 2048, 2048), np.float64)
eltwise(d_arr)
result = d_arr.get()[()]
reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
assert np.allclose(result, reference)

Reduction

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.reduction import ReductionKernel

reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b", map_expr="x[i]", arguments="double* x")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1) # elementwise
result = reduction(d_arr.ravel()).get()[()]
assert result == d_arr.size

Scan

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.scan import InclusiveScanKernel

cumsum = InclusiveScanKernel(np.float64, "a+b")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)
result = cumsum(d_arr.ravel()).get()[()]
assert result[-1] == d_arr.size

opened by pierrepaleo 13

pycuda._driver.Error: cuInit failed: unknown error

Environment: WSL, Ubuntu 18. I made sure CUDA is connected. nvidia-smi:

I get an error when I use pycuda. There is no error when I enter this code. I then ran the command sudo nvidia-modprobe -u and rebooted my device, but it still doesn't work. What should I do?

opened by Mulbetty 13
copy() for some discontiguous arrays; __setitem__; get2() provisional…

Adds a private function _copy() that copies either a GPUArray/ndarray to another GPUArray/ndarray. The two arrays must have the same shape and dtype. They must be <= 3d. They must have the same order and must be contiguous along the minor axis, but otherwise don't have to have the same strides. Sorry that it's verbose; I can compact it later if it's decided to keep it.

This function is used in copy() and __setitem__(), and in a dumbly-named get2() method which doesn't automatically reshape arrays with the same size but different shape. I wasn't sure what the right thing to do here was.

There isn't an asynchronous version because I'm not familiar yet with how that works.

opened by davidweichiang 13
pycuda with Python multiple host threads
I want to use multiple threads with Python on the host. I tried the following methods:

Use autoinit in the main thread and try TRT inference in a Python ThreadPoolExecutor, but I get a "no activity context" error when using the CUDA API cuda.mem_alloc.

Then I tried creating the ThreadPoolExecutor with an initializer like this:

def cuda_ctx_init():
    _device = cuda.Device(0)
    ctx = _device.make_context()

After doing this, I can successfully allocate GPU memory with cuda.mem_alloc in the thread, but I get another error when doing TRT inference:

Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::165, condition: cudaEventDestroy(context.start) failure.
Parameter check failed at: ../rtSafe/safeContext.cpp::terminateCommonContext::170, condition: cudaEventDestroy(context.stop) failure.
../rtSafe/safeRuntime.cpp (32) - Cuda Error in free: 77 (an illegal memory access was encountered)
terminate called after throwing an instance of 'nvinfer1::CudaError'

I found a C++ solution ("DLA and GPU cores at the same time") and a Python solution ("How to use TensorRT by the multi-threading package of python").

I noticed that, unlike the C++ version, the Python version requires additional operations:

ctx.push()
ctx.pop()

In my tests, these operations take too much time on every call. Is there any other way to use pycuda with multiple Python threads?
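One way to keep the push/pop pairing tidy is to wrap it in a context manager, so the pop always runs even when the work in between raises. This is only a sketch: activated is a hypothetical helper, and FakeContext is a stand-in so the example runs without a GPU. With PyCUDA, the object passed in would be the context returned by make_context(). It does not reduce the cost of the push and pop themselves.

```python
from contextlib import contextmanager


@contextmanager
def activated(ctx):
    """Make ctx current for the duration of the with-block, then restore."""
    ctx.push()
    try:
        yield ctx
    finally:
        ctx.pop()  # runs even if the body raises


class FakeContext:
    """Stand-in with the same push/pop method names, for demonstration only."""

    def __init__(self):
        self.calls = []

    def push(self):
        self.calls.append("push")

    def pop(self):
        self.calls.append("pop")


fake = FakeContext()
with activated(fake):
    fake.calls.append("work")  # GPU calls would go here

print(fake.calls)
```

If several kernels or inferences are issued from the same thread, grouping them inside one with-block would at least amortize the push/pop over more work.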

Thanks!
opened by trobr 10
Add test for reverse slicing
Hi,

I am currently trying to use reverse slicing with a gpuarray (e.g. [end:start:-step]), but I am getting an error. I have quickly written a new test in order to reproduce the bug (feel free to discard my merge request if you are able to work on it).

>       copy.src_pitch = src_strides[1]
E       OverflowError: can't convert negative value to unsigned int

/usr/local/lib/python3.5/dist-packages/pycuda-2017.1.1-py3.5-linux-x86_64.egg/pycuda/gpuarray.py:1300: OverflowError

I have been through the code looking for the definition of src_pitch and I suppose that it comes from cudaMemcpy defined by Nvidia, right? Therefore, it would not be possible to change the unsigned int to a signed one.
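The underlying conflict can be reproduced without a GPU: a reverse slice is described by a negative stride, and a negative number has no representation in an unsigned field. The sketch below uses a standard-library memoryview as a stand-in for the array, and fits_unsigned_int is a hypothetical helper that mimics the range check:

```python
import struct
from array import array

# Four doubles in a contiguous buffer (stdlib stand-in for a 1-D array).
data = array("d", [0.0, 1.0, 2.0, 3.0])
forward = memoryview(data)
backward = forward[::-1]  # reverse slice, as in [end:start:-step]


def fits_unsigned_int(value):
    """True if value can be stored in a 4-byte unsigned int field."""
    try:
        struct.pack("<I", value)
        return True
    except struct.error:
        return False


print(forward.strides, fits_unsigned_int(forward.strides[0]))    # positive stride fits
print(backward.strides, fits_unsigned_int(backward.strides[0]))  # negative stride does not
```

A copy routine whose pitch parameter is unsigned would therefore need the reversed view to be handled some other way, for example by adjusting the start pointer and direction before the copy. That is a design question for the maintainers rather than something this sketch settles.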
opened by loikki 10

Windows+Py3.8: A dynamic link library (DLL) initialization routine failed.

@cgohlke reported in #213, moved here:

ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed

> py -3.8 -c"import pycuda._driver"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

It seems that this error is not due to CUDA, boost, or pybind11 since the simple _pvt_struct extension and pyopencl-2019.1 also fail:

> py -3.8 -c"import pycuda._pvt_struct"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

> py -3.8 -c"import pyopencl"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "X:\Python38\lib\site-packages\pyopencl\__init__.py", line 39, in <module>
    import pyopencl._cl as _cl
ImportError: DLL load failed: A dynamic link library (DLL) initialization routine failed.

opened by inducer 9

Kernel cache not working (hexdigest broken?)

On my systems the kernel caching mechanism is not working. I have tested this on both Linux and Windows.

The checksum.hexdigest() seems to return a different value when passed the same kernel. I have no idea why... does anyone else see this behaviour?
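A quick way to narrow this down might be to hash the inputs by hand and see whether they are really identical between calls. The sketch below is not PyCUDA's cache code. cache_key is a hypothetical helper, and the choice of SHA-256 and of which inputs to include are assumptions made for illustration.

```python
import hashlib


def cache_key(source, options=()):
    """Hash a kernel source string plus compiler options into a hex digest."""
    checksum = hashlib.sha256()
    checksum.update(source.encode("utf-8"))
    for option in options:
        checksum.update(option.encode("utf-8"))
    return checksum.hexdigest()


kernel = "__global__ void twice(float *a) { a[threadIdx.x] *= 2; }"

first = cache_key(kernel, ["-O3"])
second = cache_key(kernel, ["-O3"])
other = cache_key(kernel + " ", ["-O3"])  # a single extra space changes the digest

print(first == second)  # True: same bytes in, same digest out
print(first == other)   # False
```

Since a digest is deterministic for identical bytes, differing values would suggest that some input to the checksum changes between calls (for instance whitespace, option order, or version strings) rather than that hexdigest() itself is broken.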

opened by pwalsh0 9
Add in-place poisson random-number generation

With this version, the input array can be used to supply the per-element lambda value. This is similar to what numpy.random.poisson() allows, either supplying a shape and one lambda value, or an array of lambda values.

This is very useful when simulating detector data for imaging, where each point has a different expected value.
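To make the per-element semantics concrete, here is a CPU-only sketch in plain Python. It uses Knuth's multiplication method as a simple stand-in sampler, which is not necessarily what the GPU generator in this PR uses. The function names poisson_sample and fill_poisson_inplace are hypothetical.

```python
import math
import random


def poisson_sample(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (fine for small lam)."""
    if lam < 0:
        raise ValueError("lambda must be non-negative")
    limit = math.exp(-lam)
    count = 0
    product = 1.0
    while True:
        product *= rng.random()
        if product <= limit:
            return count
        count += 1


def fill_poisson_inplace(values, rng):
    """Replace each entry (read as its own lambda) with a Poisson draw."""
    for index, lam in enumerate(values):
        values[index] = poisson_sample(lam, rng)


rng = random.Random(0)
expected_counts = [0.0, 0.5, 4.0, 20.0]  # e.g. expected photons per detector pixel
fill_poisson_inplace(expected_counts, rng)
print(expected_counts)  # now holds integer counts, one per original lambda
```

Each output entry depends only on the lambda that was stored at the same position, which is the behaviour described above for simulated detector data.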

Let me know if you want examples, there are currently none for the random-number generators.

opened by vincefn 8
Update `_pvt_struct` based on Python 3.8

This was a heavy-handed attempt at #395 until the simpler solution there emerged. This PR serves to preserve that effort, although there is not currently a need for this. Worse, the new struct module causes test failures.

opened by inducer 0

rebase demo_cdpSimplePrint.py on the BSD licensed cdpSimplePrint.cu

examples/demo_cdpSimplePrint.py comes with a problematic license:

 * Adapted from NVIDIA's "cdpSimplePrint - Simple Print (CUDA Dynamic Parallelism)" sample
 * http://docs.nvidia.com/cuda/cuda-samples/index.html#simple-print--cuda-dynamic-parallelism-
 * http://ecee.colorado.edu/~siewerts/extra/code/example_code_archive/a490dmis_code/CUDA/cuda_work/samples/0_Simple/cdpSimplePrint/cdpSimplePrint.cu
 *
 * From cdpSimplePrint.cu (not sure if this is Ok with NVIDIA's 38-page EULA though...):
 * ---------------------------------------------------------------------------
 * Copyright 1993-2012 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 * ---------------------------------------------------------------------------

Fortunately NVIDIA recently released their code samples under the 3-clause BSD license: https://github.com/NVIDIA/cuda-samples

Please rebase that example on the updated free source code https://github.com/NVIDIA/cuda-samples/blob/master/Samples/3_CUDA_Features/cdpSimplePrint/cdpSimplePrint.cu

bug

opened by anbe42 0

``add`` fails for empty array inputs

Here's the MWE

>>> import pycuda.autoinit
>>> import pycuda.gpuarray as gpuarray
>>> import numpy as np
>>> empty_array = np.array([])
>>> empty_array_gpu = gpuarray.to_gpu(empty_array)
>>> result_array = empty_array + 0 # array([], dtype=float64)
>>> result_array_gpu = empty_array_gpu + 0 # Fails

Here's the error trace

---------------------------------------------------------------------------
ArgumentError                             Traceback (most recent call last)
Input In [29], in <cell line: 1>()
----> 1 empty_array_gpu + 0

File ~/pycuda/pycuda/gpuarray.py:593, in GPUArray.__add__(self, other)
    590 elif np.isscalar(other):
    591     # add a scalar
    592     if other == 0:
--> 593         return self.copy()
    594     else:
    595         result = self._new_like_me(_get_common_dtype(self, other))

File ~/pycuda/pycuda/gpuarray.py:393, in GPUArray.copy(self)
    391 def copy(self):
    392     new = GPUArray(self.shape, self.dtype, self.allocator)
--> 393     _memcpy_discontig(new, self)
    394     return new

File ~/pycuda/pycuda/gpuarray.py:1570, in _memcpy_discontig(dst, src, async_, stream)
   1566         drv.memcpy_dtod_async(
   1567             dst.gpudata, src.gpudata, src.nbytes, stream=stream
   1568         )
   1569     else:
-> 1570         drv.memcpy_dtod(dst.gpudata, src.gpudata, src.nbytes)
   1571 else:
   1572     # The arrays might be contiguous in the sense of
   1573     # having no gaps, but the axes could be transposed
   1574     # so that the order is neither Fortran or C.
   1575     # So, we attempt to get a contiguous view of dst.
   1576     dst = _as_strided(dst, shape=(dst.size,), strides=(dst.dtype.itemsize,))

ArgumentError: Python argument types in
    pycuda._driver.memcpy_dtod(NoneType, NoneType, int)
did not match C++ signature:
    memcpy_dtod(unsigned long long dest, unsigned long long src, unsigned long size)

bug

opened by mitkotak 0

``neg`` failing for empty arrays

Here's the MWE

>>> import pycuda.autoinit
>>> import pycuda.gpuarray as gpuarray
>>> import numpy as np
>>> empty_array = np.array([])
>>> empty_array_gpu = gpuarray.to_gpu(empty_array)
>>> neg_empty_array = -empty_array # array([], dtype=float64)
>>> neg_empty_array_gpu = -empty_array_gpu # Fails

Here's the error trace

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
Input In [17], in <cell line: 1>()
----> 1 -gpuarray.to_gpu(a)

File ~/pycuda/pycuda/gpuarray.py:643, in GPUArray.__neg__(self)
    641 def __neg__(self):
    642     result = self._new_like_me()
--> 643     return self._axpbz(-1, 0, result)

File ~/pycuda/pycuda/gpuarray.py:468, in GPUArray._axpbz(self, selffac, other, out, stream)
    463     raise RuntimeError(
    464         "only contiguous arrays may " "be used as arguments to this operation"
    465     )
    467 func = elementwise.get_axpbz_kernel(self.dtype, out.dtype)
--> 468 func.prepared_async_call(
    469     self._grid,
    470     self._block,
    471     stream,
    472     selffac,
    473     self.gpudata,
    474     other,
    475     out.gpudata,
    476     self.mem_size,
    477 )
    479 return out

File ~/pycuda/pycuda/driver.py:626, in _add_functionality.<locals>.function_prepared_async_call(func, grid, block, stream, *args, **kwargs)
    620     raise TypeError(
    621         "unknown keyword arguments: " + ", ".join(kwargs.keys())
    622     )
    624 from pycuda._pvt_struct import pack
--> 626 arg_buf = pack(func.arg_format, *args)
    628 for texref in func.texrefs:
    629     func.param_set_texref(texref)

error: required argument is not an integer

bug

opened by mitkotak 2

``GPUArray.zeros_like | ones_like`` failing for scalar inputs

Here's the MWE

>>> import pycuda.autoinit
>>> import pycuda.gpuarray as gpuarray
>>> import numpy as np
>>> c = 42.0
>>> zero_array = np.zeros_like(c) # array(0.)
>>> zero_array_gpu = gpuarray.zeros_like(c) # Fails
>>> one_array = np.ones_like(c) # array(1.)
>>> one_array_gpu = gpuarray.ones_like(c) # Fails

Here's the error trace

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [15], in <cell line: 1>()
----> 1 gpuarray.zeros_like(c)

File ~/pycuda/pycuda/gpuarray.py:1422, in zeros_like(other_ary, dtype, order)
   1421 def zeros_like(other_ary, dtype=None, order="K"):
-> 1422     dtype, order, strides = _array_like_helper(other_ary, dtype, order)
   1423     result = GPUArray(
   1424         other_ary.shape, dtype, other_ary.allocator, order=order, strides=strides
   1425     )
   1426     zero = np.zeros((), result.dtype)

File ~/pycuda/pycuda/gpuarray.py:1394, in _array_like_helper(other_ary, dtype, order)
   1392         order = "C"
   1393 elif order == "K":
-> 1394     if other_ary.flags.c_contiguous or (other_ary.ndim <= 1):
   1395         order = "C"
   1396     elif other_ary.flags.f_contiguous:

AttributeError: 'float' object has no attribute 'flags'

bug

opened by mitkotak 0

Releases(v2022.2)

v2022.2(Nov 21, 2022)

v2022.1(Jun 24, 2022)


Owner

Andreas Klöckner

GitHub http://mathema.tician.de/software/pycuda

A Python module for getting the GPU status from NVIDIA GPUs using nvidia-smi programmatically in Python

GPUtil GPUtil is a Python module for getting the GPU status from NVIDIA GPUs using nvidia-smi. GPUtil locates all GPUs on the computer, determines thei

927 Dec 8, 2022

Python interface to GPU-powered libraries

Package Description scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries

924 Dec 26, 2022

Python 3 Bindings for NVML library. Get NVIDIA GPU status inside your program.

py3nvml Documentation also available at readthedocs. Python 3 compatible bindings to the NVIDIA Management Library. Can be used to query the state of

212 Jan 4, 2023

BlazingSQL is a lightweight, GPU accelerated, SQL engine for Python. Built on RAPIDS cuDF.

A lightweight, GPU accelerated, SQL engine built on the RAPIDS.ai ecosystem. Get Started on app.blazingsql.com Getting Started | Documentation | Examp

1.8k Jan 2, 2023

Python 3 Bindings for the NVIDIA Management Library

====== pyNVML ====== *** Patched to support Python 3 (and Python 2) *** ------------------------------------------------ Python bindings to the NVID

95 Jan 1, 2023

A Python function for Slurm, to monitor the GPU information

Gpu-Monitor A Python function for Slurm, where I couldn't use nvidia-smi to monitor the GPU information. whole repo is not finish Installation TODO Mo

2 Feb 11, 2022

Free-duolingo-plus - Duolingo account creator that uses your invite code to get you free duolingo plus

free-duolingo-plus duolingo account creator that uses your invite code to get yo

1 Jan 6, 2022

Strawberry-django-plus - Enhanced Strawberry GraphQL integration with Django

strawberry-django-plus Enhanced Strawberry integration with Django. Built on top

138 Dec 28, 2022

Enigma-Plus - Python based Enigma machine simulator with some extra features

Enigma-Plus Python based Enigma machine simulator with some extra features Examp

1 Jan 5, 2022

It is a personal assistant chatbot, capable of performing many of the same tasks as Google Assistant, plus more extra features...

PersonalAssistant It is a Personal Assistant, capable of performing many tasks with some unique features that you haven't seen yet.... Features / Tasks

95 Dec 21, 2022

Plone Interface contracts, plus basic features and utilities

plone.base This package is the base package of the CMS Plone <https://plone.org>. It contains only interface contracts and basic features and utilitie

1 Oct 3, 2022

Shiny Wechat Pay SDK for Python

WeChat third-party Python SDK master: Read the Documentation Features Common public platforms passively respond and actively call APIs WeChat Pay API

18 Sep 5, 2022

Create standalone, installable R Shiny apps using Electron

WARNING This is still very much a work in progress and nothing can be assumed stable in any way Temp notes: Two types of created installer, based on w

5 Dec 24, 2021

Just another Shiny and Greninja-ash killing preventor for Myuu

Myuu-Anti-Shiny-Discord-Bot Why I made it? Since, I was legit fed up of NebbyBot's lag (not criticising it), I decided to make my own but in python an

5 Nov 12, 2022

This Repo is the official CUDA implementation of ICCV 2019 Oral paper for CARAFE: Content-Aware ReAssembly of FEatures

Introduction This Repo is the official CUDA implementation of ICCV 2019 Oral paper for CARAFE: Content-Aware ReAssembly of FEatures. @inproceedings{Wa

42 Jan 7, 2023

Face Recognition plus identification simply and fast | Python

PyFaceDetection Face Recognition plus identification simply and fast Ubuntu Setup sudo pip3 install numpy sudo pip3 install cmake sudo pip3 install dl

16 Sep 22, 2022

A cross platform package to do curses-like operations, plus higher level APIs and widgets to create text UIs and ASCII art animations

ASCIIMATICS Asciimatics is a package to help people create full-screen text UIs (from interactive forms to ASCII animations) on any platform. It is li

3.2k Jan 9, 2023

A3C LSTM Atari with Pytorch plus A3G design

NEWLY ADDED A3G A NEW GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING!! RL A3C Pytorch NEWLY ADDED A3G!! New implementation of A3C

532 Jan 2, 2023

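This is the source code for deeplabv3-plus-pytorch; it can be used to train your own models.

DeepLabv3+: an implementation of the Encoder-Decoder with Atrous Separable Convolution semantic segmentation model in PyTorch. Contents: Performance, Environment, Attention, Download, Training steps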

350 Dec 28, 2022
