## About

This MR fixes a "pycuda hanging forever" issue that occurs when array sizes exceed `2**34` bytes. The fix replaces some occurrences of `unsigned int` with `size_t` in the template kernels (element-wise, reduction, scan).

Closes #375

The tests below had to be run on arrays of `double` to avoid numerical issues.
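For context, the test shapes line up with a 32-bit overflow: `512 * 2048 * 2048` `double` elements occupy exactly `2**34` bytes, which does not fit in an `unsigned int` (assuming a 32-bit `unsigned int`, as on typical CUDA targets). A quick sanity check of the arithmetic:

```
# Byte size of the test arrays used below: 512 * 2048 * 2048 float64 values
nbytes = 512 * 2048 * 2048 * 8
print(nbytes == 2**34)        # True
print(nbytes > 2**32 - 1)     # True: too large for a 32-bit unsigned int
print(nbytes & 0xFFFFFFFF)    # 0: the value a 32-bit counter would wrap to
```

A wrapped offset of 0 is consistent with the observed hang rather than a crash, since the kernel keeps touching valid (but wrong) memory.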

## ElementWise

```
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.elementwise import ElementwiseKernel
eltwise = ElementwiseKernel("double* d_arr", "d_arr[i] = i", "linspace")
d_arr = garray.empty((512, 2048, 2048), np.float64)
eltwise(d_arr)
result = d_arr.get()
reference = np.arange(d_arr.size, dtype=np.float64).reshape(d_arr.shape)
assert np.allclose(result, reference)
```

## Reduction

```
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.reduction import ReductionKernel
reduction = ReductionKernel(np.float64, neutral="0", reduce_expr="a+b", map_expr="x[i]", arguments="double* x")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)  # fill() itself goes through an elementwise kernel
result = reduction(d_arr.ravel()).get()[()]
assert result == d_arr.size
```

## Scan

```
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as garray
from pycuda.scan import InclusiveScanKernel
cumsum = InclusiveScanKernel(np.float64, "a+b")
d_arr = garray.zeros((512, 2048, 2048), np.float64)
d_arr.fill(1)
result = cumsum(d_arr.ravel()).get()[()]
assert result[-1] == d_arr.size
```