An Aspiring Drop-In Replacement for NumPy at Scale

Overview

Legate NumPy

Legate NumPy is a Legate library that aims to provide a distributed and accelerated drop-in replacement for the NumPy API on top of the Legion runtime. Using Legate NumPy, you can do things like run the final example of the Python CFD course completely unmodified on 2048 A100 GPUs in a DGX SuperPOD and achieve good weak scaling.


Legate NumPy works best for programs that have very large arrays of data that cannot fit in the memory of a single GPU or a single node and need to span multiple nodes and GPUs. While our implementation of the current NumPy API is still incomplete, programs that use unimplemented features will still work (assuming enough memory) by falling back to the canonical NumPy implementation.

  1. Dependencies
  2. Installation
  3. Usage and Execution
  4. Supported and Planned Features
  5. Supported Types and Dimensions
  6. Documentation
  7. Future Directions
  8. Known Bugs

Dependencies

Users must have a working installation of the Legate Core library prior to installing Legate NumPy.

Legate NumPy requires Python >= 3.6. We provide a conda environment file that installs all needed dependencies in one step. Use the following command to create a conda environment with it:

conda env create -n legate -f conda/legate_numpy_dev.yml
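
Once the environment has been created, activate it before building or running Legate NumPy (standard conda usage; the environment name matches the -n flag above):

conda activate legate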

Installation

Installation of Legate NumPy is done with either setup.py for simple use cases or install.py for more advanced use cases. The most common installation command is:

python setup.py --with-core /path/to/legate/core

This will build Legate NumPy against the Legate Core installation and then install Legate NumPy into the same location. Users can also install Legate NumPy into an alternative location with the canonical --prefix flag.

python setup.py --prefix /path/to/install --with-core /path/to/legate/core

Note that after the first invocation of setup.py this repository will remember which Legate Core installation to use and the --with-core option can be omitted unless the user wants to change it.
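
On subsequent builds the invocation can therefore be as short as the following (assuming the remembered Legate Core installation is still the desired one):

python setup.py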

Advanced users can also invoke install.py --help directly to see the full set of options for configuring Legate NumPy.

Of particular interest to Legate NumPy users will likely be the option for specifying an installation of OpenBLAS to use. If you already have an installation of OpenBLAS on your machine you can inform the install.py script about its location using the --with-openblas flag:

python setup.py --with-openblas /path/to/open/blas/

Usage and Execution

Using Legate NumPy as a replacement for NumPy is easy. Users only need to replace:

import numpy as np

with:

import legate.numpy as np

These programs can then be run by the Legate driver script described in the Legate Core documentation.

legate legate_numpy_program.py

For execution with multiple nodes (assuming Legate Core is installed with GASNet support), users can supply the --nodes flag. For execution with GPUs, users can use the --gpus flag to specify the number of GPUs to use per node. We encourage all users to familiarize themselves with these resource flags as described in the Legate Core documentation, or simply by passing --help to the legate driver script.
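
For example, a two-node run with eight GPUs per node might look like the following (node and GPU counts are purely illustrative):

legate --nodes 2 --gpus 8 legate_numpy_program.py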

Supported and Planned Features

Legate NumPy is currently a work in progress, and we are gradually adding support for additional NumPy operators. Unsupported NumPy operations will produce a warning that we are falling back to canonical NumPy. Please report unimplemented features that are necessary for attaining good performance so that we can triage them and prioritize implementation appropriately. The more users that report an unimplemented feature, the higher we will prioritize it. If possible, please also include a pointer to your code so we can see how you are using the feature in context.

Supported Types and Dimensions

Legate NumPy currently supports the following NumPy types: float16, float32, float64, int16, int32, int64, uint16, uint32, uint64, bool, complex64, and complex128. Arrays are currently limited to at most three dimensions; we are working on support for N-D arrays. If you have a need for arrays with more than three dimensions, please let us know about it.
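
For instance, the following sketch stays within the currently supported types and dimensionality (array shapes and values are illustrative only):

import legate.numpy as np

a = np.zeros((256, 256, 256), dtype="float32")  # 3-D array: within the supported range
b = np.ones((512, 512), dtype="complex64")      # complex64 is a supported type
print(np.sum(a))                                # a simple reduction over the array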

Documentation

A complete list of available features is provided in the API reference.

Future Directions

There are three primary directions that we plan to investigate with Legate NumPy going forward:

  • More features: we plan to identify a few key lighthouse applications and use the demands of these applications to drive the addition of new features to Legate NumPy.
  • Sharded file I/O: we plan to add support for loading and storing large data sets that could never be loaded on a single node. Initially this will begin with native support for h5py but will grow to accommodate other formats needed by our lighthouse applications.
  • Strong scaling: while Legate NumPy is currently implemented in a way that enables weak scaling of codes on larger data sets, we would also like to make it possible to strong-scale Legate applications for a single problem size. This will require leveraging some of the more advanced features of Legion from inside the Python interpreter.

We are open to comments, suggestions, and ideas.

Known Bugs

  • Legate NumPy can exercise a bug in OpenBLAS when it is run with multiple OpenMP processors.
  • On Mac OSX, Legate NumPy can trigger a bug in Apple's implementation of libc++. The bug has since been fixed but likely will not show up on most Apple machines for quite some time. You may have to manually patch your implementation of libc++. If you have trouble doing this please contact us and we will be able to help you.
Issues
  • legate numpy very slow compared to Python+Numpy


    I've been testing a simple Laplace Eq. solver to compare Python+Numpy to legate.numpy and legate is hugely slower than Numpy.

    The code is taken from: https://barbagroup.github.io/essential_skills_RRC/laplace/1/ . The code I actually run is the following:

    import numpy as np
    import time
    
    
    def L2_error(p, pn):
        return np.sqrt(np.sum((p - pn)**2)/np.sum(pn**2))
    # end if
    
    
    def laplace2d(p, l2_target):
        '''Iteratively solves the Laplace equation using the Jacobi method
    
        Parameters:
        ----------
        p: 2D array of float
            Initial potential distribution
        l2_target: float
            target for the difference between consecutive solutions
    
        Returns:
        -------
        p: 2D array of float
            Potential distribution after relaxation
        '''
    
        l2norm = 1.0
        icount = 0
        tot_time = 0.0
        pn = np.empty_like(p)
        while l2norm > l2_target:
    
            start = time.perf_counter()
    
            icount = icount + 1
            pn = p.copy()
            p[1:-1,1:-1] = .25 * (pn[1:-1,2:] + pn[1:-1, :-2] \
                                  + pn[2:, 1:-1] + pn[:-2, 1:-1])
    
            ##Neumann B.C. along x = L
            p[1:-1, -1] = p[1:-1, -2]     # 1st order approx of a derivative 
            l2norm = L2_error(p, pn)
            end = time.perf_counter()
    
            tot_time = tot_time + (end-start)
    
        # end while
    
        print("l2norm = ",l2norm)
        print("icount = ",icount)
        print("Total Iteration Time = ",tot_time)
        print("   Time per iteration = ",tot_time/icount)
    
        return p
    # end if
    
    
    
    if __name__ == "__main__":
    
        nx = 401
        ny = 401
    
        # Initial conditions
        p = np.zeros((ny,nx)) ##create a XxY vector of 0's
    
        # Dirichlet boundary conditions
        x = np.linspace(0,1,nx)
        p[-1,:] = np.sin(1.5*np.pi*x/x[-1])
        del x
    
    
        start = time.time()
        p = laplace2d(p.copy(), 1e-8)
        stop = time.time()
    
        print("Elapsed time = ",(stop-start)," secs")
        print(" ")
    
    
    # end if
    

    When I run it on my laptop with Anaconda Python3 and Numpy I get the following:

    $ python3 jacobi.py 
    l2norm =  9.99986062249016e-09
    icount =  153539
    Total Iteration Time =  127.02529454990054
       Time per iteration =  0.0008273161512703648
    Elapsed time =  127.14257955551147  secs
    

    When I change the import line to legate.numpy, I usually stop the code after 15 minutes of wall time. I have let it run for up to 60 minutes and it never converges.

    As a check, I've run the Numpy code with legate itself and it exactly matches the Numpy results.

    I have been experimenting with replacing the l2norm computations with numpy-specific functions (np.subtract, np.square, etc.), but I have achieved no increase in performance.

    Does anyone have any recommendations?

    Thanks!

    Jeff

    (edit by Manolis: added some formatting for the code sections)

    enhancement 
    opened by laytonjbgmail 15
  • use OpenBLAS develop branch


    This is clearly an issue in OpenBLAS but it blocks my Legate Numpy install and is unexpected, based on my experience with OpenBLAS in other contexts.

    ~/LEGATE/np$ python3 ./install.py --install-dir $HOME/LEGATE --with-core $HOME/LEGATE 2>&1 | tee log
    Verbose build is  off
    Legate is installing OpenBLAS into a local directory...
    Cloning into '/tmp/tmpm780ryjm'...
    Note: switching to 'd2b11c47774b9216660e76e2fc67e87079f26fa1'.
    
    You are in 'detached HEAD' state. You can look around, make experimental
    changes and commit them, and you can discard any commits you make in this
    state without impacting any branches by switching back to a branch.
    
    If you want to create a new branch to retain commits you create, you may
    do so (now or later) by using -c with the switch command. Example:
    
      git switch -c <new-branch-name>
    
    Or undo this operation with:
    
      git switch -
    
    Turn off this advice by setting config variable advice.detachedHead to false
    
    Switched to a new branch 'master'
    getarch_2nd.c: In function ‘main’:
    getarch_2nd.c:14:35: error: ‘SGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_UNROLL_M’?
       14 |     printf("SGEMM_UNROLL_M=%d\n", SGEMM_DEFAULT_UNROLL_M);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   SBGEMM_DEFAULT_UNROLL_M
    getarch_2nd.c:14:35: note: each undeclared identifier is reported only once for each function it appears in
    getarch_2nd.c:15:35: error: ‘SGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_UNROLL_N’?
       15 |     printf("SGEMM_UNROLL_N=%d\n", SGEMM_DEFAULT_UNROLL_N);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   SBGEMM_DEFAULT_UNROLL_N
    getarch_2nd.c:16:35: error: ‘DGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
       16 |     printf("DGEMM_UNROLL_M=%d\n", DGEMM_DEFAULT_UNROLL_M);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   XGEMM_DEFAULT_UNROLL_M
    getarch_2nd.c:17:35: error: ‘DGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
       17 |     printf("DGEMM_UNROLL_N=%d\n", DGEMM_DEFAULT_UNROLL_N);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   QGEMM_DEFAULT_UNROLL_N
    getarch_2nd.c:21:35: error: ‘CGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
       21 |     printf("CGEMM_UNROLL_M=%d\n", CGEMM_DEFAULT_UNROLL_M);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   XGEMM_DEFAULT_UNROLL_M
    getarch_2nd.c:22:35: error: ‘CGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
       22 |     printf("CGEMM_UNROLL_N=%d\n", CGEMM_DEFAULT_UNROLL_N);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   QGEMM_DEFAULT_UNROLL_N
    getarch_2nd.c:23:35: error: ‘ZGEMM_DEFAULT_UNROLL_M’ undeclared (first use in this function); did you mean ‘XGEMM_DEFAULT_UNROLL_M’?
       23 |     printf("ZGEMM_UNROLL_M=%d\n", ZGEMM_DEFAULT_UNROLL_M);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   XGEMM_DEFAULT_UNROLL_M
    getarch_2nd.c:24:35: error: ‘ZGEMM_DEFAULT_UNROLL_N’ undeclared (first use in this function); did you mean ‘QGEMM_DEFAULT_UNROLL_N’?
       24 |     printf("ZGEMM_UNROLL_N=%d\n", ZGEMM_DEFAULT_UNROLL_N);
          |                                   ^~~~~~~~~~~~~~~~~~~~~~
          |                                   QGEMM_DEFAULT_UNROLL_N
    getarch_2nd.c:71:50: error: ‘SGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
       71 |     printf("#define SLOCAL_BUFFER_SIZE\t%ld\n", (SGEMM_DEFAULT_Q * SGEMM_DEFAULT_UNROLL_N * 4 * 1 *  sizeof(float)));
          |                                                  ^~~~~~~~~~~~~~~
          |                                                  SBGEMM_DEFAULT_Q
    getarch_2nd.c:72:50: error: ‘DGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
       72 |     printf("#define DLOCAL_BUFFER_SIZE\t%ld\n", (DGEMM_DEFAULT_Q * DGEMM_DEFAULT_UNROLL_N * 2 * 1 *  sizeof(double)));
          |                                                  ^~~~~~~~~~~~~~~
          |                                                  SBGEMM_DEFAULT_Q
    getarch_2nd.c:73:50: error: ‘CGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
       73 |     printf("#define CLOCAL_BUFFER_SIZE\t%ld\n", (CGEMM_DEFAULT_Q * CGEMM_DEFAULT_UNROLL_N * 4 * 2 *  sizeof(float)));
          |                                                  ^~~~~~~~~~~~~~~
          |                                                  SBGEMM_DEFAULT_Q
    getarch_2nd.c:74:50: error: ‘ZGEMM_DEFAULT_Q’ undeclared (first use in this function); did you mean ‘SBGEMM_DEFAULT_Q’?
       74 |     printf("#define ZLOCAL_BUFFER_SIZE\t%ld\n", (ZGEMM_DEFAULT_Q * ZGEMM_DEFAULT_UNROLL_N * 2 * 2 *  sizeof(double)));
          |                                                  ^~~~~~~~~~~~~~~
          |                                                  SBGEMM_DEFAULT_Q
    make: *** [Makefile.prebuild:74: getarch_2nd] Error 1
    Makefile:154: *** OpenBLAS: Detecting CPU failed. Please set TARGET explicitly, e.g. make TARGET=your_cpu_target. Please read README for the detail..  Stop.
    Traceback (most recent call last):
      File "./install.py", line 543, in <module>
        driver()
      File "./install.py", line 539, in driver
        install_legate_numpy(unknown=unknown, **vars(args))
      File "./install.py", line 359, in install_legate_numpy
        install_openblas(openblas_dir, thread_count, verbose)
      File "./install.py", line 143, in install_openblas
        execute_command(
      File "./install.py", line 62, in execute_command
        subprocess.check_call(args, cwd=cwd, shell=shell)
      File "/usr/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['make', '-j', '8', 'USE_THREAD=1', 'NO_STATIC=1', 'USE_OPENMP=1', 'NUM_PARALLEL=32', 'LIBNAMESUFFIX=legate']' returned non-zero exit status 2.
    
    opened by jeffhammond 14
  • Realm not completing gather copy on the GPU


    Problem

    Advanced indexing of a relatively huge (e.g., length 10K) 1D array returns UnboundLocalError: local variable 'shardfn' referenced before assignment, rather than NotImplementedError.

    I understand that advanced indexing is mostly not yet implemented. Most related routines raise NotImplementedError to let users know about this situation. However, this particular use case raises this different error, which seems to be a bug to me.

    To reproduce

    1. step 1: prepare test.py:
      from legate import numpy
      a = numpy.arange(10000)
      print(a[(1, 2, 3), ]) 
      
    2. step 2: run with, for example
      $ legate --cpus 1 test.py
      

    Output

    Traceback (most recent call last):
      File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 394, in legion_python_main
        run_path(args[start], run_name='__main__')
      File "<blahblah>/lib/python3.8/site-packages/legion_top.py", line 193, in run_path
        exec(code, module.__dict__, module.__dict__)
      File "./test.py", line 3, in <module>
        print(a[(1, 2, 3), ])
      File "<blahblah>/lib/python3.8/site-packages/legate/numpy/array.py", line 381, in __getitem__
        shape=None, thunk=self._thunk.get_item(key, stacklevel=2)
      File "<blahblah>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 414, in get_item
        copy = Copy(mapper=self.runtime.mapper_id, tag=shardfn)
    UnboundLocalError: local variable 'shardfn' referenced before assignment
    

    Expected results

    Either [1, 2, 3] or NotImplementedError.

    Notes

    • Interestingly, smaller arrays do not have this issue. For example, if a = numpy.arange(100), the code works fine.
    • Another way to make it work is to use GPUs instead of CPUs. For example, legate --gpus 1 test.py works fine. This is interesting, as the GPU implementation seems to be more stable than the CPU implementation?
    bug in progress 
    opened by piyueh 13
  • Using a scalar in `allclose` raises an `AttributeError`


    Problem

    Using a scalar in allclose raises AttributeError: PROJ_1D_1D_.

    To reproduce

    1. step 1: create test.py
      from legate import numpy as lnp
      import numpy as realnp
      
      # vanilla numpy works
      a = realnp.full(10, 1e-1)
      print(realnp.allclose(a, 1e-1))
      
      # legate numpy not working
      la = lnp.full(10, 1e-1)
      print(lnp.allclose(la, 1e-1))
      
    2. step 2: run test.py with legate --cpus 1 ./test.py -lg:numpy:test

    Output

    The first part that uses vanilla NumPy prints True.

    The second part that uses Legate NumPy raises:

    Traceback (most recent call last):
      File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 408, in legion_python_main
        run_path(args[start], run_name='__main__')
      File "<prefix>/lib/python3.8/site-packages/legion_top.py", line 200, in run_path
        exec(code, module.__dict__, module.__dict__)
      File "./test.py", line 10, in <module>
        print(lnp.allclose(la, 1e-1))
      File "<prefix>/lib/python3.8/site-packages/legate/numpy/module.py", line 459, in allclose
        return ndarray.perform_binary_reduction(
      File "<prefix>/lib/python3.8/site-packages/legate/numpy/array.py", line 2068, in perform_binary_reduction
        dst._thunk.binary_reduction(
      File "<prefix>/lib/python3.8/site-packages/legate/numpy/deferred.py", line 5167, in binary_reduction
        ) = self.runtime.compute_broadcast_transform(
      File "<prefix>/lib/python3.8/site-packages/legate/numpy/runtime.py", line 2500, in compute_broadcast_transform
        self.first_proj_id + getattr(NumPyProjCode, proj_name),
      File "<prefix>/lib/python3.8/enum.py", line 384, in __getattr__
        raise AttributeError(name) from None
    AttributeError: PROJ_1D_1D_
    

    Expected behavior

    Working like vanilla NumPy, or raising an exception with a clear message of what is not supported.

    opened by piyueh 10
  • Garbage collection not working properly


    Problem

    During loops, memory usage keeps growing when it should stay constant. This looks like a garbage-collection issue.

    To reproduce

    Option 1: using examples/stencil.py (takes more time to see the crash)

    1. step 1: go to legate.numpy/examples/stencil.py
    2. step 2: run legate --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 20 (I lowered the system memory to make the out-of-memory error happen faster.)

    The crash happens at about the 14th benchmark iteration, so to get a profiling result, change --benchmark to 13. That is, legate --profile --cpus 1 --sysmem 1500 --eager-alloc-percentage 1 ./stencil.py --num 3000 --benchmark 13 -lg:numpy:test -lg:inorder.

    Here is the profiling result: legate_prof.tar.gz

    Option 2: using custom code (faster to see the crash)

    1. step 1: create test.py
      from legate import numpy
      
      a0 = numpy.random.random((1004, 1004))
      b0 = numpy.random.random((1004, 1004))
      c0 = numpy.random.random((1004, 1004))
      
      counter = 0
      while True:
          a = a0.copy()
          b = b0.copy()
          c = c0.copy()
      
          for i in range(2):
              a[2:-2, i] = a[2:-2, 2].copy()
              b[2:-2, i] = b[2:-2, 2].copy()
              c[2:-2, i] = c[2:-2, 2].copy()
      
          for i in range(-3):
              a[2:-2, i] = a[2:-2, -3].copy()
              b[2:-2, i] = b[2:-2, -3].copy()
              c[2:-2, i] = c[2:-2, -3].copy()
      
          for i in range(2):
              a[i, 2:-2] = a[2, 2:-2].copy()
              b[i, 2:-2] = b[2, 2:-2].copy()
              c[i, 2:-2] = c[2, 2:-2].copy()
      
          for i in range(-3):
              a[i, 2:-2] = a[-3, 2:-2,].copy()
              b[i, 2:-2] = b[-3, 2:-2,].copy()
              c[i, 2:-2] = c[-3, 2:-2,].copy()
      
          counter += 1
          print(counter)
      
    2. step 2: run the script with legate --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py.

    The out-of-memory error happened at the 935th iteration. So to get a profiling output, add if counter % 934 == 0: break after print(counter), and then do legate --profile --cpus 1 --sysmem 750 --eager-alloc-percentage 1 ./test.py -lg:numpy:test -lg:inorder.

    Here's the profiling result: legate_prof.tar.gz

    bug in progress 
    opened by piyueh 9
  • Copying a slice to another slice in the same array fails silently


    Bug report due to @piyueh

    The following code, when run with -lg:numpy:test, prints [False], indicating that the slice has not been updated:

    from legate import numpy
    a = numpy.random.random((3, 3))
    a[:, 0] = a[:, 2]
    print(numpy.allclose(a[:, 0], a[:, 2]))
    

    After some digging I found that we skip copies between sub-regions if they're backed by the same field, which is actually only safe if the slices are equivalent: https://github.com/nv-legate/legate.numpy/blob/2b460c5dfdd60b673e37e25231bf625fdf3ead0e/legate/numpy/deferred.py#L101-L105

    If we simply skip this check then the copy ends up happening through a CopyTask, which works with subregions of the same base region.

    However, the runtime errors out if the two slices overlap, e.g. if we do a[0,0:2] = a[0,1:3] (vanilla NumPy accepts this, and does the expected thing). We should at least check for overlaps in python and produce a reasonable error message.
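
    A minimal sketch of such a Python-level overlap check for 1-D slices (the helper below is hypothetical, not part of legate.numpy):

    # Hypothetical helper, for illustration only.
    def slices_overlap(start_a, stop_a, start_b, stop_b):
        # Half-open ranges [start_a, stop_a) and [start_b, stop_b) intersect
        # iff the larger start lies before the smaller stop.
        return max(start_a, start_b) < min(stop_a, stop_b)

    # a[0, 0:2] = a[0, 1:3] -> overlapping source and destination
    assert slices_overlap(0, 2, 1, 3)
    # a[:, 0] = a[:, 2]     -> disjoint columns, safe to copy
    assert not slices_overlap(0, 1, 2, 3)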

    We also want to add a case for this to the test suite.

    bug in progress 
    opened by manopapad 9
  • ndarray.__legate_data_interface__ crashes


    I tried to access the legate data using __legate_data_interface__, but I am getting an error:

      File "./gemm.py", line 62, in run_gemm
        C_data = C.__legate_data_interface__
      File "/home/wwu/legate.numpy/legate/numpy/array.py", line 90, in legate_data_interface
        array = Array(arrow_type, [None, self._thunk])
      File "/home/wwu/legate.core/legate/core/legate.py", line 73, in __init__
        self._region = stores[1].storage[0]
      File "/home/wwu/legate.numpy/legate/numpy/eager.py", line 46, in storage
        return self.deferred.storage
      File "/home/wwu/legate.numpy/legate/numpy/deferred.py", line 169, in storage
        return (self.base.region, self.base.field.field_id)
    AttributeError: 'Store' object has no attribute 'region'

    Here is my code:

    def run_gemm(N, I, ft):  # noqa: E741
        print("Problem Size:     M=" + str(N) + " N=" + str(N) + " K=" + str(N))
        print("Total Iterations: " + str(I))
        flops = total_flops(N, N, N)
        print("Total Flops:      " + str(flops / 1e9) + " GFLOPS/iter")
        space = total_space(N, N, N, ft)
        print("Total Size:       " + str(space / 1e6) + " MB")
        A, B, C = initialize(N, N, N, ft)
        # Compute some sums and check for NaNs to force synchronization
        # before we start the timing
        assert not math.isnan(np.sum(A))
        assert not math.isnan(np.sum(B))
        assert not math.isnan(np.sum(C))
        start = datetime.datetime.now()
        # Run for as many iterations as was requested
        #for idx in range(I):
        np.dot(A, B, out=C)
        C_data = C.__legate_data_interface__
    
    opened by eddy16112 6
  • Build error - comparison to zero  in scalar_unary_red_omp.cc - both g++ and clang++


    Summary: There are 2 instances of what appears to be a comparison of a complex/bool/float (generic 'auto' variable) to zero in unary/scalar_unary_red_omp.cc. g++ and clang++ both fail, saying there's no match for the != operator with the provided types.

    unary/scalar_unary_red_omp.cc:130:77: error: no match for ‘operator!=’ (operand types are ‘const std::complex<float
    >’ and ‘int’)
    

    I've built legate.core (without cuda, see 'aside' at bottom of post) from source, have a pre-installed openblas, and get this error when building legate-numpy on both OSX and Ubuntu

    Instances of the zero comparison (unary/scalar_unary_red_omp.cc), both of which trigger compiler errors. 1:

      130 |         for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;
    

    2:

    137         for (size_t idx = 0; idx < volume; ++idx) {
    138           auto point = pitches.unflatten(idx, rect.lo);
    139           locals[tid] += in[point] != 0;
    

    Command On ubuntu (OSX command is similar):

    python3 install.py --with-core /home/shivneural/legate/legate.core/target --with-openblas /usr/lib/x86_64-linux-gnu/openblas-pthread/
    

    Environment: I'm encountering this error on both OSX and Ubuntu (and have tried a few different compilers):

    1. OSX 10.15.7 Catalina. Compilers tried:
       a. clang++ version 12.0.0
       b. clang++ Apple LLVM version 7.0.2 (clang-700.1.81)
       c. g++-11 (Homebrew GCC 11.1.0) 11.1.0 (fails due to different errors)

    2. Ubuntu 20.04.2 LTS (Focal Fossa). Compilers tried:
       a. g++ (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0

    Full error:

    g++ -o unary/unary_red_omp.cc.o -c unary/unary_red_omp.cc   -fopenmp -I/home/shivneural/legate/legate.core/install/
    thrust  -I. -I/usr/lib/x86_64-linux-gnu/openblas-pthread/include -std=c++14 -Wfatal-errors -I/home/shivneural/legat
    e/legate.core/target/include -O2 -fno-strict-aliasing  -DLEGATE_USE_CUDA -I/include -fPIC -DLEGATE_USE_OPENMP
    unary/scalar_unary_red_omp.cc: In instantiation of ‘void legate::numpy::ScalarUnaryRedImplBody<legate::numpy::Varia
    ntKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO, CODE, DIM>::operator()(uint64_t&, legate::AccessorRO<typen
    ame legate::LegateTypeOf<CODE>::type, DIM>, Legion::Rect<N>&, const legate::numpy::Pitches<(DIM - 1)>&, bool) const
     [with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; uint64_t = long unsigned int; legate::AccessorRO<t
    ypename legate::LegateTypeOf<CODE>::type, DIM> = Legion::FieldAccessor<LEGION_READ_PRIV, std::complex<float>, 1, lo
    ng long int, Realm::AffineAccessor<std::complex<float>, 1, long long int>, false>; typename legate::LegateTypeOf<CO
    DE>::type = std::complex<float>; Legion::Rect<N> = Realm::Rect<1, long long int>]’:
    ./unary/scalar_unary_red_template.inl:132:75:   required from ‘legate::numpy::UntypedScalar legate::numpy::ScalarUn
    aryRedImpl<KIND, legate::numpy::UnaryRedCode::COUNT_NONZERO>::operator()(legate::numpy::ScalarUnaryRedArgs&) const 
    [with legate_core_type_code_t CODE = COMPLEX64_LT; int DIM = 1; legate::numpy::VariantKind KIND = legate::numpy::Va
    riantKind::OMP]’
    /home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:67:40:   required from ‘constexpr decltype(
    auto) legate::inner_type_dispatch_fn<DIM>::operator()(legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor 
    = legate::numpy::ScalarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; F
    nargs = {legate::numpy::ScalarUnaryRedArgs&}; int DIM = 1; legate::LegateTypeCode = legate_core_type_code_t]’
    /home/shivneural/legate/legate.core/target/include/utilities/dispatch.h:141:41:   required from ‘constexpr decltype
    (auto) legate::double_dispatch(int, legate::LegateTypeCode, Functor, Fnargs&& ...) [with Functor = legate::numpy::S
    calarUnaryRedImpl<legate::numpy::VariantKind::OMP, legate::numpy::UnaryRedCode::COUNT_NONZERO>; Fnargs = {legate::n
    umpy::ScalarUnaryRedArgs&}; legate::LegateTypeCode = legate_core_type_code_t]’
    ./unary/scalar_unary_red_template.inl:167:27:   required from ‘legate::numpy::UntypedScalar legate::numpy::scalar_u
    nary_red_template(legate::TaskContext&) [with legate::numpy::VariantKind KIND = legate::numpy::VariantKind::OMP]’
    unary/scalar_unary_red_omp.cc:151:61:   required from here
    unary/scalar_unary_red_omp.cc:130:77: error: no match for ‘operator!=’ (operand types are ‘const std::complex<float
    >’ and ‘int’)
      130 |         for (size_t idx = 0; idx < volume; ++idx) locals[tid] += inptr[idx] != 0;
          |                                                                  ~~~~~~~~~~~^~~~
    compilation terminated due to -Wfatal-errors.
    make: *** [/home/shivneural/legate/legate.core/target/share/legate/legate.mk:200: unary/scalar_unary_red_omp.cc.o] 
    Error 1
    

    Changing the 0 to a std::complex<float>(0.0f,0.0f) emits an error that the operand types are a bool and complex float.

    Perhaps I'm configuring something incorrectly, in which case any guidance is appreciated.

    (Aside: My Ubuntu machine is a GCP instance w/ a T4 GPU, running cuda 10.1. When kicking off a legate-core with-cuda build it fails as it can't recognize the "__habs" half-precision function when building Legion: legate/legate.core/legion/runtime/mathtypes/half.h(364): error: identifier "__habs" is undefined. Looks like T4's Turing architecture isn't one of legate's supported platforms, but AFAIK Turing supports half precision.) Thanks

    opened by shivsundram 5
  • Exception raised when using sys.exit(0)


    Problem

    When a program uses sys.exit(0) to indicate its termination, Legate seems to catch the SystemExit raised by sys.exit(0) and treat it as an error.

    To reproduce

    1. step 1: prepare two Python test1.py and test2.py with these contents:
      • test1.py
        import sys
        from legate import numpy
        sys.exit(0)
        
      • test2.py
        import sys
        from legate import numpy
        
    2. step 2: run both scripts with legate, for example:
      $ legate --cpus 1 test1.py
      

      and

      $ legate --cpus 1 test2.py
      

    Expected and actual outputs

    Both scripts are supposed to output nothing. However, test1.py returns this message

    [0 - 7f34317b87c0]    0.807057 {6}{python}: python exception occurred within task:
    

    I guess Legate catches the SystemExit from sys.exit and treats it as a normal exception, i.e., an error.

    enhancement 
    opened by piyueh 5
  • Added NumPy / Legate comparison table


    Ported over and enhanced CuPy's comparison table.

    opened by mmccarty 3
  • Missing features in unary reduction


    The current unary reduction is missing implementations for the following cases:

    • [ ] np.argmin and np.argmax with no axis value
    • [ ] any unary reduction with more than one reducing dimension
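
    For reference, the missing cases look like this in NumPy terms (shown with vanilla NumPy purely to illustrate the calls in question):

    import numpy as np

    a = np.arange(24).reshape(2, 3, 4)
    np.argmax(a)            # no axis value: reduces over the flattened array
    np.sum(a, axis=(0, 2))  # more than one reducing dimension
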
    opened by magnatelee 0
  • Non-deterministic wrong result from tensordot


    This program (derived from tests/tensordot.py):

    import legate.numpy as lg
    import numpy as np
    
    a = lg.random.rand(3, 5, 4).astype(np.float16)
    b = lg.random.rand(4, 5, 3).astype(np.float16)
    
    a = lg.random.rand(3, 5, 4).astype(np.float16)
    b = lg.random.rand(5, 4, 3).astype(np.float16)
    cn = np.tensordot(a, b)
    print('cn', flush=True)
    print(cn, flush=True)
    c = lg.tensordot(a, b)
    print('c', flush=True)
    print(c, flush=True)
    
    assert np.allclose(cn, c)
    

    when run as follows:

    LEGATE_TEST=1 legate 79.py -lg:numpy:test --cpus 4
    

    fails about 20% of the time, with:

    cn
    [[4.07  4.83  5.01 ]
     [4.2   4.562 5.863]
     [4.344 4.52  3.914]]
    c
    [[4.07  4.83  5.01 ]
     [4.2   4.562 5.863]
     [4.344 4.52  3.916]]
    [0 - 700005133000]    0.946367 {6}{python}: python exception occurred within task:
    Traceback (most recent call last):
      File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 410, in legion_python_main
        run_path(args[start], run_name='__main__')
      File "/Users/mpapadakis/legate.core/install/lib/python3.8/site-packages/legion_top.py", line 234, in run_path
        exec(code, module.__dict__, module.__dict__)
      File "79.py", line 16, in <module>
        assert np.allclose(cn, c)
    AssertionError
    
    bug 
    opened by manopapad 3
  • Test suite additional configurations


    NumPy-specific additional configurations to run our existing tests (in addition to generic configurations listed in https://github.com/nv-legate/legate.core/issues/27):

    • [ ] Run without partitioning (currently we always force partitioned execution with NUMPY_TEST=1).
    • [ ] Run in eager mode (currently we always force deferred mode with -lg:numpy:test).
    • [ ] Run with shadow debugging enabled.
    testing 
    opened by manopapad 0
  • Redirecting outputs to overlapping slices of inputs is buggy


    I believe examples like the following would raise an interfering requirement error, in both master and branch-21.10:

    legate.numpy.add(x[1:5], 1, out=x[2:6])
    

    We need to fix the operators whose outputs can be redirected such that the intermediate results are materialized before getting assigned to the designated arrays. The in-place update code already handles this using the alias check on Legate Stores, so we can reuse that code.

    bug 
    opened by magnatelee 3
  • Comprehensively exercise each function in the test suite


    Each supported function should be fully exercised in the test suite, e.g.:

    • all datatypes supported for the operation
    • all supported options
    • using a where array, if supported by the operation
    • all broadcasting modes
    • all cases of implicit up-casting (e.g. adding an integer and a real array)
    • other cases of store transformations on inputs (e.g. slice, transpose)
    • all array dimensions, up to the max number of dimensions that Legate was compiled for
    • passing existing arrays as outputs

    To cover an arbitrary number of dimensions it will be necessary to programmatically generate inputs, e.g. see https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/tests/binary_op_broadcast.py and https://github.com/nv-legate/legate.numpy/blob/067a541905bf3bfc8d3727c6e1fe97a4855729b9/tests/intra_array_copy.py.
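
    A minimal sketch of what programmatic input generation could look like (the dtypes, shapes, and the choice of add here are illustrative; lg.array and lg.add are simply the drop-in NumPy API):

    import itertools
    import numpy as np          # reference implementation
    import legate.numpy as lg   # implementation under test

    DTYPES = ("float16", "float32", "float64", "int32", "complex64")
    SHAPES = ((4,), (4, 5), (4, 5, 6))  # up to the currently supported 3 dimensions

    for dtype, shape in itertools.product(DTYPES, SHAPES):
        a = np.random.rand(*shape).astype(dtype)
        b = np.random.rand(*shape).astype(dtype)
        # Compare the Legate result against the canonical NumPy result.
        assert np.allclose(np.add(a, b), lg.add(lg.array(a), lg.array(b)))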

    The NumPy test suite may be a good starting point, see #22.

    testing 
    opened by manopapad 0
  • Performance regression testing


    It would be good to periodically run performance regression tests on representative hardware combinations. A project like https://github.com/spcl/npbench, or an actual benchmark run we did in the past, could be used as a starting point for a benchmark suite.

    testing 
    opened by manopapad 0
  • Meaning of parameters in NumPyProjectionFunctorRadix2D


    Hi, in the NumPyProjectionFunctorRadix2D (similar in NumPyProjectionFunctorRadix3D) https://github.com/nv-legate/legate.numpy/blob/896f4fd9b32db445da6cdabf7b78d523fca96936/src/proj.cc#L528

    there are three parameters: template <int DIM, int RADIX, int OFFSET>. The DIM is the dimension of a tensor. What about RADIX and OFFSET? I notice that in the register_projection_functors function, the RADIX is given as 4 and OFFSET ranges from 0 to 3.

    register_functor<NumPyProjectionFunctorRadix3D<0, 4, 0>>( runtime, offset, NUMPY_PROJ_RADIX_3D_X_4_0);

    What about 4D or N-D arrays? Should I define the 4D tensor projection functor as follows:

    register_functor<NumPyProjectionFunctorRadix4D<0, 4, 0>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<0, 4, 1>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<0, 4, 2>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<0, 4, 3>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_X_4_0);
    
    register_functor<NumPyProjectionFunctorRadix4D<1, 4, 0>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<1, 4, 1>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<1, 4, 2>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<1, 4, 3>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Y_4_0);
    
    register_functor<NumPyProjectionFunctorRadix4D<2, 4, 0>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<2, 4, 1>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<2, 4, 2>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<2, 4, 3>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_Z_4_0);
    
    register_functor<NumPyProjectionFunctorRadix4D<3, 4, 0>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<3, 4, 1>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<3, 4, 2>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
    register_functor<NumPyProjectionFunctorRadix4D<3, 4, 3>>(
    runtime, offset, NUMPY_PROJ_RADIX_4D_D_4_0);
    

    ?

    opened by MiZhangWhuer 2
  • sharding functor related error when running gemm.py


    I'm executing gemm.py on 8 nodes with this command line and see the following error

    /g/g15/yadav2/legate.core/install/bin/legate /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 --omps 2 --ompthreads 18 --nodes 32 --numamem 30000 --eager-alloc-percentage 1 --cpus 1 --sysmem 10000 --launcher jsrun --cores-per-node 40 --verbose
    Running: jsrun -n 32 -r 1 -a 1 -c 40 -g 0 -b none /g/g15/yadav2/legate.core/install/bin/legion_python /g/g15/yadav2/legate.numpy/examples/gemm.py -n 46340 -p 64 -i 10 --num_nodes 32 -ll:py 1 -lg:local 0 -ll:ocpu 2 -ll:othr 18 -ll:onuma 1 -ll:util 2 -ll:bgwork 2 -ll:csize 10000 -ll:nsize 30000 -ll:ncsize 0 -level openmp=5 -lg:eager_alloc_percentage 1
    [7 - 20003ac2f8b0]    5.675422 {5}{runtime}: [error 605] LEGION ERROR: Illegal output shard 32 from sharding functor 1073741900. Shards for this index space launch must be between 0 and 32 (exclusive). (from file /g/g15/yadav2/legate.core/legion/runtime/legion/runtime.cc:15560)
    For more information see:
    http://legion.stanford.edu/messages/error_code.html#error_code_605
    
    opened by rohany 0
  • NDArray support


    Is there any plan for the NDArray?

    opened by MiZhangWhuer 3
  • Return order of nonzero elements differs from NumPy


    Consider the following program:

    a = np.arange(25).reshape((5,5))
    print(np.nonzero(a > 17))
    

    When run with Python NumPy (or legate.numpy on 1 CPU), the non-zero indices are returned in this order:

    x = [3 3 4 4 4 4 4]
    y = [3 4 0 1 2 3 4]
    

    If instead we run with legate.numpy on 4 CPUs (using NUMPY_TEST to force legate.numpy to do distributed execution) (command line: NUMPY_TEST=1 legate nz-order.py -lg:numpy:test --cpus 4) we get:

    x = [4 4 4 3 3 4 4]
    y = [0 1 2 3 4 3 4]
    

    I.e. legate.numpy returns the indices grouped by tile (see nz-order.pdf for a visualization), instead of returning them according to the global row-major order, as is guaranteed in the NumPy API. This is a side effect of how distributed nonzero is implemented. Making this work like NumPy would require a sort after every nonzero call.
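
    For code that does depend on the canonical ordering, one user-side workaround sketch is to sort the returned indices back into global row-major order (shown here with the index arrays from the example above):

    import numpy as np

    x = np.array([4, 4, 4, 3, 3, 4, 4])
    y = np.array([0, 1, 2, 3, 4, 3, 4])
    order = np.lexsort((y, x))   # primary key x (row), secondary key y (column)
    print(x[order])  # [3 3 4 4 4 4 4]
    print(y[order])  # [3 4 0 1 2 3 4]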

    We could simply decide to live with this incompatibility, since I expect most code using nonzero will not explicitly depend on the order that nonzero elements are returned in. The most likely scenario I can think of where this incompatibility would be problematic is if the user code mixes the results of different nonzero calls in the same operation:

    import legate.numpy as np
    
    # too small to be partitioned; indices will be in C order
    small = np.ones((2,2))
    small_is = np.nonzero(small)
    
    # large enough to be partitioned; indices will be grouped by tile.
    large = np.zeros((10000,10000))
    large[2500,2500] = 2.0
    large[2500,7500] = 3.0
    large[2501,2500] = 4.0
    large[2501,7500] = 5.0
    large_is = np.nonzero(large)
    
    small[small_is] = large[large_is]
    print(small)
    
    opened by manopapad 1