⚡️Optimizing einsum functions in NumPy, Tensorflow, Dask, and more with contraction order optimization.

Overview

Optimized Einsum


Optimized Einsum: A tensor contraction order optimizer

Optimized einsum can significantly reduce the overall execution time of einsum-like expressions (e.g., np.einsum, dask.array.einsum, torch.einsum, tensorflow.einsum) by optimizing the expression's contraction order and dispatching many operations to canonical BLAS, cuBLAS, or other specialized routines.

Optimized einsum is agnostic to the backend and can handle NumPy, Dask, PyTorch, Tensorflow, CuPy, Sparse, Theano, JAX, and Autograd arrays as well as potentially any library which conforms to a standard API. See the documentation for more information.

Example usage

The opt_einsum.contract function can often act as a drop-in replacement for einsum functions without further changes to the code while providing superior performance. Here, a tensor contraction is performed with and without optimization:

import numpy as np
from opt_einsum import contract

N = 10
C = np.random.rand(N, N)
I = np.random.rand(N, N, N, N)

%timeit np.einsum('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1 loops, best of 3: 934 ms per loop

%timeit contract('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1000 loops, best of 3: 324 us per loop

In this particular example, we see a ~3000x performance improvement which is not uncommon when compared against unoptimized contractions. See the backend examples for more information on using other backends.
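
For instance, the same expression can be dispatched to other array libraries. A minimal sketch, assuming Dask is installed (contract detects the array type, so a lazy dask array is returned):

import dask.array as da
from opt_einsum import contract

# Chunked, lazy dask arrays instead of NumPy arrays.
d_C = da.random.random((10, 10), chunks=5)
d_I = da.random.random((10, 10, 10, 10), chunks=5)

lazy = contract('pi,qj,ijkl,rk,sl->pqrs', d_C, d_C, d_I, d_C, d_C, backend='dask')
result = lazy.compute()  # trigger the actual computation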

Features

The algorithms found in this repository often power the einsum optimizations in many of the above projects. For example, the optimization of np.einsum has been passed upstream, and most of the same features found in this repository can be enabled with np.einsum(..., optimize=True). However, this repository often has more up-to-date algorithms for complex contractions.
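
A minimal side-by-side sketch of the two entry points (both compute the same result):

import numpy as np
from opt_einsum import contract

a = np.random.rand(10, 10)
b = np.random.rand(10, 10, 10)

# The optimizer upstreamed into NumPy itself:
x = np.einsum('ij,jkl->ikl', a, b, optimize=True)

# The same contraction through this repository's (often newer) algorithms:
y = contract('ij,jkl->ikl', a, b)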

The following capabilities are enabled by opt_einsum:

• Contraction-order optimization via a range of path finders (optimal, branch, greedy, random-greedy, dp, and an automatic default).
• Dispatch of pairwise contractions to BLAS (e.g. GEMM) via tensordot wherever possible.
• Reusable, precompiled contraction expressions via contract_expression, including constant tensors.
• Sharing of intermediates between contractions with the shared_intermediates context manager.
• Arbitrary hashable index labels and support for many non-NumPy backends.

Please see the documentation for more features!

Installation

opt_einsum can be installed via pip (pip install opt_einsum) or via conda (conda install opt_einsum -c conda-forge). See the installation documentation for further methods.
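
After installing, a quick smoke test (a minimal sketch):

import numpy as np
import opt_einsum as oe

print(oe.__version__)
print(oe.contract('ij,jk->ik', np.eye(2), np.eye(2)))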

Citation

If this code has benefited your research, please support us by citing:

Daniel G. A. Smith and Johnnie Gray, opt_einsum - A Python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software, 2018, 3(26), 753

DOI: https://doi.org/10.21105/joss.00753

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide.

Comments
  • added dynamic programming path

    Description

    In the field of tensor network states (matrix product states, projected entangled pair states, ...) the most common approach to optimise tensor network contractions is based on dynamic programming [Phys. Rev. E 90, 033315 (2014)], which usually provides an optimal contraction path within a reasonable amount of time (even for a large number of tensors). This pull request provides such an implementation.
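
    For reference, this optimizer later shipped as the 'dp' path (see the v3.2.0 release notes below); a minimal usage sketch, assuming that path name:

    import numpy as np
    import opt_einsum as oe

    # Request the dynamic-programming optimizer by name and inspect the path cost.
    eq, shapes = oe.helpers.rand_equation(10, 3, seed=42)
    views = [np.random.rand(*s) for s in shapes]
    path, info = oe.contract_path(eq, *views, optimize='dp')
    print(info.opt_cost)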

    Questions

    • [ ] Is there a simple way to run the benchmarks of the "Performance Comparison" section of the documentation for the dynamic programming optimiser to compare with the other approaches?
    opened by mrader1248 45
  • [WIP] recursive, depth-first optimizer

    Use recursive, depth-first optimizer for optimal & possibly 'greedy2'

    I had a play around with making the opt_einsum optimal path finder a bit more efficient. The basic idea is:

    1. Use a recursive search to simplify the logic and find paths 'depth-first', i.e. follow a path all the way to its termination before searching for the next one.
    2. This in turn allows searches to be terminated as soon as the path cost rises above the best found so far, sieving out many known-bad paths

    The results are pretty promising, current optimal:

    >> eq, shapes = oe.helpers.rand_equation(8, 3, seed=42)
    >> views = [np.random.rand(*s) for s in shapes]
    >> %time path = oe.contract_path(eq, *views, optimize='optimal')[1]
    CPU times: user 25.3 s, sys: 716 ms, total: 26 s
    Wall time: 26 s
    >>> path.opt_cost
    130316
    

    and with this recursive strategy:

    >> %time path = oe.contract_path(eq, *views, optimize='roptimal')[1]
    CPU times: user 107 ms, sys: 324 µs, total: 107 ms
    Wall time: 108 ms
    >>> path.opt_cost
    130316
    

    The fundamental scaling is still factorial, but it should allow optimal-paths to be realistically found for expressions of up to 10 tensors. Possibly other heuristics can be used to sieve paths eagerly as well.

    One additional option this opens up is to recurse only into the n best possible contractions (the current greedy/eager being n=1). Thus you could have a 'greedy2', 'greedy3' etc., which scale like 2**n, 3**n; while still exponential (modulo sieving), this is significantly more manageable than factorial scaling. These might use custom cost functions #64 and random shaking as well #63.

    Anyway, thought I'd share this WIP initial implementation. My suggestion might be to have this approach eventually replace optimal, with 'greedy', 'greedy2'... using it also.

    cc @fritzo --- took some inspiration from your 'eager' path if you have any input!

    TODO:

    • [x] apply some caching?
    • [ ] keep track of keep indices rather than computing each time?
    • [x] handle memory_limit case properly? (i.e. contract all remaining tensors at once)
    enhancement 
    opened by jcmgray 34
  • Use best of many random greedy paths

    In quantum computing simulations particularly, finding the optimal contraction sequence has a huge effect on what is computable - the amount of memory and time required is exponential in how 'bad' the path found is. The greedy approach is generally worse than the optimal approach but it is cheap to compute (and even cheaper with #60!).

    If we introduce a random greedy algorithm by tweaking the cost objective function, we get a variety of different paths with varying costs and we can just pick the best one. I had a little play with contracting a 49 qubit depth 20 circuit (661 tensors overall) and randomly adjusting each cost by +- 5%.

    The following are 'contraction complexities' (the maximum rank of tensor produced; 2^rank is the memory required, which is also a rough proxy for computation time) on the left, and frequency on the right, from 100 runs.

    [(25, 1),
     (27, 1),
     (28, 3),
     (29, 6),
     (30, 10),
     (31, 16),
     (32, 6),  # <- this is what the current greedy finds
     (33, 9),
     (34, 21),
     (35, 8),
     (36, 8),
     (37, 1),
     (38, 4),
     (39, 2),
     (40, 1),
     (42, 3)]
    

    So the best contraction found vs the current contraction found requires ~128x less memory and computation, and turns the simulation from 'super'-computable to 'laptop'-computable.
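
    This idea later shipped as the random-greedy family of paths (the release notes below mention 'random-greedy-128' and RandomOptimizers); a minimal sketch using that path name:

    import numpy as np
    import opt_einsum as oe

    # Sample many perturbed greedy paths and keep the cheapest one found.
    eq, shapes = oe.helpers.rand_equation(20, 3, seed=0)
    views = [np.random.rand(*s) for s in shapes]
    path, info = oe.contract_path(eq, *views, optimize='random-greedy')
    print(info.opt_cost)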

    enhancement 
    opened by jcmgray 19
  • Implement shared_intermediates context manager

    Resolves #9

    Description

    This implements a shared_intermediates context manager, within which multiple calls to contract will share intermediate computations. This is very useful e.g. in message passing algorithms. The first draft of this PR is copied directly from Pyro, where we are using sharing for probabilistic inference in graphical models.

    The implementation uses a special internal backend opt_einsum.backends.shared. When a sharing context is opened

    with shared_intermediates():
        contract(..., backend='foo')
        contract(..., backend='foo')
    

    a special handle_sharing() context is activated inside the contract function, temporarily switching to the shared backend, and setting the original backend foo in a global shared._CURRENT_BACKEND variable. The sharing backend then performs all the original foo operations and also memoizes them.

    One design choice is to memoize by id() rather than by value. This makes sense from a computational perspective (equality comparison is expensive), but requires a bit more care by users.

    A second design choice is to expose the memoization cache to users. This makes it easy to share intermediates by passing the cache around, as is necessary when it is difficult to perform all computation in a context:

    with shared_intermediates() as cache:
        contract(...)
    ...pass control elsewhere...
    with shared_intermediates(cache):  # <-- reuse previous cache
        contract(...)
    del cache  # <-- intermediates will be garbage collected
    

    Todos

    • [x] Cache type conversions e.g. numpy->torch->numpy so args can be compared by id.
    • [x] fix failing tests
    • [x] add tests that pass around the cache
    • [x] add tests where we test the normal ordering:
      with shared_intermediates() as cache:
          a = np.random.rand(4, 4)
          b = np.random.rand(4, 4)
          contract("ab,bc->ac", a, b)
      
      assert get_cache_key("zc,az->ac", b, a) in cache
      ...
      
    • [x] test against the Pyro rm-einsum-shared branch
    • [x] test nested sharing
    • [x] write docs
    • [x] move _alpha_canonicalize() to parser.py
    • [x] test caching with constant expressions

    Questions

    • [x] Could a maintainer suggest where the handle_sharing() context should be moved? I've attempted to insert it into ContractExpression.__call__(), but I'm a bit lost.

    Status

    • [x] Ready to go
    enhancement 
    opened by fritzo 16
  • Re-use intermediates across various contractions (Common Subexpression Elimination)

    Suppose you want to compute two contractions of the same tensors, e.g. contraction strings

    ['ab,dca,eb,cde', 'ab,cda,eb,cde']

    The (globally) optimal way to do this would be to first perform the contractions over indices a,b and e, and then perform the remaining contractions over c and d for the two sets of contractions. The current opt_einsum implementation does not allow for such a global optimization of contraction order and re-use of common intermediates.

    I'm still organizing my thoughts on this, and all input would be most welcome. On a side note, I believe implementing such a more general optimization strategy will also fix #7 as a by-product.
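
    A partial workaround available today is the shared_intermediates machinery from the PR above, which memoizes intermediates across separate contract calls; a hedged sketch for the two example strings:

    import numpy as np
    from opt_einsum import contract, shared_intermediates

    A = np.random.rand(5, 5)      # 'ab'
    B = np.random.rand(5, 5, 5)   # 'dca' / 'cda'
    C = np.random.rand(5, 5)      # 'eb'
    D = np.random.rand(5, 5, 5)   # 'cde'

    # Intermediates built by the first call can be reused by the second,
    # capturing some (though not all) of the common work described above.
    with shared_intermediates():
        r1 = contract('ab,dca,eb,cde->', A, B, C, D)
        r2 = contract('ab,cda,eb,cde->', A, B, C, D)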

    Optimization logic

    A relevant publication suggesting a concrete algorithm for this optimization problem is

    Hartono et al, Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations

    I do not know to what extent the current code can be re-used in this more general setup, but the single-term optimization should be under control with the current implementation.

    Interface

    Such a multi_contract function could be called with a list of tuples (contractionString, [tensors, ..]), and would return a list of results with the same length as the input list.

    Internally, one would have to find out which tensors are actually identical between the various contractions, and then use the above contraction logic. Ideally this information should be deduced automatically and not rely on user input being in the correct form. In the same spirit, dummy indices should be allowed to have arbitrary names, i.e. they should not have to match across different contractions to be correctly identified as a common subexpression. This may require transforming the input into a 'canonical' form first to make sure that common subexpressions are captured correctly.
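
    To illustrate the canonicalization step, a small sketch (just the idea, not the library's internal _alpha_canonicalize):

    def alpha_canonicalize(equation):
        """Relabel indices in order of first appearance so that equations that
        differ only in dummy-index names map to the same canonical string."""
        mapping = {}
        out = []
        for char in equation:
            if char in ',->':
                out.append(char)
                continue
            if char not in mapping:
                mapping[char] = 'abcdefghijklmnopqrstuvwxyz'[len(mapping)]
            out.append(mapping[char])
        return ''.join(out)

    assert alpha_canonicalize('xy,zwx') == alpha_canonicalize('ab,cda')  # both give 'ab,cda'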

    In contrast to the setup outlined in Hartono et al, contraction strings should maybe remain in their current 'simple' form and not be generalized to allow for numerical expressions like sums of tensors etc. Such a behavior can be implemented a posteriori with the interface described here by computing the sum of the resulting tensors, e.g.

    contract('ab,ab + ab,ba', N,M) --> 'sum'(multi_contract( [('ab,ab', N,M), ('ab,ba', N,M)] ))

    Thus, restricting contraction strings to the form currently used does not cause any loss of generality; the only downside is a slightly increased memory footprint, since the function would return several arrays instead of one.

    Other thoughts?

    enhancement 
    opened by ebatz 15
  • Support more symbols for subscripts (#67)

    • Add more symbol support based on hash

    • Refactor the converting to subscripts routine a little bit

    • Update tests

    Description

    Make opt_einsum support anything hashable as subscripts, such as:

    contract(arr1, ['left edge', 'bond'], arr2, ['bond', 'right edge'])
    

    See #67

    Questions

    • [x] I've tried to add the new feature to the document, but it seems the document does not have a chapter describing the inputs. Maybe I've left out something or maybe this should be solved in a different PR?

    Status

    • [x] Ready to go
    enhancement 
    opened by liwt31 14
  • non-numpy (e.g. GPU) backend support

    This adds non-numpy backend support to opt_einsum in two flavors:

    General duck-typed support

    For arbitrary ndarray libraries, i.e. if the supplied arrays have a .shape attribute and opt_einsum can find {backend}.tensordot etc., then it should work, returning whatever type the library uses. So for example

    contract("abc,bcd,def->aef", *dask_arrays, backend='dask')
    

    will return a dask.Array without any explicit logic. I've added tests for dask and also sparse.

    Edit: I should add that extra functionality like taking diagonals etc. is still reliant on an einsum implementation.

    Special conversion support

    To support GPU operations on numpy arrays, on the first call the arrays are converted to the correct placeholders, an expression is generated using these, and this expression can then be called with normal numpy arrays, returning numpy arrays as well. E.g.:

    expr = contract_expression("abc,bcd,def->aef", *shapes)
    expr(*numpy_arrays, backend='theano')
    

    Explicitly tested support here is for tensorflow, theano and cupy: GPU libraries where it makes sense to convert to and from.

    Adding more backends

    contract.py doesn't have any explicit reference to the various backends; that's all contained in backends.py now. So it should be very easy to add new backends. E.g. if tblis were to add tensordot, it would fall under the duck-typing category above and I think no changes would be needed at all.

    Obviously any recommendations welcome! Getting the right backend function and all that is cached, so I don't think this adds more than a few hundred nanoseconds of overhead.

    opened by jcmgray 14
  • `Optimal` path not optimal?

    First, thanks for the great package!

    I was running optimal vs. auto on a 10-element random expression, and while I know optimal is not recommended for larger networks, I found it odd that it gives worse scaling and a larger intermediate size. Some of the other threads mention memory as a factor in the optimization, but the auto solution seems better on memory as well. All the tensors are 3x3. Is this intentional behaviour?

    auto path:

    ([(2, 3), (4, 8), (0, 4), (0, 4), (4, 5), (0, 2), (2, 3), (1, 2), (0, 1)],
       Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
              Naive scaling:  6
          Optimized scaling:  3
           Naive FLOP count:  7.290e+3
       Optimized FLOP count:  2.700e+2
        Theoretical speedup:  27.000
       Largest intermediate:  9.000e+0 elements
     --------------------------------------------------------------------------------
     scaling        BLAS                current                             remaining
     --------------------------------------------------------------------------------
        2              0              fe,fe->fe         db,cc,aa,ff,fe,cb,ea,ac,fe->d
        2              0              fe,fe->fe            db,cc,aa,ff,cb,ea,ac,fe->d
        3           GEMM              cb,db->cd               cc,aa,ff,ea,ac,fe,cd->d
        2              0              ac,cc->ac                  aa,ff,ea,fe,cd,ac->d
        3           GEMM              ac,cd->ad                     aa,ff,ea,fe,ad->d
        2              0              ea,aa->ea                        ff,fe,ad,ea->d
        3           GEMM              ea,ad->ed                           ff,fe,ed->d
        3           GEMM              ed,fe->df                              ff,df->d
    

    optimal path:

    ([(1, 7), (0, 8), (0, 2), (0, 1), (0, 5), (0, 3), (0, 3), (0, 2), (0, 1)],
       Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
              Naive scaling:  6
          Optimized scaling:  4
           Naive FLOP count:  7.290e+3
       Optimized FLOP count:  5.130e+2
        Theoretical speedup:  14.211
       Largest intermediate:  2.700e+1 elements
     --------------------------------------------------------------------------------
     scaling        BLAS                current                             remaining
     --------------------------------------------------------------------------------
        2              0              cb,cc->cb         db,fe,fe,aa,ff,fe,ea,ac,cb->d
        3           GEMM              cb,db->cd            fe,fe,aa,ff,fe,ea,ac,cd->d
        3              0             aa,fe->afe              fe,ff,fe,ea,ac,cd,afe->d
        2              0              ff,fe->fe                 fe,ea,ac,cd,afe,fe->d
        2              0              fe,fe->fe                    ea,ac,cd,afe,fe->d
        3              0            afe,ea->afe                       ac,cd,fe,afe->d
        4           GEMM            afe,ac->fec                          cd,fe,fec->d
        4           GEMM            fec,cd->fed                             fe,fed->d
        3           GEMM              fed,fe->d                                  d->d) 
    
    bug 
    opened by Bonnevie 13
  • [WIP] allow constant tensors in expressions

    This allows constant tensors to be supplied to contract_expression, which will then build as many constant intermediaries as it can. Still a bit of a work in progress but the core functionality is working so thought I'd share. Resolves #7.

    Example

    Basic setup:

    >>> from opt_einsum import contract_expression
    >>> from numpy.random import rand
    
    >>> equation = 'ij,jk,kl,lm,mn->ni'
    >>> const_ops = rand(2, 2), rand(2, 2), rand(2, 2)
    >>> ops = (3, 2), *const_ops, (2, 3)  # non-const ops are just shapes
    
    >>> expr = contract_expression(equation, *ops, constants=[1, 2, 3])
    >>> expr
    <ContractExpression('ij,[jk,kl,lm],mn->ni', constants=[1, 2, 3])>
    
    >>> print(expr)
    <ContractExpression('ij,[jk,kl,lm],mn->ni', constants=[1, 2, 3])>
      1.  'jm,ij->mi' [GEMM]
      2.  'mi,mn->ni' [GEMM]
    

    Now we can call it with just two arrays:

    >>> expr(rand(3, 2), rand(2, 3))
    array([[0.10982582, 0.06334146, 0.09379349],
           [0.09159049, 0.05282374, 0.07821902],
           [0.15775775, 0.09097925, 0.13471551]])
    
    >>> expr(rand(3, 2), rand(2, 3))
    array([[0.19207156, 0.21354127, 0.04811443],
           [0.01633236, 0.01815862, 0.004091  ],
           [0.06079124, 0.06758219, 0.0152304 ]])
    

    Notes

    • Currently this just performs as many contractions in the contraction_list as possible, stopping when it reaches a non-constant array. This means that some potential constant contractions might still be missed even though they don't depend on anything (e.g. 'a,a,b,b' with constants=[0, 1] might still choose to contract b,b first). One solution would be to somehow reshuffle the contraction_list but this gets a bit messy.
    • How do we feel about the repr?

    TODO

    • [x] ~~Sort out interaction with backend (need to not cache expression on first constants pass, and deal with various conversion points (e.g. keep constants on GPU)~~
    • [x] Document
    • [ ] Add edge case tests? Additional errors to help user?
    opened by jcmgray 13
  • shared_intermediates and contraction trees

    The shared_intermediates example in the docs is to calculate multiple marginals, which is a use case of interest to me. Calculating marginals is equivalent to calculating environment tensors, so I'm interested in using the approach from https://arxiv.org/abs/1310.8023, which, by using extensive memoization, basically guarantees that calculating all marginals costs only about 3x the cost of calculating one marginal. This is ensured by making each contraction path a member of the same contraction tree family, with each tree node at worst corresponding to 3 different contractions that can be appropriately memoized. This function is called multienv() in various tensor network packages.

    Here is a rough but working prototype of a ContractionTree class that can generate the various contraction paths by calls to get_environment_path.

    Contraction tree prototype
    from itertools import chain

    import networkx as nx
    from opt_einsum import contract_path

    # `HitDict` (a dict subclass that counts cache hits, mentioned below) is assumed
    # to be defined elsewhere by the user.

    class ContractionTree:
        """Interface for finding contraction paths for each possible environment tensor with high cache reuse.
    
        Based on "Improving the efficiency of variational tensor network algorithms" by Evenbly and Pfeifer. Returned
        environment paths stay in same contraction tree family.
        """
    
        def __init__(self, tensors, optimize='greedy'):
            operands = chain.from_iterable([(tensor.tensor, tensor.axis_names) for tensor in tensors])
            self.partition_path, _ = contract_path(*operands, (), optimize=optimize)
            self.contraction_tree = nx.DiGraph()
            for index, tensor in enumerate(tensors):
                self.contraction_tree.add_node(tensor, index=index)
            self.tensors = tensors
            self.cache = HitDict()
    
            active_nodes = [tensor for tensor in tensors]
            for path_step in self.partition_path:
                children = []
                for index in sorted(path_step, reverse=True):
                    children.append(active_nodes.pop(index))
                active_nodes.append(frozenset(children))
                for child in children:
                    self.contraction_tree.add_edge(child, active_nodes[-1])
    
        def get_environment_path(self, marginal_tensor):
            """Returns opt_einsum path for computing the environment tensor of `marginal_tensor`"""
            environment_tree = self.contraction_tree.copy()
    
            def redirect_edges(node, child, visited):
                """recursive function that redirects all edges towards initial node passed"""
                visited = visited + [node]
                out_edges = list(environment_tree.out_edges(node))
                for start, end in out_edges:
                    if child is None or end != child:
                        environment_tree.remove_edge(start, end)
                        environment_tree.add_edge(end, start)
                    if end not in visited:
                        redirect_edges(end, node, visited)
                in_edges = list(environment_tree.in_edges(node))
                for start, end in in_edges:
                    if start not in visited:
                        redirect_edges(start, node, visited)
    
            #  redirect and remove redundant nodes
            redirect_edges(marginal_tensor, None, [marginal_tensor])
            nodes = list(environment_tree.nodes)
            for node in nodes:
                in_edges = environment_tree.in_edges(node)
                out_edges = environment_tree.out_edges(node)
                if len(out_edges) and len(in_edges) == 1:
                    origin = list(in_edges)[0][0]
                    for out_edge in out_edges:
                        environment_tree.add_edge(origin, out_edge[1])
                    environment_tree.remove_node(node)
    
            #  maintain dict mapping tensor -> index position in list of tensors
            positions = {tensor: index for index, tensor in enumerate(self.tensors)}
            positions = {tensor: position - (positions[marginal_tensor] < position)
                         for tensor, position in positions.items()}
            del positions[marginal_tensor] #  assume the query tensor is removed from input string
    
            #  rebuild path from reoriented contraction tree
            path = []
            active = len(self.tensors) - 1
            for node in nx.topological_sort(environment_tree):
                children = [start for start, end in environment_tree.in_edges(node)]
                if len(children) > 1:
                    path.append(tuple(positions[tensor] for tensor in children))
                    positions = {tensor: position - sum(positions[removed_tensor] < position for removed_tensor in children)
                                 for tensor, position in positions.items() if tensor not in children}
                    active = active - len(children) + 1
                    positions[node] = active - 1
            return path
    

    My issue is

    1. Using the HitDict class proposed for counting cache hits, it seems that using these paths gives me 0 cache hits. I am using jax, but tried with numpy as well.
    2. Is shared_intermediates capable of reusing high-order intermediates as found in the contraction tree?

    I am calling it using interleaved input (I am using a wrapper of tensor with an axis_names attribute):

        def get_environment(self, tensors, environment_of, optimize='auto', **kwargs):
            environment_tensors = [tensor for tensor in tensors if tensor != environment_of]
            operands = chain.from_iterable([(tensor.tensor, tensor.axis_names) for tensor in environment_tensors])
            return oe.contract(*operands, environment_of.axis_names, optimize=optimize, **kwargs)
        
        def marginals(all_tensors, marginal_tensors):
            tree = ContractionTree(all_tensors)
            marginals = []
            for marginal_tensor in marginal_tensors:
                path = tree.get_environment_path(marginal_tensor)
                with shared_intermediates(tree.cache):
                    marginals.append(get_environment(all_tensors, marginal_tensor, optimize=path))
            return marginals, tree.cache
                
    

    I realize this might be outside the scope of opt_einsum, but it also seems like this would be a valuable extension of shared_intermediates - or it might already be supported via the current caching mechanism and there's just a bug in my implementation :)

    opened by Bonnevie 12
  • v3.0 Planning

    We are planning a major version bump for the following reasons:

    • Transition to be backend agnostic (NumPy is no longer the principal backend).
      • See #81 which has the (small) potential to break current implementations.

    TODO Items:

    • [ ] Refactoring in preparation for NEP18 (#46) should be considered.
    • [ ] Consider adding a Ricci calculus parser contract('A_ij B_jk A_kl', {"A": a, "B": b}).
    • [x] Run through the backends to check where we can refactor.
    • [x] Update docs and README to reflect that we are NumPy agnostic.
    • [ ] Consider parsers that use metrics besides raw FLOPS (actually trying out several contraction paths for example).

    Are there other items where we might need to do a few (small) compatibility breaks that would be good for future sustainability?

    release 
    opened by dgasmith 12
  • Contiguity loss after opt_einsum that may affect the efficiency of subsequent operations

    After a tensor contraction, the resulting tensor from numpy.einsum is C-contiguous, but with opt_einsum this property may be lost. If there is a further tensor addition, the non-contiguous array may be slower. Is there any way to make the resulting tensor from opt_einsum C-contiguous? If not, is there an option to request a C-contiguous result, constraining the path search to respect that?

    For example, the following code

    import numpy as np
    import opt_einsum as oe
    import time
    
    
    na = 5
    nb = 5
    nc = 8
    nd = 8
    ne = 2
    nf = 1
    ng = 8
    nh = 8
    ni = 8
    nj = 8
    
    dim_a = (na,nb,nc,nd,ne,nf)
    dim_b = (ne,nf,ng,nh,ni,nj)
    dim_c = (na,nb,nc,nd,ng,nh,ni,nj)
    n_times = 20
    
    contraction = 'abcdef,efghij -> abcdghij'
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    c = np.einsum(contraction, a, b)
    d = oe.contract(contraction, a, b)
    
    sum_array = np.zeros(dim_c)
    print(a.flags)
    print(b.flags)
    print(c.flags)
    print(d.flags)
    
    print('list')
    list_a = []
    list_b = []
    list_c = []
    for n in range(n_times):
        a = np.random.random(dim_a)
        b = np.random.random(dim_b)
        list_a.append(a)
        list_b.append(b)
    
    
    print('np')
    
    start = time.time()
    for n in range(n_times):
        list_c.append( np.einsum(contraction, list_a[n], list_b[n]) )
    print(time.time() - start) 
    
    sum_array = np.zeros(dim_c)
    start = time.time()
    for n in range(n_times):
        np.add(sum_array, list_c[n], out = sum_array)
    print(time.time() - start) 
    
    
    print('oe')
    opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize = "optimal") 
    
    list_a = []
    list_b = []
    list_c = []
    for n in range(n_times):
        a = np.random.random(dim_a)
        b = np.random.random(dim_b)
        list_a.append(a)
        list_b.append(b)
    
    
    start = time.time()
    for n in range(n_times):
       list_c.append(opt_path(list_a[n], list_b[n]))
    print(time.time() - start) 
    
    #start = time.time()
    #for n in range(n_times):
    #    list_c[n] = np.ascontiguousarray(list_c[n])
    #print(time.time() - start) 
    
    
    
    sum_array = np.zeros(dim_c)
    start = time.time()
    for n in range(n_times):
      #  list_c[n] = np.ascontiguousarray(list_c[n])
        np.add(sum_array, list_c[n], out = sum_array)
    print(time.time() - start) 
    

    The result is

      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : False
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
    list
    np
    1.9423425197601318
    0.32601356506347656
    oe
    1.6317214965820312
    1.2006006240844727
    

    in which the np.add part is much slower with opt_einsum than with einsum. If I use np.ascontiguousarray to make the resulting array C-contiguous, that step is itself slow.

    opened by Geositta2000 7
  • [WIP] add a preprocessor

    Description

    This adds a potential preprocessor that takes an equation and converts it into a minimal form with the following simplifications:

    • All indices which only appear on a single input (and not the output) are summed over.
    • All indices which appear multiple times on the same term are traced.
    • All scalars are multiplied into the smallest other term.
    • All terms with the same indices are multiplied (Hadamard product / elementwise) into a single term ('de-duplication').

    along with a function that transforms input arrays into the inputs for the new simplified eq. It is a function so that it can be part of a contract 'expression'.

    This would address (#114, #99, #112, #167, #189, ...).

    Example from the docstring:

    eq = ',ab,abee,,cd,cd,dd->ac'
    arrays = helpers.build_views(eq)
    new_eq, simplifier = make_simplifier(eq)
    new_eq
    # 'ab,cd,d->ac'
    sarrays = simplifier(arrays)
    oe.contract(new_eq, *sarrays)
    # array([[0.65245661, 0.14684493, 0.42543411, 0.32350895],
    #        [0.38357005, 0.08632807, 0.25010672, 0.19018636]])
    

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [ ] when and how should this be called?
    • [ ] return a partial path rather than a 'simplifier'?
    • [ ] should shape information (size_dict) be used and propagated for use with a contraction expression (and so that the smallest tensor to multiply scalars into can be most accurately chosen)?
    • [ ] other simplifications are possible (e.g. hadamard deduplication up to transpose, performing all non-rank increasing contractions)...
    • [x] ~currently the parsing is a single sweep, should it iteratively sweep til no more changes? E.g. currently examples like ab,ab-> transform to ab-> but if processed again could transform to ->~

    Status

    • [ ] Ready to go
    opened by jcmgray 1
  • Disable torch.backends.opt_einsum to avoid duplicate work

    Description

    Recently, torch.einsum has improved to automatically optimize for multi-contractions if the opt_einsum library is installed. This way, torch users can reap benefits easily. However, this change may inadvertently cause regressions for those who use opt_einsum.contract with torch tensors. While these use cases can be improved by just changing opt_einsum.contract calls to torch.einsum, it is not right to cause silent regressions.

    Consider the following example:

    import torch
    import opt_einsum
    
    A = torch.rand(4, 5)
    B = torch.rand(5, 6)
    C = torch.rand(6, 7) 
    D = opt_einsum.contract('ij,jk,kl->il', A, B, C)
    

    Looking through the code, opt_einsum.contract will figure out an ideal path + return that path in the form of contraction tuples (commonly pairs). Then, for each of the contraction tuples, opt_einsum may call BACK into the torch backend and call torch.einsum.

    Since the contractions are commonly pairs, calling back into torch.einsum will do what it used to--just directly do the contraction.

    HOWEVER. There are cases where the contractions are not necessarily pairs (@dgasmith had mentioned that hadamard products are faster when chained vs staggered in pairs) and in this case, torch.einsum will do unnecessary work in recomputing an ideal path, which is a regression from the previous state.

    This PR simply asks "hey, does the opt_einsum backend exist in torch?" If so, this means torch.einsum will do the unnecessary work in recomputing an ideal path, so let's just turn it off.
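
    Roughly, the change amounts to the following (a hedged sketch; torch.backends.opt_einsum only exists in recent torch versions, hence the guard):

    import torch

    # If torch ships its own opt_einsum integration, disable it so that torch.einsum
    # does not re-derive a contraction path that opt_einsum has already planned.
    if hasattr(torch.backends, "opt_einsum"):
        torch.backends.opt_einsum.enabled = False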

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [x] Add the code to disable torch's opt_einsum optimization
    • [ ] Test cases?

    Questions

    • [ ] Unsure how to best test this/what use case would ensure that, without this code, the path would be computed more than once, and after this code, the path would only be computed at the top level. We could just test that torch.backends.opt_einsum.enabled is globally False, I guess.

    Status

    • [ ] Ready to go
    opened by janeyx99 8
  • Remove NumPy as a hard dependency

    Description

    This PR removes NumPy from opt_einsum as a runtime dependency. The change has little effect on the opt_einsum code base but significantly affects the testing infrastructure. Several testing strategies are employed to help understand which is most effective. Currently, these are:

    • Move all NumPy functions to a testing.py module with pytest.skip
    • Various pytest.skip functions wrapped in decorators
    • Local NumPy imports.

    Feedback welcome!

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [x] Remove unused custom tensordot code
    • [ ] Remove import numpy from possibly_convert_to_numpy
    • [ ] Fix ssa_to_linear with a proper implementation

    Questions

    • [ ] Question1

    Status

    • [ ] Ready to go
    opened by dgasmith 3
  • Thoughts on making the numpy dependency optional?

    Proposal

    Make the numpy dependency optional, if possible.

    Why?

    Minimizing dependencies is a general goal as it allows a bigger audience to reap the benefits of this library. More specifically, some of us are interested in making opt_einsum a hard dependency for torch, but we would like to keep numpy unrequired. (If you're curious why torch does not have a hard dependency on numpy, see https://github.com/pytorch/pytorch/issues/60081, tl;dr being the last comment.)

    A hard dependency would mean all torch users would get the benefits of opt_einsum right away without thinking too hard/needing to manually install opt_einsum themselves.

    Alternatives

    We could also have torch vendor in opt_einsum, but that is increased complexity/maintenance + we would like to automatically subscribe to improvements in opt_einsum!

    opened by janeyx99 4
  • When I use opt_einsum to optimize torch.einsum, the running time after optimization increases.

    import numpy as np
    import time
    import torch
    from opt_einsum import contract
    
    dim = 4
    
    x = torch.randn(6 ,4 ,4, 4)
    w1 = torch.randn(1,4,dim)
    w2 = torch.randn(dim,4,dim)
    w3 = torch.randn(dim,4,dim)
    w4 = torch.randn(dim,8,dim)
    w5 = torch.randn(dim,8,dim)
    w6 = torch.randn(dim,4,1)
    
    def naive(x, w1, w2, w3, w4, w5, w6):
        return torch.einsum('bkxy,ikj,jxm,myf,fpl,lqz,zri -> bpqr', x, w1, w2, w3, w4, w5, w6)
    
    def optimized(x, w1, w2, w3, w4, w5, w6):
        return contract('bkxy,ikj,jxm,myf,fpl,lqz,zri -> bpqr', x, w1, w2, w3, w4, w5, w6)
    
    

    The respective running times:

    naive
    0.0005145072937011719
    
    optimized
    0.876018762588501
    

    I want to know what caused this. Thanks!

    opened by edwin-zft 5
Releases(v3.3.0)
  • v3.3.0(Jul 19, 2020)

    Adds an object backend for optimized contractions on arbitrary Python objects.

    New Features

    • (#145) Adds an object-based backend so that contract(backend='object') can be used on arbitrary objects such as SymPy symbols.
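
      As a hedged usage sketch (assuming SymPy is installed; the symbol names are purely illustrative):

      import numpy as np
      import sympy
      from opt_einsum import contract

      # Object-dtype arrays of SymPy symbols; the object backend contracts them
      # with plain Python arithmetic instead of BLAS.
      a = np.array([[sympy.Symbol(f"a{i}{j}") for j in range(2)] for i in range(2)], dtype=object)
      b = np.array([[sympy.Symbol(f"b{i}{j}") for j in range(2)] for i in range(2)], dtype=object)
      result = contract("ij,jk->ik", a, b, backend="object")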

    Enhancements

    • (#140) Better error messages when the requested contract backend cannot be found.
    • (#141) Adds a check with RandomOptimizers to ensure the objects are not accidentally reused for different contractions.
    • (#149) Limits the remaining category for the contract_path output to only show up to 20 tensors to prevent issues with the quadratically scaling memory requirements and the number of print lines for large contractions.
  • v3.2.1(Apr 15, 2020)

    Bug Fixes

    • (#131) Fixes an einstein subscript error message to point to the correct value.
    • (#135) Lazily loads JAX similar to other backends.
  • v3.2.0(Mar 3, 2020)

    Small fixes for the dp path and support for a new mars backend.

    New Features

    • (#109) Adds mars backend support.

    Enhancements

    • (#110) New auto-hq and 'random-greedy-128' paths.
    • (#119) Fixes several edge cases in the dp path.

    Bug fixes

    • (#127) Fixes an issue where Python 3.6 features are required while Python 3.5 is opt_einsum's stated minimum version.
  • v3.1.0(Sep 30, 2019)

  • v3.0.1(Aug 22, 2019)

    Alters setup.py to correctly state that opt_einsum requires Python 3.5+. This will now correctly affect PyPI and other pip mirror downloads; v3.0.0 will be removed from PyPI to prevent further issues.

  • v3.0.0(Aug 12, 2019)

    This release moves opt_einsum to be backend agnostic while adding support for additional backends such as JAX and Autograd. Support for Python 2.7 has been dropped and Python 3.5 is the new minimum version; a Python deprecation policy equivalent to NumPy's has been adopted.

    New Features

    • (#78) A new random-optimizer has been implemented which uses Boltzmann weighting to explore alternative near-minimum paths using greedy-like schemes. This provides fairly large path performance enhancements with a linear path-time overhead.
    • (#78) A new PathOptimizer class has been implemented to provide a framework for building new optimizers. For example, custom cost functions can now be provided in the greedy formalism, allowing custom optimizers to be built without a large amount of additional code.
    • (#81) The backend="auto" keyword has been implemented for contract, allowing automatic detection of the correct backend to use based on the tensors provided to the contraction.
    • (#88) Autograd and JAX support have been implemented.
    • (#96) Deprecates Python 2 functionality and devops improvements.

    Enhancements

    • (#84) The contract_path function can now accept shape tuples rather than full tensors.
    • (#84) The automated path-algorithm selection in contract_path has been refactored into a standalone function.
  • v2.3.2(Dec 9, 2018)

  • v2.3.1(Dec 1, 2018)

  • v2.3.0(Dec 1, 2018)

    This release primarily focuses on expanding the suite of available path technologies to provide better optimization characteristics for 4-20 tensors while decreasing the time to find paths for 50-200+ tensors. See Path Overview for more information.

    New Features:

    • (#60) A new greedy implementation has been added which is up to two orders of magnitude faster for 200 tensors.
    • (#73) Adds a new branch path that uses greedy ideas to prune the optimal exploration space, providing a better path than greedy at sub-optimal cost.
    • (#73) Adds a new auto keyword to the opt_einsum.contract path option. This keyword automatically chooses the best path technology that takes under 1ms to execute.

    Enhancements:

    • (#61) The opt_einsum.contract path keyword has been changed to optimize to more closely match NumPy. path will be deprecated in the future.
    • (#61) The opt_einsum.contract_path now returns an opt_einsum.contract.PathInfo object that can be queried for the scaling, flops, and intermediates of the path. The print representation of this object is identical to before.
    • (#61) The memory_limit is now unlimited by default, based on community feedback.
    • (#66) The Torch backend will now use tensordot when using a version of Torch which includes this functionality.
    • (#68) Indices can now be any hashable object when provided in the "Interleaved Input" syntax.
    • (#74) Allows the default transpose operation to be overridden to take advantage of more advanced tensor transpose libraries.
    • (#73) The optimal path is now significantly faster.

    Bug fixes:

    • (#72) Fixes the "Interleaved Input" syntax and adds documentation.
  • v2.2.0(Aug 29, 2018)

    New features:

    • (#48) Intermediates can now be shared between contractions, see here for more details.
    • (#53) Intermediate caching is thread safe.

    Enhancements:

    • (#48) Expressions are now mapped to a non-unicode index set so that unicode input is supported for all backends.
    • (#58) Adds tensorflow and theano with shared intermediates.

    Bug fixes:

    • (#41) PyTorch indices are mapped back to a small a-z subset valid for PyTorch's einsum implementation.
  • v2.1.3(Aug 23, 2018)

  • v2.1.2(Aug 15, 2018)

  • v2.1.1(Aug 14, 2018)

  • v2.1.0(Aug 14, 2018)

    opt_einsum continues to improve its support for additional backends beyond NumPy, now including PyTorch.

    We have also published the opt_einsum package in the Journal of Open Source Software. If you use this package in your work, please consider citing us!

    New features:

    • PyTorch backend support
    • Tensorflow eager-mode execution backend support

    Enhancements:

    • Intermediate tensordot-like expressions are now ordered to avoid transposes.
    • CI now uses conda backend to better support GPU and tensor libraries.
    • Now accepts arbitrary unicode indices rather than a subset.
    • New auto path option which switches between optimal and greedy at four tensors.

    Bug fixes:

    • Fixed issue where broadcast indices were incorrectly locked out of tensordot-like evaluations even after their dimension was broadcast.
  • v2.0.1(Jun 28, 2018)

    opt_einsum is a powerful tensor contraction order optimizer for NumPy and related ecosystems.

    New Features

    • Allows unlimited Unicode indices.
    • Adds a Journal of Open-Source Software paper.
    • Minor documentation improvements.
  • v2.0.0(May 17, 2018)

    opt_einsum is a powerful tensor contraction order optimizer for NumPy and related ecosystems.

    New Features

    • Expressions can be precompiled so that the expression optimization need not happen multiple times.
    • The greedy order optimization algorithm has been tuned to be able to handle hundreds of tensors in several seconds.
    • Input indices can now be unicode so that expressions can have many thousands of indices.
    • GPU and distributed computing backends have been added, such as Dask, TensorFlow, CuPy, Theano, and Sparse.

    Bug Fixes

    • An error affecting cases where opt_einsum mistook broadcasting operations for matrix multiplication has been fixed.
    • Most error messages are now more expressive.
  • 1.0(Oct 15, 2016)

    Official 1.0 release.

    Einsum is a very powerful function for contracting tensors of arbitrary dimension and index. However, it is only optimized to contract two terms at a time resulting in non-optimal scaling for contractions with many terms. Opt_einsum aims to fix this by optimizing the contraction order which can lead to arbitrarily large speed ups at the cost of additional intermediate tensors.

    Opt_einsum has also been incorporated into the np.einsum function as of NumPy v1.12.

  • v0.2.0(Jul 30, 2016)

    A large step towards a full 1.0 release. BLAS usage is now automatically applied to all operations. Future releases will be more careful with regard to views and needless data copying.

  • v0.1.1(Mar 23, 2016)

Owner
Daniel Smith
Twitter: @dga_smith