PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

Overview

PyNNDescent Logo

Azure Pipelines Build Status

LGTM Alerts LGTM Grade Documentation Status

PyNNDescent

PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbor Descent for k-neighbor-graph construction and approximate nearest neighbor search, as per the paper:

Dong, Wei, Charikar Moses, and Kai Li. "Efficient k-nearest neighbor graph construction for generic similarity measures." Proceedings of the 20th international conference on World wide web. ACM, 2011.

This library supplements that approach with the use of random projection trees for initialisation. This can be particularly useful for the metrics that are amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph diversification is also performed, pruning the longest edges of any triangles in the graph.

Currently this library targets relatively high accuracy (80%-100% accuracy rate) approximate nearest neighbor searches.

Why use PyNNDescent?

PyNNDescent provides fast approximate nearest neighbor queries. The ann-benchmarks system puts it solidly in the mix of top performing ANN libraries:

SIFT-128 Euclidean

ANN benchmark performance for SIFT 128 dataset

NYTimes-256 Angular

ANN benchmark performance for NYTimes 256 dataset

While PyNNDescent is among fastest ANN library, it is also both easy to install (pip and conda installable) with no platform or compilation issues, and is very flexible, supporting a wide variety of distance metrics by default:

Minkowski style metrics

  • euclidean
  • manhattan
  • chebyshev
  • minkowski

Miscellaneous spatial metrics

  • canberra
  • braycurtis
  • haversine

Normalized spatial metrics

  • mahalanobis
  • wminkowski
  • seuclidean

Angular and correlation metrics

  • cosine
  • dot
  • correlation
  • spearmanr
  • tsss
  • true_angular

Probability metrics

  • hellinger
  • wasserstein

Metrics for binary data

  • hamming
  • jaccard
  • dice
  • russelrao
  • kulsinski
  • rogerstanimoto
  • sokalmichener
  • sokalsneath
  • yule

and also custom user defined distance metrics while still retaining performance.

PyNNDescent also integrates well with Scikit-learn, including providing support for the KNeighborTransformer as a drop in replacement for algorithms that make use of nearest neighbor computations.

How to use PyNNDescent

PyNNDescent aims to have a very simple interface. It is similar to (but more limited than) KDTrees and BallTrees in sklearn. In practice there are only two operations -- index construction, and querying an index for nearest neighbors.

To build a new search index on some training data data you can do something like

from pynndescent import NNDescent
index = NNDescent(data)

You can then use the index for searching (and can pickle it to disk if you wish). To search a pynndescent index for the 15 nearest neighbors of a test data set query_data you can do something like

index.query(query_data, k=15)

and that is pretty much all there is to it. You can find more details in the documentation.

Installing

PyNNDescent is designed to be easy to install being a pure python module with relatively light requirements:

  • numpy
  • scipy
  • scikit-learn >= 0.22
  • numba >= 0.51

all of which should be pip or conda installable. The easiest way to install should be via conda:

conda install -c conda-forge pynndescent

or via pip:

pip install pynndescent

To manually install this package:

wget https://github.com/lmcinnes/pynndescent/archive/master.zip
unzip master.zip
rm master.zip
cd pynndescent-master
python setup.py install

Help and Support

This project is still young. The documentation is still growing. In the meantime please open an issue and I will try to provide any help and guidance that I can. Please also check the docstrings on the code, which provide some descriptions of the parameters.

License

The pynndescent package is 2-clause BSD licensed. Enjoy.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation are all equally valuable so please don't feel you can't contribute. To contribute please fork the project make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Comments
  • fixes for Numba 0.53.0dev0

    fixes for Numba 0.53.0dev0

    It seems like previous Numba versions (up to 0.51.2) may have worked 'by accident'. When running pynndescent against a recent development version of Numba. I receievd many errors during test collection of the form:

    E   numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
    E   Storing i64 to ptr of i32 ('dim'). FE type uint32
    E
    E   File "../miniconda3/envs/umap/lib/python3.7/site-packages/pynndescent/utils.py", line 88:
    E   def norm(vec):
    E       <source elided>
    E       result = 0.0
    E       dim = vec.shape[0]
    E       ^
    E
    E   During: lowering "dim = static_getitem(value=$8load_attr.2, index=0, index_var=$const10.3, fn=<built-in function getitem>)" at /Users/vhaenel/git/numba-integration-testing/miniconda3/envs/umap/lib/python3.7/site-packages/pynndescent/utils.py (88)
    

    Not all errors are shown immediately as pytest fails on the first encounter, so they have to be fixed one-by-one.

    opened by esc 13
  • backport UMAP 0.4dev changes

    backport UMAP 0.4dev changes

    Relates to https://github.com/lmcinnes/pynndescent/issues/37

    This isn't intended to be merged in its current state, this is just to conveniently show the current state of the diffs. The things to look at are:

    • numba decorations. Sometimes UMAP has more, sometimes it's vice versa. I don't know which ones are necessary or helpful.
    • In new_build_candidates, there is a test failure when using numba.prange, so I have commented it out and used a plain old range.
    • The biggest change is to the nearest neighbor descent routine, which introduces a set to keep track of previously seen pairs, and unchecked heap pushes. There are two test failures due to a much lower expected accuracy, unless I comment out the use of the unchecked heap push in the first loop in the nn descent iteration. Not tracked down the problem there yet.
    • In general for nn descent, is it safe to use a set to cache the lookups? It seems like it could grow to be O(N^2) in the worst case? I think something like this is mentioned in the nn descent paper.
    • In UMAP 0.4, the search initialization is handled slightly differently, in that each function is separate and dist and dist_args have to be provided to init_from_tree and its ilk directly, whereas in pynndescent, dist and dist_args are passed to a factory function and the initialization functions access them via closure. Which is the preferred way to do things?
    opened by jlmelville 13
  • -1 entries in neighbor_graph

    -1 entries in neighbor_graph

    I get some -1 entries in the kNN graph when I build it on a sparse matrix with cosine distance. Is this intentional? Does -1 mean X.shape[0]-1? When I run query(X), I don't get any negative elements. E.g.:

    X = scipy.sparse.csr_matrix(np.random.randn(10000,1247)>3)
    nn = pynndescent.NNDescent(X, metric='cosine', n_neighbors=15)
    

    Now nn.neighbor_graph[0] has some values equal to -1 but nn.query(X, k=15)[0] does not.

    Update: Forgot to say that I do get a warning "UserWarning: Failed to correctly find n_neighbors for some samples.Results may be less than ideal. Try re-running withdifferent parameters." from NNDescent. Maybe that's what -1 indicate? But then query() does not return any negative elements and does not complain. How should one approach this?

    opened by dkobak 11
  • Parallelize deheap_sort

    Parallelize deheap_sort

    This is a very simple change: I added parallel=True to utils.deheap_sort and I used a numba.prange in the top loop.

    Sometimes, when I'm watching CPU usage for NNDescent on very large arrays, I've seen it spend a fair amount of time in a single thread near the end of the operation. Looking at the code, I suspect it's just dealing with this final deheap_sort call.

    Because the function operates on each row independently, it's pretty simple to wrap it in a parallel loop and let numba figure it out. It seems to work but I haven't tested rigorously yet. Hoping travis can do that for me.

    ~As an additional tweak, I added python 3.9 to the test matrix here, to see what happens.~ edit: obsolete given recent updates

    opened by jamestwebber 10
  • Measuring minkowski with p= 0.3

    Measuring minkowski with p= 0.3

    Hi,

    Thanks for such great package!

    I am using version 0.5.2 with Python 3.6.5. I compared the results of using euclidean distance vs. minkowski distance with p= 0.3 in the code below: import numpy as np import pynndescent

    data= np.random.randn(500,9)

    euclidean= pynndescent.NNDescent(data, metric= 'euclidean', n_neighbors= 15, random_state= 1969, n_jobs= -1) euclidean.prepare() euc_neig, euc_dist= euclidean.query(data, k= 15)

    minkowski= pynndescent.NNDescent(data, metric= 'minkowski', metric_kwds= {'p': 0.3}, n_neighbors= 15, random_state= 1969, n_jobs= -1) minkowski.prepare() min_neig, min_dist= minkowski.query(data, k= 15)

    Then, I looked at the outputs for the item in the query: euc_neig[0,:] array([ 0, 150, 298, 308, 48, 498, 93, 277, 188, 184, 219, 146, 420, 309, 160]) euc_dist[0,:] array([0. , 2.3493888, 2.3878694, 2.5395243, 2.5565991, 2.5695498, 2.5878463, 2.6194322, 2.624215 , 2.6499615, 2.662645 , 2.6754963, 2.7257159, 2.7690816, 2.808471 ], dtype=float32)

    min_neig[0,:] array([ 0, 107, 14, 287, 260, 348, 57, 464, 22, 319, 298, 3, 186, 15, 424]) min_dist[0,:] array([ 10.079369, 219.20055 , 223.33961 , 227.77594 , 230.01102 , 248.14621 , 260.6848 , 288.2422 , 292.64026 , 304.21317 , 306.87302 , 311.07306 , 322.3568 , 326.94174 , 331.8149 ], dtype=float32)

    As expected, the self-measured distance (first item in the query) is 0 for euclidean. However, for minkowski, the self-measured distance is not 0.

    My question is, why the minkowski distance does not return 0 for the first item?

    Thanks, Ivan

    opened by ivan-marroquin 10
  • Multi-threaded implementation

    Multi-threaded implementation

    Currently, pynndescent doesn't take full advantage of multiple cores on a machine. There are some methods annotated with the Numba parallel option, but these don't provide any noticeable speed up when more cores are available.

    This PR allows nearest neighbor computation to take advantage of multiple cores by processing heaps in parallel chunks. The basic idea is an application of the MapReduce approach described in section 2.8 of Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures by Dong et al.

    Note that in this PR all computation is carried out on a single machine using the standard Python concurrent.futures library, which complements Numba very well, as shown in this example.

    Going into slightly more detail, a distributed heap can be updated in a MapReduce-style operation as follows.

    We have a large array of all the heap updates held in memory and a thread pool to update chunks in parallel in two phases, called "Map" and "Reduce". Each Map partition holds an array of heap updates, corresponding to a row in the following diagram.

    heap_updates

    The Map phase adds updates in the order they were generated (from heap_push calls). After all the updates have been added for a partition, the updates for that partition are sorted by row number. Then the chunk boundaries corresponding to partitions are found, so that in the Reduce phase the array can process all the updates for a given partition together, and apply them to the target heap (not illustrated).

    For the Map phase the array is processed row-wise, and in the Reduce phase it is processed column-wise (although the column bounds are different for each row owing to the different chunk boundaries). The result is an updated distributed heap.

    This PR also adds tests that check the threaded implementation produces the same result as the regular implementation. Since NN-Descent uses a source of randomness, when running the tests the random seed is set to the value of each row index to ensure the same deterministic result is obtained. This per-row seeding is only used for tests. However, to make threaded runs more reproducible when run normally (outside of tests) each thread is seeded from the initial random seed, which means that the result from two threaded runs (with the same number of threads) is the same if the same initial seed is used in each case.

    In terms of performance, the multi-threaded implementation outperforms the regular implementation with 4 cores or more. In one experiment, regular pynndescent took 27 minutes for a million rows (D=128, NN=25), compared to just over 3 minutes with 64 threads.

    There is more work to consider how to wire up the implementation to the standard NNDescent object, but I wanted to get feedback on this approach before going any further.

    opened by tomwhite 9
  • Added spearmanr

    Added spearmanr

    Addresses issue https://github.com/lmcinnes/umap/issues/319 and https://github.com/lmcinnes/pynndescent/issues/85

    I know tests are missing, but I am not 100% sure how to add them the right way. Guidance welcome.

    opened by sleighsoft 8
  • pynndescent throws error when n_jobs=-1

    pynndescent throws error when n_jobs=-1

    Hi, thanks for your work on this project! Unfortunately I seem to have run into a bug through another project that uses pynndescent as a dependency. The owner suggested that I open an issue here. See here for details: https://github.com/pavlin-policar/openTSNE/issues/92#issue-487746322

    opened by jgraving 8
  • NNDescent.update(X) not working as expected

    NNDescent.update(X) not working as expected

    I wanted to do the following:

    1. Create a NNDescent object from the first batch of data.
    2. Query some data.
    3. Wait for some time.
    4. Update the object with new data, but not start from scratch.
    5. Query some data.

    I assumed that NNDescent.update(fresh data) would do the trick. However, maybe I am using it in a wrong way, since it seems that the returned neighbours are only those from the first batch of the data. While inspecting index = NNDescent(...) after the update in debug mode I discovered that

    • index._raw_data and index.neighbor_graph (and index._neighbor_graph) seem to be updated properly (the indices of the neighbours of all 205 instances follow the basic intuition - see below),
    • index._vertex_order was not updated properly (it still has the length of 100 and not 205).

    The data (xs and ys) were generated, so that they look like this:

    xs:
    0 r r r r
    1 r r r r
    2 r r r r
    ...
    99 r r r r
    
    ys:
    100 r r r r
    101 r r r r
    ...
    204 r r r r
    

    where r is an always different random value (from (0, 1)). Thus, the nearest neighbours of the instance i (0 <= i <= 204) should typically be i, i - 1, i + 1, i - 2, i + 2, ... (possibly not in this exact order). However, the code below says that the neighbours of the instances i >= 100 (those from ys) are 99, 98, 97 ... which suggests that the neighbours were found only among xs.

    The actual output of the code below (indices/distances):

    [99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
    [98, 99, 97, 96, 95, 94, 93, 92, 91, 89]
    [99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
    ... (some lines omitted)
    [99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
    [99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
     
    [2.1192119121551514, 2.8882203102111816, 4.930568218231201, 5.635573387145996, 6.850753307342529, 7.7378249168396, 8.205963134765625, 9.593037605285645, 10.798347473144531, 11.230871200561523]
    [4.201259613037109, 4.454722881317139, 5.768087387084961, 6.7560529708862305, 7.452332973480225, 8.632887840270996, 9.428315162658691, 9.682640075683594, 11.393900871276855, 13.470145225524902]
    [4.810489654541016, 5.09433650970459, 6.398769378662109, 7.656699180603027, 8.318954467773438, 9.736209869384766, 9.871195793151855, 11.387312889099121, 12.959441184997559, 13.922148704528809]
    ... (some lines omitted)
    [105.62753295898438, 106.07768249511719, 107.53001403808594, 108.47374725341797, 109.60338592529297, 110.31566619873047, 110.80841064453125, 112.75614166259766, 114.3282699584961, 114.73919677734375]
    [106.41008758544922, 106.78266906738281, 107.53002166748047, 108.84111022949219, 109.72077178955078, 111.25269317626953, 111.68487548828125, 113.38243103027344, 114.24507141113281, 115.30052947998047]
    

    The code:

    import numpy as np
    import pynndescent
    import pickle
    import os
    import numba
    
    @numba.njit()
    def my_manhattan(x, y):
        return np.sum(np.abs(x - y))
    
    def test3():
        n_x = 100
        xs = np.random.rand(n_x, 5)
        xs[:, 0] = np.arange(0, n_x, 1)
        n_y = 105
        ys = np.random.rand(n_y, 5)
        ys[:, 0] = np.arange(n_x, n_x + n_y, 1)
        index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=5)
        # index.prepare()
        k = 10
        ns_x, ds_x = index.query(xs, k=k)
        index.update(ys)
        # index._init_search_graph()
        ns_y, ds_y = index.query(ys, k=k)
        for n_y in ns_y:
            print(n_y.tolist())
        print(" ")
        for d_y in ds_y:
            print(d_y.tolist())
    
    
    test3()
    

    I installed pynndescent via pip.

    bug 
    opened by Petkomat 7
  • Cannot query an index constructed from sparse matrix

    Cannot query an index constructed from sparse matrix

    I have a sparse matrix X on which I can successfully run UMAP:

    <100000x9630 sparse matrix of type '<class 'numpy.float64'>'
    	with 266398 stored elements in List of Lists format>
    

    I can construct an index like this, and it works fine:

    nn = NNDescent(X, metric='cosine')
    

    However, querying this index does not seem to work. nn.query(X[:5,:], k=15) returns the following error:

    ~/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
       1263 
       1264         if self._distance_correction is not None:
    -> 1265             dists = self._distance_correction(dists)
       1266 
       1267         return indices, dists
    
    ~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_args(self, *args, **kws)
        164                     argty = argty.dtype
        165                 argtys.append(argty)
    --> 166         return self._compile_for_argtys(tuple(argtys))
        167 
        168     def _compile_for_argtys(self, argtys, return_type=None):
    
    ~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_argtys(self, argtys, return_type)
        184             self._dispatcher, self.targetoptions, sig)
        185         actual_sig = ufuncbuilder._finalize_ufunc_signature(
    --> 186             cres, argtys, return_type)
        187         dtypenums, ptr, env = ufuncbuilder._build_element_wise_ufunc_wrapper(
        188             cres, actual_sig)
    
    ~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/ufuncbuilder.py in _finalize_ufunc_signature(cres, args, return_type)
        140         if cres.objectmode:
        141             # Object mode is used and return type is not specified
    --> 142             raise TypeError("return type must be specified for object mode")
        143         else:
        144             return_type = cres.signature.return_type
    
    TypeError: return type must be specified for object mode
    

    Versions:

    UMAP 0.3.10
    pynndescent 0.4.5
    numba 0.46.0
    
    opened by dkobak 7
  • Error when trying to construct an index with NNDescent

    Error when trying to construct an index with NNDescent

    Hello,

    When I try to construct a NNDescent (pynndescent v0.4.3) index using correlation or euclidean distance I get a numba error (Numba version 0.47.0).

    Here is the code I used that triggers the error and X is a NumPy float64 array with shape (1624, 2000).:

    from pynndescent import NNDescent
    
    def _KNN(X, random_state, n_neighbors = 30, metric='correlation'):
        index = NNDescent(X, n_neighbors=n_neighbors, metric=metric, random_state=random_state)
        knn_indices, knn_dists = index.query(X, n_neighbors=n_neighbors)
    

    Here is the error I get:

    ~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_epsilon, n_search_trees, tree_init, random_state, algorithm, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, seed_per_row, verbose)
        977                     leaf_array=leaf_array,
        978                     verbose=verbose,
    --> 979                     seed_per_row=seed_per_row,
        980                 )
        981         else:
    
    ~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
        399                 e.patch_message(msg)
        400 
    --> 401             error_rewrite(e, 'typing')
        402         except errors.UnsupportedError as e:
        403             # Something unsupported is present in the user code, add help info
    
    ~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
        342                 raise e
        343             else:
    --> 344                 reraise(type(e), e, None)
        345 
        346         argtypes = []
    
    ~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
        666             value = tp()
        667         if value.__traceback__ is not tb:
    --> 668             raise value.with_traceback(tb)
        669         raise value
        670 
    
    TypingError: Failed in nopython mode pipeline (step: nopython frontend)
    Failed in nopython mode pipeline (step: nopython frontend)
    Internal error at <numba.typeinfer.CallConstraint object at 0x122cd8eb8>.
    Failed in nopython mode pipeline (step: nopython mode backend)
    scalar type Tuple() given for non scalar argument #5
    
    File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 232:
    def generate_leaf_updates(leaf_block, dist_thresholds, data, dist, dist_args):
        <source elided>
    
        for n in numba.prange(leaf_block.shape[0]):
        ^
    
    [1] During: lowering "id=9[LoopNest(index_variable = parfor_index.1714, range = (0, $28.6, 1))]{186: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (241)>, 136: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 138: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 88: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)>, 718: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 174: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 720: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 208: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (243)>, 210: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (244)>, 112: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 212: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 114: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (236)>, 214: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 216: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 90: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 219: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>}Var(parfor_index.1714, pynndescent_.py:232)" at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)
    [2] During: resolving callee type: type(CPUDispatcher(<function generate_leaf_updates at 0x11d87c950>))
    [3] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (265)
    
    Enable logging at debug level for details.
    
    File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 265:
    def init_rp_tree(data, dist, dist_args, current_graph, leaf_array):
        <source elided>
            updates = generate_leaf_updates(
                leaf_block, dist_thresholds, data, dist, dist_args
                ^
    
    [1] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
    [2] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)
    
    [3] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
    [4] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)
    
    
    File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 464:
    def nn_descent(
        <source elided>
        if rp_tree_init:
            init_rp_tree(data, dist, dist_args, current_graph, leaf_array)
            ^
    
    This is not usually a problem with Numba itself but instead often caused by
    the use of unsupported features or an issue in resolving types.
    
    To see Python/NumPy features supported by the latest release of Numba visit:
    http://numba.pydata.org/numba-doc/latest/reference/pysupported.html
    and
    http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html
    
    For more information about typing errors and how to debug them visit:
    http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile
    
    If you think your code should work with Numba, please report the error message
    and traceback, along with a minimal reproducer at:
    https://github.com/numba/numba/issues/new
    
    opened by GWW 7
  • Fatal Python error: Segmentation faultFatal Python error: Segmentation fault

    Fatal Python error: Segmentation faultFatal Python error: Segmentation fault

    I am running the Top2Vec code on MacOS 12.6 with Python 3.8, tensorflow-macos and tensorflow-metal. I'm getting a segmentation exception in nn_descent: Screen Shot 2022-09-27 at 2 12 54 PM

    Here's the packages:

     % pip show pynndescent   
    Name: pynndescent
    Version: 0.5.7
    Summary: Nearest Neighbor Descent
    Home-page: http://github.com/lmcinnes/pynndescent
    Author: Leland McInnes
    Author-email: [email protected]
    License: BSD
    Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
    Requires: joblib, llvmlite, numba, scikit-learn, scipy
    Required-by: umap-learn
    (tensorflow) davidlaxer@x86_64-apple-darwin13 Top2Vec % pip show umap-learn
    Name: umap-learn
    Version: 0.5.3
    Summary: Uniform Manifold Approximation and Projection
    Home-page: http://github.com/lmcinnes/umap
    Author: 
    Author-email: 
    License: BSD
    Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
    Requires: numba, numpy, pynndescent, scikit-learn, scipy, tqdm
    Required-by: tensorflow-similarity, top2vec
    
     % pip show numba
    Name: numba
    Version: 0.56.2
    Summary: compiling Python code using LLVM
    Home-page: https://numba.pydata.org
    Author: 
    Author-email: 
    License: BSD
    Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
    Requires: importlib-metadata, llvmlite, numpy, setuptools
    Required-by: datashader, pynndescent, umap-learn
    
    

    Here's the stack before the exception:

    knn_search_index = NNDescent(
        X,
        n_neighbors=n_neighbors,
        metric=metric,
        metric_kwds=metric_kwds,
        random_state=random_state,
        n_trees=n_trees,
        n_iters=n_iters,
        max_candidates=60,
        low_memory=low_memory,
        n_jobs=n_jobs,
        verbose=verbose,
        compressed=False,
    )
    
    _deepcopy_atomic, copy.py:183
    deepcopy, copy.py:146
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _reconstruct, copy.py:270
    deepcopy, copy.py:172
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _reconstruct, copy.py:270
    deepcopy, copy.py:172
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _reconstruct, copy.py:270
    deepcopy, copy.py:172
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _deepcopy_dict, copy.py:230
    deepcopy, copy.py:146
    _reconstruct, copy.py:270
    deepcopy, copy.py:172
    slice_size, array_analysis.py:1817
    to_shape, array_analysis.py:2074
    _index_to_shape, array_analysis.py:2090
    _analyze_op_getitem, array_analysis.py:2139
    guard, ir_utils.py:1527
    _analyze_expr, array_analysis.py:1571
    _analyze_inst, array_analysis.py:1311
    _determine_transform, array_analysis.py:1273
    _run_on_blocks, array_analysis.py:1198
    run, array_analysis.py:1178
    _pre_run, parfor.py:2859
    run, parfor.py:2867
    run_pass, typed_passes.py:306
    check, compiler_machinery.py:273
    _runPass, compiler_machinery.py:311
    _acquire_compile_lock, compiler_lock.py:35
    run, compiler_machinery.py:356
    _compile_core, compiler.py:486
    _compile_ir, compiler.py:527
    compile_ir, compiler.py:462
    compile_ir, compiler.py:779
    _create_gufunc_for_parfor_body, parfor_lowering.py:1509
    _lower_parfor_parallel, parfor_lowering.py:316
    lower_inst, lowering.py:567
    lower_block, lowering.py:265
    lower_function_body, lowering.py:251
    lower_normal_function, lowering.py:222
    lower, lowering.py:168
    run_pass, typed_passes.py:394
    check, compiler_machinery.py:273
    _runPass, compiler_machinery.py:311
    _acquire_compile_lock, compiler_lock.py:35
    run, compiler_machinery.py:356
    _compile_core, compiler.py:486
    _compile_bytecode, compiler.py:520
    compile_extra, compiler.py:452
    compile_extra, compiler.py:716
    _compile_core, dispatcher.py:152
    _compile_cached, dispatcher.py:139
    compile, dispatcher.py:125
    compile, dispatcher.py:965
    get_call_template, dispatcher.py:363
    get_call_type, functions.py:540
    _resolve_user_function_type, context.py:248
    resolve_function_type, context.py:196
    resolve_call, typeinfer.py:1555
    resolve, typeinfer.py:601
    __call__, typeinfer.py:578
    propagate, typeinfer.py:155
    propagate, typeinfer.py:1078
    type_inference_stage, typed_passes.py:83
    run_pass, typed_passes.py:105
    check, compiler_machinery.py:273
    _runPass, compiler_machinery.py:311
    _acquire_compile_lock, compiler_lock.py:35
    run, compiler_machinery.py:356
    _compile_core, compiler.py:486
    _compile_bytecode, compiler.py:520
    compile_extra, compiler.py:452
    compile_extra, compiler.py:716
    _compile_core, dispatcher.py:152
    _compile_cached, dispatcher.py:139
    compile, dispatcher.py:125
    compile, dispatcher.py:965
    _compile_for_args, dispatcher.py:420
    __init__, pynndescent_.py:891
    nearest_neighbors, umap_.py:328
    fit, umap_.py:2516
    __init__, Top2Vec.py:669
    <module>, test1.py:26
    

    Verbose output:

    ...
    UMAP(n_neighbors=10, verbose=True)
    OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
    Wed Sep 28 07:46:37 2022 Construct fuzzy simplicial set
    Wed Sep 28 07:46:37 2022 Finding Nearest Neighbors
    Wed Sep 28 07:47:11 2022 Building RP forest with 10 trees
    Wed Sep 28 07:47:12 2022 NN descent for 14 iterations
    Fatal Python error: Segmentation faultFatal Python error: Segmentation fault
    
    Thread 0x0000700011629000 (most recent call first):
      File "/Users/davidlaxer/anaconda3/libFatal Python error: 
    
    /Segmentation faultp
    
    ython3.8/threading.py", line 306 in wait
      File "/Users/davidlaxer/anaconda3/lib/python3.8/threading.py", line 558 in wa
    Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
    
    

    Any clues as to what's causing the exception?

    opened by dbl001 1
  • self implemented distance metric

    self implemented distance metric

    Dear Pynndescent team,

    I am wondering whether there could be an option for this library to provide an API to allow external distance metric so that some unique distance metric can be used., e.g., for biological applications, distance computation was actually based on MinHash of genomic profiles, none of the provided metric is fast in this case.

    Thanks,

    Jianshu

    opened by jianshu93 3
  • Importing pickled index gives error - 'NNDescent' object has no attribute 'shape'

    Importing pickled index gives error - 'NNDescent' object has no attribute 'shape'

    Hi Using the latest version via pip install. Running this on Google Colab with python 3.7.13 Followed the Docs and created an index with the params pynnindex = pynndescent.NNDescent(arr, metric="cosine", n_neighbors=100) Everything works fine and I get results from pynnindex.neighbor_graph as expected.

    Then I pickled the index like so:

    with open('pynnindex','wb') as f:
      pickle.dump(pynnindex,f)
    

    The trouble starts when I try to load the pickled index later on, like so:

    with open('pynnindex','rb') as f:
      msgembed = pickle.load(f)
    

    I get the following error:

    AttributeError                            Traceback (most recent call last)
    [<ipython-input-7-44d19f2c98cc>](https://localhost:8080/#) in <module>
          3   msgembed = pickle.load(f)
          4 
    ----> 5 pynnindex_p = pynndescent.NNDescent(msgembed)
    
    [/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py](https://localhost:8080/#) in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
        671 
        672         if n_trees is None:
    --> 673             n_trees = 5 + int(round((data.shape[0]) ** 0.25))
        674             n_trees = min(32, n_trees)  # Only so many trees are useful
        675         if n_iters is None:
    
    AttributeError: 'NNDescent' object has no attribute 'shape'
    

    The error crops up even after running prepare() on the index. Tried loading the index on another system (with python 3.9.0) and got a different error:

    Traceback (most recent call last):
      File "D:\Code\markupf\pynn.py", line 6, in <module>
        msgembeds = pickle.load(f)
      File "D:\Code\gnanetra\gnanetra\lib\site-packages\numba\core\serialize.py", line 97, in _unpickle__CustomPickled
        ctor, states = loads(serialized)
    TypeError: an integer is required (got type bytes)
    

    Essentially, to use the index, I need to build it every time. Can anyone point me towards a resolution.

    Thanks

    opened by regstuff 0
  • When distance computation is expensive how to gradually build graph

    When distance computation is expensive how to gradually build graph

    Hello pynndescent team,

    I am in a situation where, distance computation is very expensive (genomic sequence), all versus all distance as input is not possible (it takes forever for all versus all distance computation). I am wondering how to compute distance only when necessary in NNDescent() building, that is k graph is build gradually without the need to compute all versus all in the dataset but only those are needed, e.g., 2 datapoint that are far away from each other in k graph, we do not need their distance. Any idea how, original data as input and we have our own distance function.

    Thanks,

    Jianshu

    opened by jianshu93 3
  • Can't pass parameters to weighted_minkowski distance metric

    Can't pass parameters to weighted_minkowski distance metric

    When I run this code, index = NNdescent(data = data, metric = 'weighted_minkowski', metric_kwds={'p':1} ) I get an error of no match with getitem. How can I pass this?

    Also, I need to pass an array of weights to weight the metric. How can I pass those? By convetring to a dict?

    Thanks, David

    opened by Stevod 3
  • Distributed PyNNDescent

    Distributed PyNNDescent

    I'm doing a research about distributed ways do build the kNNG, and i am especifically studying NNDescent. When i tested the distributed-pynndescent, that is in the fork, i couldn't perform it. Is this algorithm ready or it is on development?

    opened by Gab0410 1
Releases(0.5.8)
  • 0.5.8(Oct 31, 2022)

    What's Changed

    • Fixing the bug with n_trees introduced when fixing #180 by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/183
    • Accepting rp trees with no splits by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/185
    • [minor] clean up some strings by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/171
    • [small fix] add spaces to warning message by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/145
    • Only one update method by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/189
    • Replace pkg_resources with importlib.metadata by @konstin in https://github.com/lmcinnes/pynndescent/pull/137
    • Increase test_spearmanr tolerance by @ginggs in https://github.com/lmcinnes/pynndescent/pull/199
    • Fix typos in docu by @MalteIwanicki in https://github.com/lmcinnes/pynndescent/pull/198

    New Contributors

    • @Petkomat made their first contribution in https://github.com/lmcinnes/pynndescent/pull/183
    • @konstin made their first contribution in https://github.com/lmcinnes/pynndescent/pull/137
    • @ginggs made their first contribution in https://github.com/lmcinnes/pynndescent/pull/199
    • @MalteIwanicki made their first contribution in https://github.com/lmcinnes/pynndescent/pull/198

    Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.7...0.5.8

    Source code(tar.gz)
    Source code(zip)
  • 0.5.7(May 14, 2022)

    What's Changed

    • removed old CI files by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/169
    • fix wminkowski implementation, and test for scipy >= 1.8.0 by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/170
    • Parallelize deheap_sort by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/168
    • [bug] was overwriting input indices in graph utils function by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/172
    • fix: remove cache=True from numba functions that implement parallel=True by @rhysnewell in https://github.com/lmcinnes/pynndescent/pull/175
    • Ensure query after to update works correctly (fixes #180) by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/181
    • Ensure that tree_init=False works including for querying by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/182

    New Contributors

    • @rhysnewell made their first contribution in https://github.com/lmcinnes/pynndescent/pull/175

    Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.6...0.5.7

    Source code(tar.gz)
    Source code(zip)
  • 0.5.6(Jan 21, 2022)

    What's Changed

    • Fix "'NNDescent' object has no attribute '_vertex_order'" crash in NNDescent.update by @ThomasNickerson in https://github.com/lmcinnes/pynndescent/pull/147
    • 1d wasserstein by @cjweir in https://github.com/lmcinnes/pynndescent/pull/155
    • fix docstring for NNDescent by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/160
    • Optional parallel query batch by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/162
    • Update sparse distances by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/165

    New Contributors

    • @ThomasNickerson made their first contribution in https://github.com/lmcinnes/pynndescent/pull/147

    Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.5...0.5.6

    Source code(tar.gz)
    Source code(zip)
  • 0.5.5(Oct 15, 2021)

    What's Changed

    • [bug fix] Always used checked_flagged_heap_push to prevent redundant edges by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/138
    • Add deprecation notice to n_search_trees parameter docstring by @jlmelville in https://github.com/lmcinnes/pynndescent/pull/143

    New Contributors

    • @jamestwebber made their first contribution in https://github.com/lmcinnes/pynndescent/pull/138

    Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.4...0.5.5

    Source code(tar.gz)
    Source code(zip)
  • 0.5.4(Jul 6, 2021)

  • 0.5.3(Jul 5, 2021)

  • 0.5.2(Feb 8, 2021)

  • 0.4.8(Jun 29, 2020)

  • 0.4.7(Mar 14, 2020)

  • 0.4.6(Mar 13, 2020)

  • 0.4.2(Dec 13, 2019)

Owner
Leland McInnes
Leland McInnes
A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Nicholas Monath 31 Nov 3, 2022
Iterative stochastic gradient descent (SGD) linear regressor with regularization

SGD-Linear-Regressor Iterative stochastic gradient descent (SGD) linear regressor with regularization Dataset: Kaggle “Graduate Admission 2” https://w

Zechen Ma 1 Oct 29, 2021
Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Neighbourhood Retrieval with Distance Correlation Assign Pseudo class labels to datapoints in the latent space. NNDC is a slim wrapper around FAISS. N

The Learning Machines 1 Jan 16, 2022
Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark environment.

pyspark-anonymizer Python library which makes it possible to dynamically mask/anonymize data using JSON string or python dict rules in a PySpark envir

null 6 Jun 30, 2022
Educational python for Neural Networks, written in pure Python/NumPy.

Educational python for Neural Networks, written in pure Python/NumPy.

null 127 Oct 27, 2022
learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

learn python in 100 days, a simple step could be follow from beginner to master of every aspect of python programming and project also include side project which you can use as demo project for your personal portfolio

BDFD 6 Nov 5, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
A modular active learning framework for Python

Modular Active Learning framework for Python3 Page contents Introduction Active learning from bird's-eye view modAL in action From zero to one in a fe

modAL 1.9k Dec 31, 2022
A library of extension and helper modules for Python's data analysis and machine learning libraries.

Mlxtend (machine learning extensions) is a Python library of useful tools for the day-to-day data science tasks. Sebastian Raschka 2014-2021 Links Doc

Sebastian Raschka 4.2k Dec 29, 2022
Sequence learning toolkit for Python

seqlearn seqlearn is a sequence classification toolkit for Python. It is designed to extend scikit-learn and offer as similar as possible an API. Comp

Lars 653 Dec 27, 2022
Simple structured learning framework for python

PyStruct PyStruct aims at being an easy-to-use structured learning and prediction library. Currently it implements only max-margin methods and a perce

pystruct 666 Jan 3, 2023
Python implementation of the rulefit algorithm

RuleFit Implementation of a rule based prediction algorithm based on the rulefit algorithm from Friedman and Popescu (PDF) The algorithm can be used f

Christoph Molnar 326 Jan 2, 2023
Metric learning algorithms in Python

metric-learn: Metric Learning in Python metric-learn contains efficient Python implementations of several popular supervised and weakly-supervised met

null 1.3k Dec 28, 2022
[HELP REQUESTED] Generalized Additive Models in Python

pyGAM Generalized Additive Models in Python. Documentation Official pyGAM Documentation: Read the Docs Building interpretable models with Generalized

daniel servén 747 Jan 5, 2023
Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Karate Club is an unsupervised machine learning extension library for NetworkX. Please look at the Documentation, relevant Paper, Promo Video, and Ext

Benedek Rozemberczki 1.8k Jan 3, 2023
Open source time series library for Python

PyFlux PyFlux is an open source time series library for Python. The library has a good array of modern time series models, as well as a flexible array

Ross Taylor 2k Jan 2, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
MLBox is a powerful Automated Machine Learning python library.

MLBox is a powerful Automated Machine Learning python library. It provides the following features: Fast reading and distributed data preprocessing/cle

Axel 1.4k Jan 6, 2023