PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors.

Leland McInnes

Last update: Jan 9, 2023

Related tags

Overview

PyNNDescent

PyNNDescent is a Python nearest neighbor descent for approximate nearest neighbors. It provides a python implementation of Nearest Neighbor Descent for k-neighbor-graph construction and approximate nearest neighbor search, as per the paper:

Dong, Wei, Charikar Moses, and Kai Li. "Efficient k-nearest neighbor graph construction for generic similarity measures." Proceedings of the 20th international conference on World wide web. ACM, 2011.

This library supplements that approach with the use of random projection trees for initialisation. This can be particularly useful for the metrics that are amenable to such approaches (euclidean, minkowski, angular, cosine, etc.). Graph diversification is also performed, pruning the longest edges of any triangles in the graph.

Currently this library targets relatively high accuracy (80%-100% accuracy rate) approximate nearest neighbor searches.

Why use PyNNDescent?

PyNNDescent provides fast approximate nearest neighbor queries. The ann-benchmarks system puts it solidly in the mix of top performing ANN libraries:

SIFT-128 Euclidean

NYTimes-256 Angular

While PyNNDescent is among fastest ANN library, it is also both easy to install (pip and conda installable) with no platform or compilation issues, and is very flexible, supporting a wide variety of distance metrics by default:

Minkowski style metrics

euclidean
manhattan
chebyshev
minkowski

Miscellaneous spatial metrics

canberra
braycurtis
haversine

Normalized spatial metrics

mahalanobis
wminkowski
seuclidean

Angular and correlation metrics

cosine
dot
correlation
spearmanr
tsss
true_angular

Probability metrics

hellinger
wasserstein

Metrics for binary data

hamming
jaccard
dice
russelrao
kulsinski
rogerstanimoto
sokalmichener
sokalsneath
yule

and also custom user defined distance metrics while still retaining performance.

PyNNDescent also integrates well with Scikit-learn, including providing support for the KNeighborTransformer as a drop in replacement for algorithms that make use of nearest neighbor computations.

How to use PyNNDescent

PyNNDescent aims to have a very simple interface. It is similar to (but more limited than) KDTrees and BallTrees in sklearn. In practice there are only two operations -- index construction, and querying an index for nearest neighbors.

To build a new search index on some training data data you can do something like

from pynndescent import NNDescent
index = NNDescent(data)

You can then use the index for searching (and can pickle it to disk if you wish). To search a pynndescent index for the 15 nearest neighbors of a test data set query_data you can do something like

index.query(query_data, k=15)

and that is pretty much all there is to it. You can find more details in the documentation.

Installing

PyNNDescent is designed to be easy to install being a pure python module with relatively light requirements:

numpy
scipy
scikit-learn >= 0.22
numba >= 0.51

all of which should be pip or conda installable. The easiest way to install should be via conda:

conda install -c conda-forge pynndescent

or via pip:

pip install pynndescent

To manually install this package:

wget https://github.com/lmcinnes/pynndescent/archive/master.zip
unzip master.zip
rm master.zip
cd pynndescent-master
python setup.py install

Help and Support

This project is still young. The documentation is still growing. In the meantime please open an issue and I will try to provide any help and guidance that I can. Please also check the docstrings on the code, which provide some descriptions of the parameters.

License

The pynndescent package is 2-clause BSD licensed. Enjoy.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation are all equally valuable so please don't feel you can't contribute. To contribute please fork the project make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Comments

fixes for Numba 0.53.0dev0

It seems like previous Numba versions (up to 0.51.2) may have worked 'by accident'. When running pynndescent against a recent development version of Numba. I receievd many errors during test collection of the form:

E   numba.core.errors.LoweringError: Failed in nopython mode pipeline (step: nopython mode backend)
E   Storing i64 to ptr of i32 ('dim'). FE type uint32
E
E   File "../miniconda3/envs/umap/lib/python3.7/site-packages/pynndescent/utils.py", line 88:
E   def norm(vec):
E       <source elided>
E       result = 0.0
E       dim = vec.shape[0]
E       ^
E
E   During: lowering "dim = static_getitem(value=$8load_attr.2, index=0, index_var=$const10.3, fn=<built-in function getitem>)" at /Users/vhaenel/git/numba-integration-testing/miniconda3/envs/umap/lib/python3.7/site-packages/pynndescent/utils.py (88)

Not all errors are shown immediately as pytest fails on the first encounter, so they have to be fixed one-by-one.

opened by esc 13

backport UMAP 0.4dev changes
Relates to https://github.com/lmcinnes/pynndescent/issues/37

This isn't intended to be merged in its current state, this is just to conveniently show the current state of the diffs. The things to look at are:

numba decorations. Sometimes UMAP has more, sometimes it's vice versa. I don't know which ones are necessary or helpful.

In new_build_candidates, there is a test failure when using numba.prange, so I have commented it out and used a plain old range.

The biggest change is to the nearest neighbor descent routine, which introduces a set to keep track of previously seen pairs, and unchecked heap pushes. There are two test failures due to a much lower expected accuracy, unless I comment out the use of the unchecked heap push in the first loop in the nn descent iteration. Not tracked down the problem there yet.

In general for nn descent, is it safe to use a set to cache the lookups? It seems like it could grow to be O(N^2) in the worst case? I think something like this is mentioned in the nn descent paper.

In UMAP 0.4, the search initialization is handled slightly differently, in that each function is separate and dist and dist_args have to be provided to init_from_tree and its ilk directly, whereas in pynndescent, dist and dist_args are passed to a factory function and the initialization functions access them via closure. Which is the preferred way to do things?
opened by jlmelville 13
-1 entries in neighbor_graph
I get some -1 entries in the kNN graph when I build it on a sparse matrix with cosine distance. Is this intentional? Does -1 mean X.shape[0]-1? When I run query(X), I don't get any negative elements. E.g.:

X = scipy.sparse.csr_matrix(np.random.randn(10000,1247)>3) nn = pynndescent.NNDescent(X, metric='cosine', n_neighbors=15)

Now nn.neighbor_graph[0] has some values equal to -1 but nn.query(X, k=15)[0] does not.

Update: Forgot to say that I do get a warning "UserWarning: Failed to correctly find n_neighbors for some samples.Results may be less than ideal. Try re-running withdifferent parameters." from NNDescent. Maybe that's what -1 indicate? But then query() does not return any negative elements and does not complain. How should one approach this?
opened by dkobak 11
Parallelize deheap_sort

This is a very simple change: I added parallel=True to utils.deheap_sort and I used a numba.prange in the top loop.

Sometimes, when I'm watching CPU usage for NNDescent on very large arrays, I've seen it spend a fair amount of time in a single thread near the end of the operation. Looking at the code, I suspect it's just dealing with this final deheap_sort call.

Because the function operates on each row independently, it's pretty simple to wrap it in a parallel loop and let numba figure it out. It seems to work but I haven't tested rigorously yet. Hoping travis can do that for me.

~As an additional tweak, I added python 3.9 to the test matrix here, to see what happens.~ edit: obsolete given recent updates

opened by jamestwebber 10
Measuring minkowski with p= 0.3

Hi,

Thanks for such great package!

I am using version 0.5.2 with Python 3.6.5. I compared the results of using euclidean distance vs. minkowski distance with p= 0.3 in the code below: import numpy as np import pynndescent

data= np.random.randn(500,9)

euclidean= pynndescent.NNDescent(data, metric= 'euclidean', n_neighbors= 15, random_state= 1969, n_jobs= -1) euclidean.prepare() euc_neig, euc_dist= euclidean.query(data, k= 15)

minkowski= pynndescent.NNDescent(data, metric= 'minkowski', metric_kwds= {'p': 0.3}, n_neighbors= 15, random_state= 1969, n_jobs= -1) minkowski.prepare() min_neig, min_dist= minkowski.query(data, k= 15)

Then, I looked at the outputs for the item in the query: euc_neig[0,:] array([ 0, 150, 298, 308, 48, 498, 93, 277, 188, 184, 219, 146, 420, 309, 160]) euc_dist[0,:] array([0. , 2.3493888, 2.3878694, 2.5395243, 2.5565991, 2.5695498, 2.5878463, 2.6194322, 2.624215 , 2.6499615, 2.662645 , 2.6754963, 2.7257159, 2.7690816, 2.808471 ], dtype=float32)

min_neig[0,:] array([ 0, 107, 14, 287, 260, 348, 57, 464, 22, 319, 298, 3, 186, 15, 424]) min_dist[0,:] array([ 10.079369, 219.20055 , 223.33961 , 227.77594 , 230.01102 , 248.14621 , 260.6848 , 288.2422 , 292.64026 , 304.21317 , 306.87302 , 311.07306 , 322.3568 , 326.94174 , 331.8149 ], dtype=float32)

As expected, the self-measured distance (first item in the query) is 0 for euclidean. However, for minkowski, the self-measured distance is not 0.

My question is, why the minkowski distance does not return 0 for the first item?

Thanks, Ivan

opened by ivan-marroquin 10
Multi-threaded implementation

Currently, pynndescent doesn't take full advantage of multiple cores on a machine. There are some methods annotated with the Numba parallel option, but these don't provide any noticeable speed up when more cores are available.

This PR allows nearest neighbor computation to take advantage of multiple cores by processing heaps in parallel chunks. The basic idea is an application of the MapReduce approach described in section 2.8 of Efficient K-Nearest Neighbor Graph Construction for Generic Similarity Measures by Dong et al.

Note that in this PR all computation is carried out on a single machine using the standard Python concurrent.futures library, which complements Numba very well, as shown in this example.

Going into slightly more detail, a distributed heap can be updated in a MapReduce-style operation as follows.

We have a large array of all the heap updates held in memory and a thread pool to update chunks in parallel in two phases, called "Map" and "Reduce". Each Map partition holds an array of heap updates, corresponding to a row in the following diagram.

The Map phase adds updates in the order they were generated (from heap_push calls). After all the updates have been added for a partition, the updates for that partition are sorted by row number. Then the chunk boundaries corresponding to partitions are found, so that in the Reduce phase the array can process all the updates for a given partition together, and apply them to the target heap (not illustrated).

For the Map phase the array is processed row-wise, and in the Reduce phase it is processed column-wise (although the column bounds are different for each row owing to the different chunk boundaries). The result is an updated distributed heap.

This PR also adds tests that check the threaded implementation produces the same result as the regular implementation. Since NN-Descent uses a source of randomness, when running the tests the random seed is set to the value of each row index to ensure the same deterministic result is obtained. This per-row seeding is only used for tests. However, to make threaded runs more reproducible when run normally (outside of tests) each thread is seeded from the initial random seed, which means that the result from two threaded runs (with the same number of threads) is the same if the same initial seed is used in each case.

In terms of performance, the multi-threaded implementation outperforms the regular implementation with 4 cores or more. In one experiment, regular pynndescent took 27 minutes for a million rows (D=128, NN=25), compared to just over 3 minutes with 64 threads.

There is more work to consider how to wire up the implementation to the standard NNDescent object, but I wanted to get feedback on this approach before going any further.

opened by tomwhite 9
Added spearmanr

Addresses issue https://github.com/lmcinnes/umap/issues/319 and https://github.com/lmcinnes/pynndescent/issues/85

I know tests are missing, but I am not 100% sure how to add them the right way. Guidance welcome.

opened by sleighsoft 8
pynndescent throws error when n_jobs=-1

Hi, thanks for your work on this project! Unfortunately I seem to have run into a bug through another project that uses pynndescent as a dependency. The owner suggested that I open an issue here. See here for details: https://github.com/pavlin-policar/openTSNE/issues/92#issue-487746322

opened by jgraving 8

NNDescent.update(X) not working as expected

I wanted to do the following:

Create a NNDescent object from the first batch of data.
Query some data.
Wait for some time.
Update the object with new data, but not start from scratch.
Query some data.

I assumed that NNDescent.update(fresh data) would do the trick. However, maybe I am using it in a wrong way, since it seems that the returned neighbours are only those from the first batch of the data. While inspecting index = NNDescent(...) after the update in debug mode I discovered that

index._raw_data and index.neighbor_graph (and index._neighbor_graph) seem to be updated properly (the indices of the neighbours of all 205 instances follow the basic intuition - see below),
index._vertex_order was not updated properly (it still has the length of 100 and not 205).

The data (xs and ys) were generated, so that they look like this:

xs:
0 r r r r
1 r r r r
2 r r r r
...
99 r r r r

ys:
100 r r r r
101 r r r r
...
204 r r r r

where r is an always different random value (from (0, 1)). Thus, the nearest neighbours of the instance i (0 <= i <= 204) should typically be i, i - 1, i + 1, i - 2, i + 2, ... (possibly not in this exact order). However, the code below says that the neighbours of the instances i >= 100 (those from ys) are 99, 98, 97 ... which suggests that the neighbours were found only among xs.

The actual output of the code below (indices/distances):

[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
[98, 99, 97, 96, 95, 94, 93, 92, 91, 89]
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
... (some lines omitted)
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
[99, 98, 97, 96, 95, 94, 93, 92, 91, 90]
 
[2.1192119121551514, 2.8882203102111816, 4.930568218231201, 5.635573387145996, 6.850753307342529, 7.7378249168396, 8.205963134765625, 9.593037605285645, 10.798347473144531, 11.230871200561523]
[4.201259613037109, 4.454722881317139, 5.768087387084961, 6.7560529708862305, 7.452332973480225, 8.632887840270996, 9.428315162658691, 9.682640075683594, 11.393900871276855, 13.470145225524902]
[4.810489654541016, 5.09433650970459, 6.398769378662109, 7.656699180603027, 8.318954467773438, 9.736209869384766, 9.871195793151855, 11.387312889099121, 12.959441184997559, 13.922148704528809]
... (some lines omitted)
[105.62753295898438, 106.07768249511719, 107.53001403808594, 108.47374725341797, 109.60338592529297, 110.31566619873047, 110.80841064453125, 112.75614166259766, 114.3282699584961, 114.73919677734375]
[106.41008758544922, 106.78266906738281, 107.53002166748047, 108.84111022949219, 109.72077178955078, 111.25269317626953, 111.68487548828125, 113.38243103027344, 114.24507141113281, 115.30052947998047]

The code:

import numpy as np
import pynndescent
import pickle
import os
import numba

@numba.njit()
def my_manhattan(x, y):
    return np.sum(np.abs(x - y))

def test3():
    n_x = 100
    xs = np.random.rand(n_x, 5)
    xs[:, 0] = np.arange(0, n_x, 1)
    n_y = 105
    ys = np.random.rand(n_y, 5)
    ys[:, 0] = np.arange(n_x, n_x + n_y, 1)
    index = pynndescent.NNDescent(xs, metric=my_manhattan, n_neighbors=5)
    # index.prepare()
    k = 10
    ns_x, ds_x = index.query(xs, k=k)
    index.update(ys)
    # index._init_search_graph()
    ns_y, ds_y = index.query(ys, k=k)
    for n_y in ns_y:
        print(n_y.tolist())
    print(" ")
    for d_y in ds_y:
        print(d_y.tolist())


test3()

I installed pynndescent via pip.

bug

opened by Petkomat 7

Cannot query an index constructed from sparse matrix

I have a sparse matrix X on which I can successfully run UMAP:

<100000x9630 sparse matrix of type '<class 'numpy.float64'>'
	with 266398 stored elements in List of Lists format>

I can construct an index like this, and it works fine:

nn = NNDescent(X, metric='cosine')

However, querying this index does not seem to work. nn.query(X[:5,:], k=15) returns the following error:

~/anaconda3/lib/python3.7/site-packages/pynndescent/pynndescent_.py in query(self, query_data, k, epsilon)
   1263 
   1264         if self._distance_correction is not None:
-> 1265             dists = self._distance_correction(dists)
   1266 
   1267         return indices, dists

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_args(self, *args, **kws)
    164                     argty = argty.dtype
    165                 argtys.append(argty)
--> 166         return self._compile_for_argtys(tuple(argtys))
    167 
    168     def _compile_for_argtys(self, argtys, return_type=None):

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/dufunc.py in _compile_for_argtys(self, argtys, return_type)
    184             self._dispatcher, self.targetoptions, sig)
    185         actual_sig = ufuncbuilder._finalize_ufunc_signature(
--> 186             cres, argtys, return_type)
    187         dtypenums, ptr, env = ufuncbuilder._build_element_wise_ufunc_wrapper(
    188             cres, actual_sig)

~/anaconda3/lib/python3.7/site-packages/numba/npyufunc/ufuncbuilder.py in _finalize_ufunc_signature(cres, args, return_type)
    140         if cres.objectmode:
    141             # Object mode is used and return type is not specified
--> 142             raise TypeError("return type must be specified for object mode")
    143         else:
    144             return_type = cres.signature.return_type

TypeError: return type must be specified for object mode

Versions:

UMAP 0.3.10
pynndescent 0.4.5
numba 0.46.0

opened by dkobak 7

Error when trying to construct an index with NNDescent

Hello,

When I try to construct a NNDescent (pynndescent v0.4.3) index using correlation or euclidean distance I get a numba error (Numba version 0.47.0).

Here is the code I used that triggers the error and X is a NumPy float64 array with shape (1624, 2000).:

from pynndescent import NNDescent

def _KNN(X, random_state, n_neighbors = 30, metric='correlation'):
    index = NNDescent(X, n_neighbors=n_neighbors, metric=metric, random_state=random_state)
    knn_indices, knn_dists = index.query(X, n_neighbors=n_neighbors)

Here is the error I get:

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_epsilon, n_search_trees, tree_init, random_state, algorithm, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, seed_per_row, verbose)
    977                     leaf_array=leaf_array,
    978                     verbose=verbose,
--> 979                     seed_per_row=seed_per_row,
    980                 )
    981         else:

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in _compile_for_args(self, *args, **kws)
    399                 e.patch_message(msg)
    400 
--> 401             error_rewrite(e, 'typing')
    402         except errors.UnsupportedError as e:
    403             # Something unsupported is present in the user code, add help info

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/dispatcher.py in error_rewrite(e, issue_type)
    342                 raise e
    343             else:
--> 344                 reraise(type(e), e, None)
    345 
    346         argtypes = []

~/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/numba/six.py in reraise(tp, value, tb)
    666             value = tp()
    667         if value.__traceback__ is not tb:
--> 668             raise value.with_traceback(tb)
    669         raise value
    670 

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Failed in nopython mode pipeline (step: nopython frontend)
Internal error at <numba.typeinfer.CallConstraint object at 0x122cd8eb8>.
Failed in nopython mode pipeline (step: nopython mode backend)
scalar type Tuple() given for non scalar argument #5

File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 232:
def generate_leaf_updates(leaf_block, dist_thresholds, data, dist, dist_args):
    <source elided>

    for n in numba.prange(leaf_block.shape[0]):
    ^

[1] During: lowering "id=9[LoopNest(index_variable = parfor_index.1714, range = (0, $28.6, 1))]{186: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (241)>, 136: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 138: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 88: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)>, 718: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 174: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 720: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 208: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (243)>, 210: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (244)>, 112: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 212: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 114: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (236)>, 214: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (238)>, 216: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>, 90: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (233)>, 219: <ir.Block at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (245)>}Var(parfor_index.1714, pynndescent_.py:232)" at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (232)
[2] During: resolving callee type: type(CPUDispatcher(<function generate_leaf_updates at 0x11d87c950>))
[3] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (265)

Enable logging at debug level for details.

File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 265:
def init_rp_tree(data, dist, dist_args, current_graph, leaf_array):
    <source elided>
        updates = generate_leaf_updates(
            leaf_block, dist_thresholds, data, dist, dist_args
            ^

[1] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
[2] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)

[3] During: resolving callee type: type(CPUDispatcher(<function init_rp_tree at 0x11d87cbf8>))
[4] During: typing of call at /Users/gww/.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py (464)


File "../../../.pyenv/versions/3.6.6/Python.framework/Versions/3.6/lib/python3.6/site-packages/pynndescent/pynndescent_.py", line 464:
def nn_descent(
    <source elided>
    if rp_tree_init:
        init_rp_tree(data, dist, dist_args, current_graph, leaf_array)
        ^

This is not usually a problem with Numba itself but instead often caused by
the use of unsupported features or an issue in resolving types.

To see Python/NumPy features supported by the latest release of Numba visit:
http://numba.pydata.org/numba-doc/latest/reference/pysupported.html
and
http://numba.pydata.org/numba-doc/latest/reference/numpysupported.html

For more information about typing errors and how to debug them visit:
http://numba.pydata.org/numba-doc/latest/user/troubleshoot.html#my-code-doesn-t-compile

If you think your code should work with Numba, please report the error message
and traceback, along with a minimal reproducer at:
https://github.com/numba/numba/issues/new

opened by GWW 7

Fatal Python error: Segmentation faultFatal Python error: Segmentation fault

I am running the Top2Vec code on MacOS 12.6 with Python 3.8, tensorflow-macos and tensorflow-metal. I'm getting a segmentation exception in nn_descent: Screen Shot 2022-09-27 at 2 12 54 PM

Here's the packages:

 % pip show pynndescent   
Name: pynndescent
Version: 0.5.7
Summary: Nearest Neighbor Descent
Home-page: http://github.com/lmcinnes/pynndescent
Author: Leland McInnes
Author-email: [email protected]
License: BSD
Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
Requires: joblib, llvmlite, numba, scikit-learn, scipy
Required-by: umap-learn
(tensorflow) davidlaxer@x86_64-apple-darwin13 Top2Vec % pip show umap-learn
Name: umap-learn
Version: 0.5.3
Summary: Uniform Manifold Approximation and Projection
Home-page: http://github.com/lmcinnes/umap
Author: 
Author-email: 
License: BSD
Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
Requires: numba, numpy, pynndescent, scikit-learn, scipy, tqdm
Required-by: tensorflow-similarity, top2vec

 % pip show numba
Name: numba
Version: 0.56.2
Summary: compiling Python code using LLVM
Home-page: https://numba.pydata.org
Author: 
Author-email: 
License: BSD
Location: /Users/davidlaxer/tensorflow/lib/python3.8/site-packages
Requires: importlib-metadata, llvmlite, numpy, setuptools
Required-by: datashader, pynndescent, umap-learn

Here's the stack before the exception:

knn_search_index = NNDescent(
    X,
    n_neighbors=n_neighbors,
    metric=metric,
    metric_kwds=metric_kwds,
    random_state=random_state,
    n_trees=n_trees,
    n_iters=n_iters,
    max_candidates=60,
    low_memory=low_memory,
    n_jobs=n_jobs,
    verbose=verbose,
    compressed=False,
)

_deepcopy_atomic, copy.py:183
deepcopy, copy.py:146
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_reconstruct, copy.py:270
deepcopy, copy.py:172
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_reconstruct, copy.py:270
deepcopy, copy.py:172
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_reconstruct, copy.py:270
deepcopy, copy.py:172
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_deepcopy_dict, copy.py:230
deepcopy, copy.py:146
_reconstruct, copy.py:270
deepcopy, copy.py:172
slice_size, array_analysis.py:1817
to_shape, array_analysis.py:2074
_index_to_shape, array_analysis.py:2090
_analyze_op_getitem, array_analysis.py:2139
guard, ir_utils.py:1527
_analyze_expr, array_analysis.py:1571
_analyze_inst, array_analysis.py:1311
_determine_transform, array_analysis.py:1273
_run_on_blocks, array_analysis.py:1198
run, array_analysis.py:1178
_pre_run, parfor.py:2859
run, parfor.py:2867
run_pass, typed_passes.py:306
check, compiler_machinery.py:273
_runPass, compiler_machinery.py:311
_acquire_compile_lock, compiler_lock.py:35
run, compiler_machinery.py:356
_compile_core, compiler.py:486
_compile_ir, compiler.py:527
compile_ir, compiler.py:462
compile_ir, compiler.py:779
_create_gufunc_for_parfor_body, parfor_lowering.py:1509
_lower_parfor_parallel, parfor_lowering.py:316
lower_inst, lowering.py:567
lower_block, lowering.py:265
lower_function_body, lowering.py:251
lower_normal_function, lowering.py:222
lower, lowering.py:168
run_pass, typed_passes.py:394
check, compiler_machinery.py:273
_runPass, compiler_machinery.py:311
_acquire_compile_lock, compiler_lock.py:35
run, compiler_machinery.py:356
_compile_core, compiler.py:486
_compile_bytecode, compiler.py:520
compile_extra, compiler.py:452
compile_extra, compiler.py:716
_compile_core, dispatcher.py:152
_compile_cached, dispatcher.py:139
compile, dispatcher.py:125
compile, dispatcher.py:965
get_call_template, dispatcher.py:363
get_call_type, functions.py:540
_resolve_user_function_type, context.py:248
resolve_function_type, context.py:196
resolve_call, typeinfer.py:1555
resolve, typeinfer.py:601
__call__, typeinfer.py:578
propagate, typeinfer.py:155
propagate, typeinfer.py:1078
type_inference_stage, typed_passes.py:83
run_pass, typed_passes.py:105
check, compiler_machinery.py:273
_runPass, compiler_machinery.py:311
_acquire_compile_lock, compiler_lock.py:35
run, compiler_machinery.py:356
_compile_core, compiler.py:486
_compile_bytecode, compiler.py:520
compile_extra, compiler.py:452
compile_extra, compiler.py:716
_compile_core, dispatcher.py:152
_compile_cached, dispatcher.py:139
compile, dispatcher.py:125
compile, dispatcher.py:965
_compile_for_args, dispatcher.py:420
__init__, pynndescent_.py:891
nearest_neighbors, umap_.py:328
fit, umap_.py:2516
__init__, Top2Vec.py:669
<module>, test1.py:26

Verbose output:

...
UMAP(n_neighbors=10, verbose=True)
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Wed Sep 28 07:46:37 2022 Construct fuzzy simplicial set
Wed Sep 28 07:46:37 2022 Finding Nearest Neighbors
Wed Sep 28 07:47:11 2022 Building RP forest with 10 trees
Wed Sep 28 07:47:12 2022 NN descent for 14 iterations
Fatal Python error: Segmentation faultFatal Python error: Segmentation fault

Thread 0x0000700011629000 (most recent call first):
  File "/Users/davidlaxer/anaconda3/libFatal Python error: 

/Segmentation faultp

ython3.8/threading.py", line 306 in wait
  File "/Users/davidlaxer/anaconda3/lib/python3.8/threading.py", line 558 in wa
Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

Any clues as to what's causing the exception?

opened by dbl001 1

self implemented distance metric

Dear Pynndescent team,

I am wondering whether there could be an option for this library to provide an API to allow external distance metric so that some unique distance metric can be used., e.g., for biological applications, distance computation was actually based on MinHash of genomic profiles, none of the provided metric is fast in this case.

Thanks,

Jianshu

opened by jianshu93 3

Importing pickled index gives error - 'NNDescent' object has no attribute 'shape'

Hi Using the latest version via pip install. Running this on Google Colab with python 3.7.13 Followed the Docs and created an index with the params pynnindex = pynndescent.NNDescent(arr, metric="cosine", n_neighbors=100) Everything works fine and I get results from pynnindex.neighbor_graph as expected.

Then I pickled the index like so:

with open('pynnindex','wb') as f:
  pickle.dump(pynnindex,f)

The trouble starts when I try to load the pickled index later on, like so:

with open('pynnindex','rb') as f:
  msgembed = pickle.load(f)

I get the following error:

AttributeError                            Traceback (most recent call last)
[<ipython-input-7-44d19f2c98cc>](https://localhost:8080/#) in <module>
      3   msgembed = pickle.load(f)
      4 
----> 5 pynnindex_p = pynndescent.NNDescent(msgembed)

[/usr/local/lib/python3.7/dist-packages/pynndescent/pynndescent_.py](https://localhost:8080/#) in __init__(self, data, metric, metric_kwds, n_neighbors, n_trees, leaf_size, pruning_degree_multiplier, diversify_prob, n_search_trees, tree_init, init_graph, random_state, low_memory, max_candidates, n_iters, delta, n_jobs, compressed, parallel_batch_queries, verbose)
    671 
    672         if n_trees is None:
--> 673             n_trees = 5 + int(round((data.shape[0]) ** 0.25))
    674             n_trees = min(32, n_trees)  # Only so many trees are useful
    675         if n_iters is None:

AttributeError: 'NNDescent' object has no attribute 'shape'

The error crops up even after running prepare() on the index. Tried loading the index on another system (with python 3.9.0) and got a different error:

Traceback (most recent call last):
  File "D:\Code\markupf\pynn.py", line 6, in <module>
    msgembeds = pickle.load(f)
  File "D:\Code\gnanetra\gnanetra\lib\site-packages\numba\core\serialize.py", line 97, in _unpickle__CustomPickled
    ctor, states = loads(serialized)
TypeError: an integer is required (got type bytes)

Essentially, to use the index, I need to build it every time. Can anyone point me towards a resolution.

Thanks

opened by regstuff 0

When distance computation is expensive how to gradually build graph

Hello pynndescent team,

I am in a situation where, distance computation is very expensive (genomic sequence), all versus all distance as input is not possible (it takes forever for all versus all distance computation). I am wondering how to compute distance only when necessary in NNDescent() building, that is k graph is build gradually without the need to compute all versus all in the dataset but only those are needed, e.g., 2 datapoint that are far away from each other in k graph, we do not need their distance. Any idea how, original data as input and we have our own distance function.

Thanks,

Jianshu

opened by jianshu93 3
Can't pass parameters to weighted_minkowski distance metric

When I run this code, index = NNdescent(data = data, metric = 'weighted_minkowski', metric_kwds={'p':1} ) I get an error of no match with getitem. How can I pass this?

Also, I need to pass an array of weights to weight the metric. How can I pass those? By convetring to a dict?

Thanks, David

opened by Stevod 3
Distributed PyNNDescent

I'm doing a research about distributed ways do build the kNNG, and i am especifically studying NNDescent. When i tested the distributed-pynndescent, that is in the fork, i couldn't perform it. Is this algorithm ready or it is on development?

opened by Gab0410 1

Releases(0.5.8)

0.5.8(Oct 31, 2022)
What's Changed

Fixing the bug with n_trees introduced when fixing #180 by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/183

Accepting rp trees with no splits by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/185

[minor] clean up some strings by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/171

[small fix] add spaces to warning message by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/145

Only one update method by @Petkomat in https://github.com/lmcinnes/pynndescent/pull/189

Replace pkg_resources with importlib.metadata by @konstin in https://github.com/lmcinnes/pynndescent/pull/137

Increase test_spearmanr tolerance by @ginggs in https://github.com/lmcinnes/pynndescent/pull/199

Fix typos in docu by @MalteIwanicki in https://github.com/lmcinnes/pynndescent/pull/198

New Contributors

@Petkomat made their first contribution in https://github.com/lmcinnes/pynndescent/pull/183

@konstin made their first contribution in https://github.com/lmcinnes/pynndescent/pull/137

@ginggs made their first contribution in https://github.com/lmcinnes/pynndescent/pull/199

@MalteIwanicki made their first contribution in https://github.com/lmcinnes/pynndescent/pull/198

Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.7...0.5.8
Source code(tar.gz)
Source code(zip)
0.5.7(May 14, 2022)
What's Changed

removed old CI files by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/169

fix wminkowski implementation, and test for scipy >= 1.8.0 by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/170

Parallelize deheap_sort by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/168

[bug] was overwriting input indices in graph utils function by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/172

fix: remove cache=True from numba functions that implement parallel=True by @rhysnewell in https://github.com/lmcinnes/pynndescent/pull/175

Ensure query after to update works correctly (fixes #180) by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/181

Ensure that tree_init=False works including for querying by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/182

New Contributors

@rhysnewell made their first contribution in https://github.com/lmcinnes/pynndescent/pull/175

Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.6...0.5.7
Source code(tar.gz)
Source code(zip)
0.5.6(Jan 21, 2022)
What's Changed

Fix "'NNDescent' object has no attribute '_vertex_order'" crash in NNDescent.update by @ThomasNickerson in https://github.com/lmcinnes/pynndescent/pull/147

1d wasserstein by @cjweir in https://github.com/lmcinnes/pynndescent/pull/155

fix docstring for NNDescent by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/160

Optional parallel query batch by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/162

Update sparse distances by @lmcinnes in https://github.com/lmcinnes/pynndescent/pull/165

New Contributors

@ThomasNickerson made their first contribution in https://github.com/lmcinnes/pynndescent/pull/147

Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.5...0.5.6
Source code(tar.gz)
Source code(zip)
0.5.5(Oct 15, 2021)
What's Changed

[bug fix] Always used checked_flagged_heap_push to prevent redundant edges by @jamestwebber in https://github.com/lmcinnes/pynndescent/pull/138

Add deprecation notice to n_search_trees parameter docstring by @jlmelville in https://github.com/lmcinnes/pynndescent/pull/143

New Contributors

@jamestwebber made their first contribution in https://github.com/lmcinnes/pynndescent/pull/138

Full Changelog: https://github.com/lmcinnes/pynndescent/compare/0.5.4...0.5.5
Source code(tar.gz)
Source code(zip)
0.5.4(Jul 6, 2021)

Try to fix threading state issues
Source code(tar.gz)
Source code(zip)
0.5.3(Jul 5, 2021)

Improve import speed with aggressive caching, fix a number of minor bugs.
Source code(tar.gz)
Source code(zip)
0.5.2(Feb 8, 2021)

A new patch release to fix potential issues with numba 0.53
Source code(tar.gz)
Source code(zip)
0.5.1(Feb 4, 2021)

Source code(tar.gz)
Source code(zip)
0.5rc2(Sep 6, 2020)

Source code(tar.gz)
Source code(zip)
0.5rc1(Sep 5, 2020)

Source code(tar.gz)
Source code(zip)
0.4.8(Jun 29, 2020)

Fixes an issue with Scipy.sparse.lil_matrix handling from the most recent scipy release
Source code(tar.gz)
Source code(zip)
0.4.7(Mar 14, 2020)

Fix for sparse threaded cases.
Source code(tar.gz)
Source code(zip)
0.4.6(Mar 13, 2020)

Fixes a bug in sparse angular metrics.
Source code(tar.gz)
Source code(zip)
0.4.5(Feb 8, 2020)

Source code(tar.gz)
Source code(zip)
0.4.4(Jan 12, 2020)

Source code(tar.gz)
Source code(zip)
0.4.3(Jan 5, 2020)

Source code(tar.gz)
Source code(zip)
0.4.2(Dec 13, 2019)

Exclude cached files
Source code(tar.gz)
Source code(zip)
0.4.1(Dec 13, 2019)

Source code(tar.gz)
Source code(zip)
v0.3.3(Sep 9, 2019)

Source code(tar.gz)
Source code(zip)
0.3.2(Sep 7, 2019)

Source code(tar.gz)
Source code(zip)