Approximate Nearest Neighbor Search for Sparse Data in Python!

Overview

PySparNN

PySparNN is a library for approximate nearest neighbor search on sparse data in Python. It is well suited to finding nearest neighbors in sparse, high-dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 - cosine_similarity).
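For reference, this distance can be computed directly with scikit-learn; a minimal sketch (the helper function below is ours, not part of PySparNN's API):

from sklearn.metrics.pairwise import cosine_similarity

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity; accepts scipy sparse inputs
    return 1 - cosine_similarity(a, b)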

PySparNN benefits:

  • Designed to be efficient on sparse data (memory & cpu).
  • Implemented leveraging existing python libraries (scipy & numpy).
  • Easily extended with other metrics: Manhattan, Euclidean, Jaccard, etc.
  • Supports incremental insertion of elements.

If your data is NOT SPARSE, please consider faiss or annoy. They use similar methods, and I am a big fan of both. You should expect better performance on dense vectors from both of those projects.

The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~4x faster on the 20newsgroups dataset (represented as sparse vectors). More robust benchmarking on sparse data is still needed. Here is the comparison. Here is another comparison on the larger Enron email dataset.

Example Usage

Simple Example

import pysparnn.cluster_index as ci

import numpy as np
from scipy.sparse import csr_matrix

# create 1000 random sparse binary feature vectors in a 20,000-dimensional space
features = np.random.binomial(1, 0.01, size=(1000, 20000))
features = csr_matrix(features)

# records to return from a search - here just the row indices 0..999
data_to_return = range(1000)

# build the search index!
cp = ci.MultiClusterIndex(features, data_to_return)

cp.search(features[:5], k=1, return_distance=False)
>> [[0], [1], [2], [3], [4]]
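The index constructor also accepts extra options; its __init__ signature includes distance_type, matrix_size, and num_indexes. A hedged sketch, building two independent indexes (presumably trading extra memory and build time for better recall):

# hedged sketch: num_indexes appears in MultiClusterIndex.__init__ and
# controls how many independent indexes are built
cp = ci.MultiClusterIndex(features, data_to_return, num_indexes=2)
cp.search(features[:5], k=1, return_distance=False)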

Text Example

import pysparnn.cluster_index as ci

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]    

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)

cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
>> [['oh hello there'], ['Play it again Sam']]
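PySparNN also supports incremental insertion of elements (see the feature list above). A minimal sketch, assuming the index's insert(feature, record) method and reusing tv and cp from the example above:

# hedged sketch of incremental insertion via MultiClusterIndex.insert
new_record = 'hello again world'
new_feature = tv.transform([new_record])
cp.insert(new_feature, new_record)

cp.search(tv.transform(['hello again']), k=1, return_distance=False)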

Requirements

PySparNN requires numpy and scipy. Tested with numpy 1.11.2 and scipy 0.18.1.

Installation

# clone pysparnn
cd pysparnn 
pip install -r requirements.txt 
python setup.py install

How PySparNN works

Searching for a document in a collection of D documents is naively O(D) (assuming documents are constant-sized).

However, we can create a tree structure where the first level contains O(sqrt(D)) candidates and each leaf also contains O(sqrt(D)) documents on average.

We randomly pick sqrt(D) candidate items to be in the top level. Then each document in the full list of D documents is assigned to the closest candidate in the top level.

This breaks one O(D) search into two O(sqrt(D)) searches, which is much faster when D is big!

This generalizes to h levels: the runtime becomes O(h * D^(1/h)).
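To make the idea concrete, here is an illustrative two-level sketch of cluster pruning (not PySparNN's actual implementation; build_index and search_index are hypothetical names, and cosine distance matches the default above):

import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_distances

def build_index(features, seed=0):
    # pick sqrt(D) random documents to act as top-level candidates
    rng = np.random.RandomState(seed)
    n_docs = features.shape[0]
    n_candidates = int(np.ceil(np.sqrt(n_docs)))
    candidate_ids = rng.choice(n_docs, n_candidates, replace=False)
    candidates = features[candidate_ids]
    # assign every document to its closest candidate (its leaf)
    assignment = cosine_distances(features, candidates).argmin(axis=1)
    leaves = [np.where(assignment == c)[0] for c in range(n_candidates)]
    return candidates, leaves

def search_index(index, features, query, k=1):
    candidates, leaves = index
    # first O(sqrt(D)) search: find the closest top-level candidate
    best_leaf = cosine_distances(query, candidates).argmin(axis=1)[0]
    # second O(sqrt(D)) search: scan only that candidate's leaf
    leaf_ids = leaves[best_leaf]
    dists = cosine_distances(query, features[leaf_ids]).ravel()
    return leaf_ids[np.argsort(dists)[:k]]

features = csr_matrix(np.random.binomial(1, 0.01, size=(1000, 20000)))
index = build_index(features)
search_index(index, features, features[:1], k=1)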

Further Information

http://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

See the CONTRIBUTING file for how to help out.

License

PySparNN is BSD-licensed. We also provide an additional patent grant.

Comments
  • Callable metric function for custom distances

    Callable metric function for custom distances

    Hi,

    PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.

    E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:

    metric : string or callable If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

    Do we have such a provision for custom distances in PySparNN? Thanks.

    opened by tejasshah93 15
  • Fewer than k nearest neighbors are found

    Fewer than k nearest neighbors are found

    I use the MultiClusterIndex class. With the search method, I only changed the parameter k to 10, but fewer than 10 nearest neighbors are found; only 80% of the examples returned 10 NNs. What can I adjust to make sure I get 10 NNs in all cases?

    Another question: in the examples, the doc_index covers the whole dataset. Shouldn't it cover only the training dataset?

    opened by wangyuran 9
  • Python3/Anaconda compatibility

    Python3/Anaconda compatibility

    I got it working for Anaconda3 by doing the following:

    In cluster_pruning.py:

    123c123
    < records_index = np.arange(features.shape[0])
    > records_index = list(np.arange(features.shape[0]))

    131c131
    < np.arange(clusters_selection.shape[0]))
    > list(np.arange(clusters_selection.shape[0])))

    223c223
    < if feature <> None and record <> None:
    > if feature != None and record != None:

    273a274
    > elements = list(elements)

    In matrix_distance.py:

    123c123,124
    < arg_index = np.random.choice(len(scores), k, replace=False)
    > lenScores = len(scores)
    > arg_index = np.random.choice(lenScores, min(lenScores, k), replace=False)

    329a331

    In __init__.py:

    7c7
    < from cluster_pruning import ClusterIndex, MultiClusterIndex
    > from .cluster_pruning import ClusterIndex, MultiClusterIndex

    I think you should just create two more files for ClusterIndex and MultiClusterIndex. Otherwise it will cause issues with importing in Python3 and backwards compatibility for Python2.

    opened by ontocord 7
  • Speedup pysparnn with CUDA

    Speedup pysparnn with CUDA

    Hi!

    I think this project is similar to the toolchain we are currently using in production: https://github.com/ekzhu/datasketch + https://github.com/src-d/minhashcuda. Do you think it is possible to leverage minhashcuda in pysparnn?

    enhancement help wanted question 
    opened by vmarkovtsev 6
  • AttributeError: 'matrix' object has no attribute 'toarray'

    AttributeError: 'matrix' object has no attribute 'toarray'

    Traceback (most recent call last):
      File "/Users/Bhoomit/work/robin/pysparnn/tests/test_pysparnn.py", line 52, in test_veccosine
        return_distance=False)
      File "pysparnn/cluster_pruning.py", line 274, in search
        k_clusters=k_clusters))
      File "pysparnn/cluster_pruning.py", line 208, in _search
        max_distance=max_distance)
      File "pysparnn/matrix_distance.py", line 81, in nearest_search
        dist_matrix = self._distance(sparse_features)
      File "pysparnn/matrix_distance.py", line 182, in _distance
        return 1 - dprod.multiply(magnitude).toarray()
    AttributeError: 'matrix' object has no attribute 'toarray'
    

    I had to use return 1 - np.asarray(dprod.multiply(magnitude))

    Am I doing something wrong? I'm using python 2.7 and virtualenv.

    enhancement 
    opened by bhoomit 4
  • Problem with large data set

    Problem with large data set

    When I try to build the search index just like in the example, I get a MemoryError. The data I use is wiki data, and I have ~50 GB of RAM available.

    MemoryError                               Traceback (most recent call last)
    <ipython-input-15-9815ea3148b7> in <module>()
          1 import pysparnn.cluster_index as ci
    ----> 2 cp = ci.MultiClusterIndex(tfidfdtmatrix, doctexts)
          3 k = 10
          4 nq = 1
    
    ~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, num_indexes)
        425         for _ in range(num_indexes):
        426             self.indexes.append((ClusterIndex(features, records_data,
    --> 427                                               distance_type, matrix_size)))
        428 
        429     def insert(self, feature, record):
    
    ~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, parent)
        121         else:
        122             self.is_terminal = False
    --> 123             records_data = _np.array(records_data)
        124 
        125             records_index = list(_np.arange(features.shape[0]))
    
    MemoryError: 
    
    >>> tfidfdtmatrix
    <310993x1225250 sparse matrix of type '<class 'numpy.float64'>'
    	with 34322030 stored elements in Compressed Sparse Row format>
    >>> print(type(doctexts))
    <class 'list'>
    >>> print(type(doctexts[0]))
    <class 'str'>
    
    opened by Fethbita 3
  • Large data sets cannot be used.

    Large data sets cannot be used.

    When I run the script with facebookresearch's fastText data, it cannot accommodate it, although the sample script seems to work well. The error occurs on this line of the example script: cp = snn.ClusterIndex(feat, data_to_return)

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-14-61d2fb78a7b1> in <module>()
    ----> 1 cp = snn.ClusterIndex(feat, data_to_return)
    
    /home/jung4351/pysparnn/pysparnn/cluster_pruning.py in __init__(self, sparse_features, records_data, distance_type, matrix_size, parent)
        144             records_index = np.arange(sparse_features.shape[0])
        145             clusters_size = min(self.matrix_size, num_records)
    --> 146             clusters_selection = random.sample(records_index, clusters_size)
        147             clusters_selection = sparse_features[clusters_selection]
        148
    
    /usr/lib/python3.4/random.py in sample(self, population, k)
        309             population = tuple(population)
        310         if not isinstance(population, _Sequence):
    --> 311             raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
        312         randbelow = self._randbelow
        313         n = len(population)
    
    TypeError: Population must be a sequence or set.  For dicts, use list(d).
    
    invalid 
    opened by younghj 2
  • Is there a way to get row number for the matched value in the index?

    Is there a way to get row number for the matched value in the index?

    The search function returns only the matched value and the similarity score. Is there a way to get the row number in the index where the match was found, for the top k results?

    opened by dmoham1476 1
  • Interpolation NN

    Interpolation NN

    Hello, I have spatial data (points in Cartesian x, y, z). I need to compute a new "z" value using the Natural Neighbour algorithm. Should I use PySparNN or a different library? I ask because I haven't been able to install annoy on Win7.

    opened by ainglot 1
  • Near misses for min_distance thresholds

    Near misses for min_distance thresholds

    From README:

    Note on min_distance thresholds - Each document is assigned to the closest candidate cluster. When we set min_distance, we will filter out clusters that don't meet that requirement without going into the individual clusters looking for matches. This means that we are likely to miss some good matches along the way, since we won't investigate clusters that just miss the cutoff. A (planned) patch for this behavior would be to also search clusters that 'just' miss this cutoff.

    enhancement help wanted 
    opened by spencebeecher 1
  • Duplicate Elements >> matrix_size cause infinite loop

    Duplicate Elements >> matrix_size cause infinite loop

    When you are dealing with a large number of duplicate items, the recursive nature of the algorithm goes into an infinite loop. Instead of the duplicates being distributed evenly at a level, they are all allocated to the first item in the matrix.

    The current fix is not very efficient:

    Line ~96 in nearest_search within matrix_distance.py

    if scores.sum() < 0.0001 and len(scores) > 0:
        # they are all practically the same
        # we have to do this to prevent infinite recursion
        # TODO: would love an alternative solution
        arg_index = np.random.choice(len(scores), k, replace=False)
    else:
        arg_index = np.argsort(scores)[:k]
    
    enhancement help wanted 
    opened by spencebeecher 0
  • Proposing a PR to fix a few small typos

    Proposing a PR to fix a few small typos

    Issue Type

    [x] Bug (Typo)

    Steps to Replicate and Expected Behaviour

    • Examine pysparnn/matrix_distance.py and observe implmentation, however expect to see implementation.
    • Examine pysparnn/cluster_index.py and observe reccomended, however expect to see recommended.
    • Examine pysparnn/matrix_distance.py and observe shortucts, however expect to see shortcuts.
    • Examine pysparnn/cluster_index.py and observe perameters, however expect to see parameters.
    • Examine pysparnn/cluster_index.py and observe naievely, however expect to see naively.
    • Examine pysparnn/cluster_index.py and observe multtiple, however expect to see multiple.
    • Examine pysparnn/cluster_index.py and observe doccuments, however expect to see documents.

    Notes

    Semi-automated issue generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

    To avoid wasting CI processing resources a branch with the fix has been prepared but a pull request has not yet been created. A pull request fixing the issue can be prepared from the link below, feel free to create it or request @timgates42 create the PR. Alternatively if the fix is undesired please close the issue with a small comment about the reasoning.

    https://github.com/timgates42/pysparnn/pull/new/bugfix_typos

    Thanks.

    opened by timgates42 0
  • Sorting based on `string` type. Problems when distance is a very small value ('5.55555E-6')

    Sorting based on `string` type. Problems when distance is a very small value ('5.55555E-6')

    We have observed some odd behavior: small distances appearing at the end of the list of closest matches.

    I believe the problem is that the distances are returned as string types, so the sorting does not order them numerically.

    https://github.com/facebookresearch/pysparnn/blob/c299c825fd99f263f3957e9b31197daf23a1e7a3/pysparnn/cluster_index.py#L26
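    For illustration (not from the original report), lexicographic string ordering puts very small distances last:

    sorted(['0.5', '5.55555E-6'])   # ['0.5', '5.55555E-6'] - the tiny distance sorts last
    sorted([0.5, 5.55555E-6])       # [5.55555e-06, 0.5]    - correct numeric order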

    opened by JoanFM 1
  • How do the sparse search results compare with a Lucene searcher?

    How do the sparse search results compare with a Lucene searcher?

    Thanks for providing this convenient toolkit. I retrieve BM25 and tf-idf sparse vectors from a Lucene indexer (provided by pyserini) and use this project to build a sparse index for search. I find that these indexes cannot beat the original Lucene search results. (This does not matter much on tiny or semantically diverse datasets, but as the dataset grows larger the shortcoming can no longer be ignored, which is exactly the situation where this project would be used.)

    This is not a problem with your clustering search algorithm but with the sparse features themselves. If I use SVD to reduce the dimensionality of the sparse data, it only retains topic-level features. So I don't understand the real use of sparse features beyond calculating search scores (like BM25), because they seem weaker than true lexicon-based scores (BM25) and dense semantic similarity based on BERT sentence embeddings (like Sentence-Transformers).

    Can you provide some reference material on constructing good sparse text features so that this project can be used in a suitable way?

    opened by svjack 0
  • What is the impact of using dense vectors?

    What is the impact of using dense vectors?

    From a very quick test with a small index, this seems to work well with dense vectors (I tried d=300), but is there any specific impact of using dense vectors on performance for building or searching the index?

    opened by johann-petrak 3
  • Memory leak when using pickle.load on a MultiClusterIndex object

    Memory leak when using pickle.load on a MultiClusterIndex object

    Python 3.6, Ubuntu, numpy 1.16.1, scipy 1.0.0

    cp = ci.MultiClusterIndex(features, data_to_return)
    with open('cluterIndex.bin', 'wb') as fh_out:
        pickle.dump(cp, fh_out)

    obj_file = 'cluterIndex.bin'

    def test_memory_leak(obj_file):
        while True:
            with open('cluterIndex.bin', 'rb') as file_:
                cp = pickle.load(file_)

    if __name__ == '__main__':
        test_memory_leak(obj_file)

    opened by liuchenbaidu 0
Owner: Meta Research