Approximate Nearest Neighbor Search for Sparse Data in Python!

Meta Research

Last update: Jan 1, 2023

Related tags

Data Analysis pysparnn

Overview

PySparNN

Approximate Nearest Neighbor Search for Sparse Data in Python! This library is well suited to finding nearest neighbors in sparse, high dimensional spaces (like text documents).

Out of the box, PySparNN supports Cosine Distance (i.e. 1 - cosine_similarity).

PySparNN benefits:

Designed to be efficient on sparse data (memory & cpu).
Implemented leveraging existing python libraries (scipy & numpy).
Easily extended with other metrics: Manhattan, Euclidian, Jaccard, etc.
Supports incremental insertion of elements.

If your data is NOT SPARSE - please consider faiss or annoy. They use similar methods and I am a big fan of both. You should expect better performance on dense vectors from both of those projects.

The most comparable library to PySparNN is scikit-learn's LSHForest module. As of this writing, PySparNN is ~4x faster on the 20newsgroups dataset (as a sparse vector). A more robust benchmarking on sparse data is desired. Here is the comparison. Here is another comparison on the larger Enron email dataset.

Example Usage

Simple Example

import pysparnn.cluster_index as ci

import numpy as np
from scipy.sparse import csr_matrix

features = np.random.binomial(1, 0.01, size=(1000, 20000))
features = csr_matrix(features)

# build the search index!
data_to_return = range(1000)
cp = ci.MultiClusterIndex(features, data_to_return)

cp.search(features[:5], k=1, return_distance=False)
>> [[0], [1], [2], [3], [4]]

Text Example

import pysparnn.cluster_index as ci

from sklearn.feature_extraction.text import TfidfVectorizer

data = [
    'hello world',
    'oh hello there',
    'Play it',
    'Play it again Sam',
]    

tv = TfidfVectorizer()
tv.fit(data)

features_vec = tv.transform(data)

# build the search index!
cp = ci.MultiClusterIndex(features_vec, data)

# search the index with a sparse matrix
search_data = [
    'oh there',
    'Play it again Frank'
]

search_features_vec = tv.transform(search_data)

cp.search(search_features_vec, k=1, k_clusters=2, return_distance=False)
>> [['oh hello there'], ['Play it again Sam']]

Requirements

PySparNN requires numpy and scipy. Tested with numpy 1.11.2 and scipy 0.18.1.

Installation

# clone pysparnn
cd pysparnn 
pip install -r requirements.txt 
python setup.py install

How PySparNN works

Searching for a document in an collection of D documents is naively O(D) (assuming documents are constant sized).

However! we can create a tree structure where the first level is O(sqrt(D)) and each of the leaves are also O(sqrt(D)) - on average.

We randomly pick sqrt(D) candidate items to be in the top level. Then -- each document in the full list of D documents is assigned to the closest candidate in the top level.

This breaks up one O(D) search into two O(sqrt(D)) searches which is much much faster when D is big!

This generalizes to h levels. The runtime becomes: O(h * h_root(D))

Further Information

http://nlp.stanford.edu/IR-book/html/htmledition/cluster-pruning-1.html

See the CONTRIBUTING file for how to help out.

License

PySparNN is BSD-licensed. We also provide an additional patent grant.

Comments

Callable metric function for custom distances

Hi,

PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.

E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:

metric : string or callable If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

Do we have such a provision for custom distances in PySparNN? Thanks.

opened by tejasshah93 15
Fewer than k nearest neighbors are found

I use the MultiClusterIndex class. With the search method, I only changed the parameter k to 10, but fewer than 10 nearest neighbors are found, only 80% of the examples returned 10 NNs. What I can adjust to make sure that I get 10 NNs for all the cases?

Another question is in the examples, the doc_index are for the whole dataset. Isn’t it should be only for the training dataset?

opened by wangyuran 9
Python3/Anaconda compatability
I got it working for Anaconda3 by doing the following:

In cluster_pruning.py

123c123 < records_index = np.arange(features.shape[0])

records_index = list(np.arange(features.shape[0]))

131c131 < np.arange(clusters_selection.shape[0]))

list(np.arange(clusters_selection.shape[0])))

223c223 < if feature <> None and record <> None:

if feature != None and record != None:

273a274

elements = list(elements)

In matrix_distance.py:

123c123,124 < arg_index = np.random.choice(len(scores), k, replace=False)

lenScores = len(scores) arg_index = np.random.choice(lenScores, min(lenScores, k), replace=False)

329a331

In init.py:

7c7 < from cluster_pruning import ClusterIndex, MultiClusterIndex

from .cluster_pruning import ClusterIndex, MultiClusterIndex

I you should just create two more files for ClusterIndex and MultiClusterIndex. Otherwise it will cause issues with importing in Python3 and backwards compatability for Python2
opened by ontocord 7
Speedup pysparnn with CUDA

Hi!

I think this project is similar to the toolchain we are currently using in our production - https://github.com/ekzhu/datasketch + https://github.com/src-d/minhashcuda Do you think it is possible to leverage minhashcuda in pysparnn?
enhancement help wanted question

opened by vmarkovtsev 6

AttributeError: 'matrix' object has no attribute 'toarray'

Traceback (most recent call last):
  File "/Users/Bhoomit/work/robin/pysparnn/tests/test_pysparnn.py", line 52, in test_veccosine
    return_distance=False)
  File "pysparnn/cluster_pruning.py", line 274, in search
    k_clusters=k_clusters))
  File "pysparnn/cluster_pruning.py", line 208, in _search
    max_distance=max_distance)
  File "pysparnn/matrix_distance.py", line 81, in nearest_search
    dist_matrix = self._distance(sparse_features)
  File "pysparnn/matrix_distance.py", line 182, in _distance
    return 1 - dprod.multiply(magnitude).toarray()
AttributeError: 'matrix' object has no attribute 'toarray'

I had to use return 1 - np.asarray(dprod.multiply(magnitude))

Am I doing something wrong? I'm using python 2.7 and virtualenv.

enhancement

opened by bhoomit 4

Problem with large data set

When I try to build the search index just like in the example I get a Memory Error:. The data I use is wiki data, I have ~50 gb ram than can be used.

MemoryError                               Traceback (most recent call last)
<ipython-input-15-9815ea3148b7> in <module>()
      1 import pysparnn.cluster_index as ci
----> 2 cp = ci.MultiClusterIndex(tfidfdtmatrix, doctexts)
      3 k = 10
      4 nq = 1

~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, num_indexes)
    425         for _ in range(num_indexes):
    426             self.indexes.append((ClusterIndex(features, records_data,
--> 427                                               distance_type, matrix_size)))
    428 
    429     def insert(self, feature, record):

~/miniconda3/lib/python3.6/site-packages/pysparnn/cluster_index.py in __init__(self, features, records_data, distance_type, matrix_size, parent)
    121         else:
    122             self.is_terminal = False
--> 123             records_data = _np.array(records_data)
    124 
    125             records_index = list(_np.arange(features.shape[0]))

MemoryError:

>>> tfidfdtmatrix
<310993x1225250 sparse matrix of type '<class 'numpy.float64'>'
	with 34322030 stored elements in Compressed Sparse Row format>
>>> print(type(doctexts))
<class 'list'>
>>> print(type(doctexts[0]))
<class 'str'>

opened by Fethbita 3

Large data sets cannot be used.

When I run the script with facebookresearch's fasttext, it cannot acommodate it. But the sample script seems to work well. On this line of the example script: cp = snn.ClusterIndex(feat, data_to_return)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-61d2fb78a7b1> in <module>()
----> 1 cp = snn.ClusterIndex(feat, data_to_return)

/home/jung4351/pysparnn/pysparnn/cluster_pruning.py in __init__(self, sparse_features, records_data, distance_type, matrix_size, parent)
    144             records_index = np.arange(sparse_features.shape[0])
    145             clusters_size = min(self.matrix_size, num_records)
--> 146             clusters_selection = random.sample(records_index, clusters_size)
    147             clusters_selection = sparse_features[clusters_selection]
    148

/usr/lib/python3.4/random.py in sample(self, population, k)
    309             population = tuple(population)
    310         if not isinstance(population, _Sequence):
--> 311             raise TypeError("Population must be a sequence or set.  For dicts, use list(d).")
    312         randbelow = self._randbelow
    313         n = len(population)

TypeError: Population must be a sequence or set.  For dicts, use list(d).

invalid

opened by younghj 2

Is there a way to get row number for the matched value in the index?

The search function, returns only matched value and the similarity score. Is there a way to get which row# in the index the match was found for the top k results?

opened by dmoham1476 1
Interpolation NN

Hello, I have spatial data (points in Cartesian x, y, z.). I need to designate new value “z” using algorithm Natural Neighbour. Should I use PySparNN or different library? I ask because I haven’t been able to install annoy in win7.

opened by ainglot 1
Near misses for min_distance thresholds

From README:

Note on min_distance thresholds - Each document is assigned to the closest candidate cluster. When we set min_distance we will filter out clusters that dont meet that requirement without going into the individual clusters looking for matches. This means that we are likely to miss some good matches along the way since we wont investigate clusters that just miss the cutoff. A (planned) patch for this behavior would be to also search clusters that 'just' miss this cutoff.

enhancement help wanted

opened by spencebeecher 1
Duplicate Elements >> matrix_size cause infinite loop
When you are dealing with a large number of duplicate items the recursive nature of the alg goes into an infinite loop. Instead of getting the duplicates distributed evenly at a level they are all allocated to the first item in the matrix.

The current fix is not very efficient:

Line ~96 in nearest_search within matrix_distance.py

if scores.sum() < 0.0001 and len(scores) > 0: # they are all practically the same # we have to do this to prevent infinite recursion # TODO: would love an alternative solution arg_index = np.random.choice(len(scores), k, replace=False) else: arg_index = np.argsort(scores)[:k]
enhancement help wanted
opened by spencebeecher 0
Proposing a PR to fix a few small typos
Issue Type

[x] Bug (Typo)

Steps to Replicate and Expected Behaviour

Examine pysparnn/matrix_distance.py and observe implmentation, however expect to see implementation.

Examine pysparnn/cluster_index.py and observe reccomended, however expect to see recommended.

Examine pysparnn/matrix_distance.py and observe shortucts, however expect to see shortcuts.

Examine pysparnn/cluster_index.py and observe perameters, however expect to see parameters.

Examine pysparnn/cluster_index.py and observe naievely, however expect to see naively.

Examine pysparnn/cluster_index.py and observe multtiple, however expect to see multiple.

Examine pysparnn/cluster_index.py and observe doccuments, however expect to see documents.

Notes

Semi-automated issue generated by https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been prepared but a pull request has not yet been created. A pull request fixing the issue can be prepared from the link below, feel free to create it or request @timgates42 create the PR. Alternatively if the fix is undesired please close the issue with a small comment about the reasoning.

https://github.com/timgates42/pysparnn/pull/new/bugfix_typos

Thanks.
opened by timgates42 0
Sorting based on `string` type. Problems when distance is very small value ('5.55555E-6')

We have observed some weird behavior seeing small distances at the end of the list of closest matches.

I believe the problem is that the distances are returned as string types and then the sorting does not work

https://github.com/facebookresearch/pysparnn/blob/c299c825fd99f263f3957e9b31197daf23a1e7a3/pysparnn/cluster_index.py#L26

opened by JoanFM 1
About the sparse search result compare with lucene searcher ?

Thanks to provide this convenient toolkit I retrieve bm25 and tfidf sparse vector from lucene indexer (provide by pyserini) and use this project to generate sparse indexer to search. i find that these indexer can not beat original lucene search results. (this problem seems not have much effect on tiny datasets or semantic disperse datasets, but with the dataset become larger, the shortcomings seems can not be omitted which is the situation to use this project.)

This is not the problem of your clustering search algorithm. But the sparse feature itself. And if i use SVD to decrease the dimension of sparse data, it can only maintain topic level feature. So i don’t understand the truly usage of sparse feature except calculate some search scores(like bm25) Because they seems weak than truly lexicon based score (bm25) and dense semantic similarity based on BERT sentence embedding (like Sentence-Transformers)

Can you provide some truly awesome text sparse feature construction reference materials that can use this project in a suitable way ?

opened by svjack 0
What is the impact of using dense vectors?

From a very quick test with a small index, this seems to work well with dense vectors (I tried d=300), but is there any specific impact of using dense vectors on performance for building or searching the index?

opened by johann-petrak 3
the memory is leak when using pickle.load(MultiClusterInde object)

python: 3.6 os: ubuntu numpy:1.16.1 scipy:1.0.0

cp = ci.MultiClusterIndex(features, data_to_return) with open('cluterIndex.bin', 'wb') as fh_out: pickle.dump(cp, fh_out)

obj_file = 'cluterIndex.bin' def test_memory_leak(obj_file): while True: with open('cluterIndex.bin', 'rb') as file_: cp =pickle.load(file_) if name == 'main': test_memory_leak(obj_file )

opened by liuchenbaidu 0

Owner

Meta Research

GitHub

Example Of Splunk Search Query With Python And Splunk Python SDK

SSQAuto (Splunk Search Query Automation) Example Of Splunk Search Query With Python And Splunk Python SDK installation: ➜ ~ git clone https://github.c

1 Nov 14, 2021

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

3.7k Jan 3, 2023

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Data lineage made simple, reliable, and automated. Effortlessly track the flow of data, understand dependencies and analyze impact. Features Visualiza

898 Jan 9, 2023

Python Package for DataHerb: create, search, and load datasets.

The Python Package for DataHerb A DataHerb Core Service to Create and Load Datasets.

4 Feb 11, 2022

Instant search for and access to many datasets in Pyspark.

SparkDataset Provides instant access to many datasets right from Pyspark (in Spark DataFrame structure). Drop a star if you like the project. ?? Motiv

31 Dec 16, 2022

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

???? ??. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

97 Dec 8, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

1 Jan 16, 2022

Catalogue data - A Python Scripts to prepare catalogue data

catalogue_data Scripts to prepare catalogue data. Setup Clone this repo. Install

3 Mar 3, 2022

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

Fast Data Science, AKA fds, is a CLI for Data Scientists to version control data and code at once, by conveniently wrapping git and dvc

359 Dec 22, 2022

A data parser for the internal syncing data format used by Fog of World.

A data parser for the internal syncing data format used by Fog of World. The parser is not designed to be a well-coded library with good performance, it is more like a demo for showing the data structure.

40 Dec 12, 2022

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Functional Data Analysis Python package

Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid

184 Dec 27, 2022

Fancy data functions that will make your life as a data scientist easier.

WhiteBox Utilities Toolkit: Tools to make your life easier Fancy data functions that will make your life as a data scientist easier. Installing To ins

3 Oct 3, 2022

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Processing NYC Taxi Data using PySpark ETL pipeline Description This is an project to extract, transform, and load large amount of data from NYC Taxi

2 Dec 12, 2021

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

2 Nov 20, 2021

Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

1 Dec 27, 2021

PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift This project is composed of two parts: Part1 and Part2

1 Jan 19, 2022

Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

27 Nov 1, 2022

PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.

11 Nov 24, 2022

Approximate Nearest Neighbor Search for Sparse Data in Python!

Related tags

Overview

PySparNN

Example Usage

Simple Example

Text Example

Requirements

Installation

How PySparNN works

Further Information

License

Comments

123c123 < records_index = np.arange(features.shape[0])

131c131 < np.arange(clusters_selection.shape[0]))

223c223 < if feature <> None and record <> None:

123c123,124 < arg_index = np.random.choice(len(scores), k, replace=False)

7c7 < from cluster_pruning import ClusterIndex, MultiClusterIndex

Issue Type

Steps to Replicate and Expected Behaviour

Notes

Owner

Meta Research

Example Of Splunk Search Query With Python And Splunk Python SDK

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

Python Package for DataHerb: create, search, and load datasets.

Instant search for and access to many datasets in Pyspark.

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Python data processing, analysis, visualization, and data operations

Catalogue data - A Python Scripts to prepare catalogue data

fds is a tool for Data Scientists made by DAGsHub to version control data and code at once.

A data parser for the internal syncing data format used by Fog of World.

Functional Data Analysis, or FDA, is the field of Statistics that analyses data that depend on a continuous parameter.

Fancy data functions that will make your life as a data scientist easier.

A Big Data ETL project in PySpark on the historical NYC Taxi Rides data

Created covid data pipeline using PySpark and MySQL that collected data stream from API and do some processing and store it into MYSQL database.

Utilize data analytics skills to solve real-world business problems using Humana’s big data

PrimaryBid - Transform application Lifecycle Data and Design and ETL pipeline architecture for ingesting data from multiple sources to redshift

Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

PostQF is a user-friendly Postfix queue data filter which operates on data produced by postqueue -j.