Non-Metric Space Library (NMSLIB): An efficient similarity search library and a toolkit for evaluation of k-NN methods for generic non-metric spaces.

Overview


Non-Metric Space Library (NMSLIB)

Important Notes

  • NMSLIB is generic but fast; see the results of ANN benchmarks.
  • A standalone implementation of our fastest method, HNSW, also exists as a header-only library.
  • All the documentation (including using the Python bindings and the query server, descriptions of methods and spaces, building the library, etc.) can be found on this page.
  • For generic questions/inquiries, please use the Gitter chat; the GitHub issues page is for bugs and feature requests.

Objectives

Non-Metric Space Library (NMSLIB) is an efficient cross-platform similarity search library and a toolkit for evaluation of similarity search methods. The core library does not have any third-party dependencies. The library has been gaining popularity; in particular, it has become a part of Amazon Elasticsearch Service.

The goal of the project is to create an effective and comprehensive toolkit for searching in generic and non-metric spaces. Even though the library contains a variety of metric-space access methods, our main focus is on generic and approximate search methods, in particular on methods for non-metric spaces. NMSLIB is possibly the first library with principled support for non-metric space searching.

NMSLIB is an extendible library, which means that it is possible to add new search methods and distance functions. NMSLIB can be used directly in C++ and Python (via Python bindings). In addition, it is also possible to build a query server, which can be used from Java (or other languages supported by Apache Thrift, version 0.12). Java has a native client, i.e., it works on many platforms without requiring a C++ library to be installed.
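
As a quick illustration of the Python bindings, here is a minimal sketch of building and querying an HNSW index (the data and parameter values are illustrative; the API calls match the examples in the issues below):

    import numpy as np
    import nmslib

    # 1000 random 32-dimensional vectors serve as the indexed data
    data = np.random.randn(1000, 32).astype(np.float32)

    index = nmslib.init(method='hnsw', space='cosinesimil')
    index.addDataPointBatch(data)
    index.createIndex({'M': 16, 'efConstruction': 200}, print_progress=False)
    index.setQueryTimeParams({'ef': 50})

    # ids and distances of the 5 approximate nearest neighbors of the first point
    ids, distances = index.knnQuery(data[0], k=5)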

Authors: Bilegsaikhan Naidan, Leonid Boytsov, Yury Malkov, David Novak. With contributions from Ben Frederickson, Lawrence Cayton, Wei Dong, Avrelin Nikita, Dmitry Yashunin, Bob Poekert, @orgoro, @gregfriedland, Scott Gigante, Maxim Andreev, Daniel Lemire, Nathan Kurz, Alexander Ponomarenko.

Brief History

NMSLIB started as a personal project of Bilegsaikhan Naidan, who created the initial code base, the Python bindings, and participated in earlier evaluations. The most successful class of methods, neighborhood/proximity graphs, is represented by the Hierarchical Navigable Small World graph (HNSW) due to Malkov and Yashunin (see the publications below). Other particularly useful methods include a modification of the VP-tree due to Boytsov and Naidan (2013), a Neighborhood APProximation index (NAPP) proposed by Tellez et al. (2013) and improved by David Novak, as well as a vanilla uncompressed inverted file.

Credits and Citing

If you find this library useful, feel free to cite our SISAP paper [BibTex] as well as other papers listed at the end. One crucial contribution to cite is the fast Hierarchical Navigable Small World graph (HNSW) method [BibTex]. Please also check out the stand-alone HNSW implementation by Yury Malkov, which is released as the header-only HNSWLib library.

License

The code is released under the Apache License Version 2.0 (http://www.apache.org/licenses/). Older versions of the library included additional components, which have different licenses (this does not apply to NMSLIB 2.x):

  • The LSHKIT, which is embedded in our library, is distributed under the GNU General Public License, see http://www.gnu.org/licenses/.
  • The k-NN graph construction algorithm NN-Descent due to Dong et al. 2011 (see the links below), which is also embedded in our library, appears to be covered by a free-to-use license similar to Apache 2.
  • The FALCONN library's license is MIT.

Funding

Leonid Boytsov was supported by the Open Advancement of Question Answering Systems (OAQA) group and by NSF grant #1618159: "Matching and Ranking via Proximity Graphs: Applications to Question Answering and Beyond". Bileg was supported by the iAd Center.

Related Publications

The most important related papers are listed below in chronological order:

Comments
  • Add support to build aarch64 wheels

    Travis-CI allows for the creation of aarch64 wheels.

    Build: https://travis-ci.com/github/janaknat/nmslib/builds/205780637

    There are 8-9 failures when testing hnsw. Any suggestions on how to fix these? A majority of the failures are due to expected=0.99 and calculated=~0.98.

    Tagging @jmazanec15 since he added ARM compatibility.

    opened by janaknat 33
  • Speed up pip install

    Currently pip installing is slow, since there is a compile step. Is there any way to speed it up? On my macbook:

    time pip install --no-cache nmslib
    Collecting nmslib
      Downloading https://files.pythonhosted.org/packages/e1/95/1f7c90d682b79398c5ee3f9296be8d2640fa41de24226bcf5473c801ada6/nmslib-1.7.3.6.tar.gz (255kB)
        100% |████████████████████████████████| 256kB 8.8MB/s 
    Requirement already satisfied: pybind11>=2.0 in .../virtualenv/python3.6/lib/python3.6/site-packages (from nmslib) (2.2.4)
    Requirement already satisfied: numpy in .../virtualenv/python3.6/lib/python3.6/site-packages (from nmslib) (1.15.4)
    Installing collected packages: nmslib
      Running setup.py install for nmslib ... -
    done
    Successfully installed nmslib-1.7.3.6
    
    real	3m11.091s
    

    Would it be a good idea to provide pre-compiled wheels over pip? That would also simplify the process of finding the pybind11 headers (I had to do something special to copy them in for pip when running with a --target dir).

    opened by matthen 33
  • Can't load index?

    Hi, this might be more of a question than a problem in the library. I have created an index with NAPP and saved it using saveIndex. However, when I load it with loadIndex I get the following error:

    Check failed: A previously saved index is apparently used with a different data set, a different data set split, and/or a different gold standard file! (detected an object index >= #of data points

    Am I doing something wrong?

    Thanks for the help.

    EDIT: The message doesn't make sense to me because I'm not "using the index with a data set", I'm just loading it.

    EDIT2: I'm using the Python interface.

    enhancement 
    opened by zommerfelds 31
  • Custom Metrics

    Hello,

    I wanted to perform NN search on a dataset of genomes. For this task, the distance between two data points is calculated by a custom script. Is there a way I can incorporate this without having to create the entire NN search algorithm myself and only modify some parts of your code?

    opened by Chokerino 30
  • Python process crashes: 'pybind11::error_already_set'

    nmslib is the only lib in our project that relies on pybind11, and we could narrow the crash down to the Dask nodes that use nmslib. When we disable the nodes that use nmslib, it doesn't crash.

    terminate called after throwing an instance of 'pybind11::error_already_set'
      what():  TypeError: '>=' not supported between instances of 'int' and 'NoneType'
    
    At:
      /opt/conda/envs/jobnet-env/lib/python3.6/logging/__init__.py(1546): isEnabledFor
      /opt/conda/envs/jobnet-env/lib/python3.6/logging/__init__.py(1293): debug
    
    /usr/local/bin/entrypoint.sh: line 46:    21 Aborted                 (core dumped) python scripts/cli.py "${@:2}"
    

    Version:

    - nmslib~=1.7.2
    - pybind11=2.2
    
    opened by lukin0110 28
  • Make failed in linking Boost library

    Hello,

    I am facing an error in this step:

    [ 75%] Linking CXX executable ../release/experiment

    All of the errors look like this:

    undefined reference to `boost::program_options:

    I installed the latest library versions and checked that libboost 1.58 is compatible with g++ 4.9. I think it may be related to C++11; however, it returns errors with both g++ 4.9 and 4.7.

    This is my system information:

    -- The C compiler identification is GNU 4.9.3
    -- The CXX compiler identification is GNU 4.9.3
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Build type: Release
    -- GSL using gsl-config /usr/bin/gsl-config
    -- Using GSL from /usr
    -- Found GSL.
    -- Found Eigen3: /usr/include/eigen3 (Required is at least version "3")
    -- Found Eigen3.
    -- Boost version: 1.58.0
    -- Found the following Boost libraries:
    --   system
    --   filesystem
    --   program_options
    -- Found BOOST.

    I also installed Clang and LLDB 3.6. I tried searching for possible solutions but could not fix it :(.

    opened by nguyenv7 26
  • Python wrapper crashes while retrieving nearest neighbors when M>100

    Hi, I am working on a problem where I need to retrieve ~500 nearest neighbors out of a million points. I am using the Python wrapper for the HNSW method. The code works perfectly well if I set the value of the parameter M <= 100, but with M greater than 100 the code crashes while retrieving nearest neighbors (there are no issues while building the model) with an "invalid next size" error. Any idea why this might be happening? Thanks, Himanshu

    bug 
    opened by hjain689 25
  • Incorrect distances returned for all-zero query

    An all-zero query vector will result in NMSLib incorrectly reporting a distance of zero for its nearest neighbours (see example below). Is this related to #187? Is there a suggested workaround?

    # Training set (CSR sparse matrix)
    X.todense()
    # Out:
    # matrix([[4., 2., 3., 1., 0., 0., 0., 0., 0.],
    #         [2., 1., 0., 0., 3., 0., 1., 2., 1.],
    #         [4., 2., 0., 0., 3., 1., 0., 0., 0.]], dtype=float32)
    
    # Query vector (CSR sparse matrix)
    r.todense()
    # Out:
    # matrix([[0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)
    
    # Train and query
    import nmslib
    index = nmslib.init(
        method='hnsw',
        space='cosinesimil_sparse_fast',
        data_type=nmslib.DataType.SPARSE_VECTOR,
        dtype=nmslib.DistType.FLOAT)
    index.addDataPointBatch(X)
    index.createIndex()
    index.knnQueryBatch(r, k=3)
    # Out:
    # [(array([2, 1, 0], dtype=int32), array([0., 0., 0.], dtype=float32))]
    
    # Note that distances are all 0, which is incorrect!
    # Same result for dense training & query vectors.
    
    bug 
    opened by lsorber 24
  • Jaccard for the HNSW method with sparse features

    Hi,

    I want to know if HNSW provides Jaccard (similarity or distance, it does not matter), besides cosine, for sparse features. There are scenarios in which Jaccard outperforms cosine.

    The Python notebooks provided show the following metrics: l2, l2sqr_sift, cosinesimil_sparse.

    According to space_sparse_scalar.h, the following metrics seem to be implemented, or in preparation, for sparse features:

    #define SPACE_SPARSE_COSINE_SIMILARITY "cosinesimil_sparse"
    #define SPACE_SPARSE_ANGULAR_DISTANCE "angulardist_sparse"
    #define SPACE_SPARSE_NEGATIVE_SCALAR "negdotprod_sparse"
    #define SPACE_SPARSE_QUERY_NORM_NEGATIVE_SCALAR "querynorm_negdotprod_sparse"
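
    For reference, standard definitions consistent with these names and with the v1.6 release notes below (the library's exact conventions may differ slightly):

    d_{\mathrm{cosinesimil}}(x,q) = 1 - \frac{x \cdot q}{\lVert x\rVert\,\lVert q\rVert}
    d_{\mathrm{angulardist}}(x,q) = \arccos\frac{x \cdot q}{\lVert x\rVert\,\lVert q\rVert}
    d_{\mathrm{negdotprod}}(x,q) = -\,x \cdot q
    d_{\mathrm{querynorm\_negdotprod}}(x,q) = -\,\frac{x \cdot q}{\lVert q\rVert}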

    What does each of these metrics mean? I also saw cosinesimil_sparse_fast in a few files. What is it, and how does it compare to cosinesimil_sparse? Is it ready for use?

    I can provide a Jaccard implementation for sparse vectors, given two vectors implemented as hash tables, but I haven't found out how to integrate it into the code. It would also be preferable to check which metrics are already available. The closest clue I got was to expand the following files: distcomp_scalar.cc, hnsw.cc, and hnsw_distfunc_opt.cc, but I am not sure what steps to take. I saw some mentions of Jaccard in space_sparse_jaccard.cc and distcomp.h, but no examples are given.

    Thanks in advance.

    opened by icarocd 24
  • pybind11.h not found when installing using pip

    I'm trying to install the Python bindings on an Ubuntu 16.04 machine:

    $ pip3 install pybind11 nmslib
    Collecting nmslib
      Using cached https://files.pythonhosted.org/packages/de/eb/28b2060bb1750426c5618e3ad6ce830ac3cfd56cb3eccfb799e52d6064db/nmslib-1.7.2.tar.gz
    Requirement already satisfied: pybind11>=2.0 in /homes/alexandrov/.virtualenvs/pytorch/lib/python3.5/site-packages (from nmslib) (2.2.2)
    Requirement already satisfied: numpy in /homes/alexandrov/.virtualenvs/pytorch/lib/python3.5/site-packages (from nmslib) (1.14.2)
    Building wheels for collected packages: nmslib
      Running setup.py bdist_wheel for nmslib ... error
      Complete output from command /homes/alexandrov/.virtualenvs/pytorch/bin/python3 -u -c "import setuptools, tokenize;__file__='/tmp/pip-install-0y71oxa4/nmslib/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d /tmp/pip-wheel-916r1rr9 --python-tag cp35:
      running bdist_wheel
      running build
      running build_ext
      creating tmp
      x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/homes/alexandrov/.virtualenvs/pytorch/include/python3.5m -c /tmp/tmpwekdswov.cpp -o tmp/tmpwekdswov.o -std=c++14
      cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
      x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I/usr/include/python3.5m -I/homes/alexandrov/.virtualenvs/pytorch/include/python3.5m -c /tmp/tmpyyphh022.cpp -o tmp/tmpyyphh022.o -fvisibility=hidden
      cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
      building 'nmslib' extension
      creating build
      creating build/temp.linux-x86_64-3.5
      creating build/temp.linux-x86_64-3.5/nmslib
      creating build/temp.linux-x86_64-3.5/nmslib/similarity_search
      creating build/temp.linux-x86_64-3.5/nmslib/similarity_search/src
      creating build/temp.linux-x86_64-3.5/nmslib/similarity_search/src/method
      creating build/temp.linux-x86_64-3.5/nmslib/similarity_search/src/space
      x86_64-linux-gnu-gcc -pthread -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -g -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -fPIC -I./nmslib/similarity_search/include -Iinclude -Iinclude -I/homes/alexandrov/.virtualenvs/pytorch/lib/python3.5/site-packages/numpy/core/include -I/usr/include/python3.5m -I/homes/alexandrov/.virtualenvs/pytorch/include/python3.5m -c nmslib.cc -o build/temp.linux-x86_64-3.5/nmslib.o -O3 -march=native -fopenmp -DVERSION_INFO="1.7.2" -std=c++14 -fvisibility=hidden
      cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
      nmslib.cc:16:31: fatal error: pybind11/pybind11.h: No such file or directory
      compilation terminated.
      error: command 'x86_64-linux-gnu-gcc' failed with exit status 1
    

    Clearly, pybind11 headers were not installed on my machine. This library is not packaged for apt-get (at least not for Ubuntu 16.04), so I needed to manually install from source.

    It would be nice if the nmslib install script took care of this.

    opened by taketwo 23
  • Optimized index raises RuntimeError on load when saved with `negdotprod` space

    Basically, this is what I am trying to do:

    import nmslib
    
    space = 'negdotprod'
    
    vectors = [[1, 2], [3, 4], [5, 6]]
    
    index = nmslib.init(space=space, method='hnsw')
    index.addDataPointBatch(vectors)
    index.createIndex(
        {'M': 15, 'efConstruction': 200, 'skip_optimized_index': 0, 'post': 0}
    )
    index.saveIndex('test.index')
    
    new_index = nmslib.init(space=space, method='hnsw')
    new_index.loadIndex('test.index')
    

    and it raises

    Check failed: totalElementsStored_ == this->data_.size() The number of stored elements 3 doesn't match the number of data points ! Did you forget to re-load data?
    Traceback (most recent call last):
      File "8.py", line 15, in <module>
        new_index.loadIndex('test.index')
    RuntimeError: Check failed: The number of stored elements 3 doesn't match the number of data points ! Did you forget to re-load data?
    

    If I change the space variable to cosinesimil, it works just fine. It seems that the data points are not stored, even though the hnsw method with skip_optimized_index=0 is used.
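
    A hedged workaround sketch, assuming nmslib >= 1.8, where saveIndex/loadIndex accept flags to persist the raw data together with the index (see the v1.8 release notes below):

    index.saveIndex('test.index', save_data=True)    # also store the data points

    new_index = nmslib.init(space=space, method='hnsw')
    new_index.loadIndex('test.index', load_data=True)  # re-load the stored data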

    opened by chomechome 22
  • Unable to pip install nmslib, including historic versions

    Hey sorry to bother you,

    I've been trying to download scispacy via pip on Windows 10 using Python 3.10.0 today, and it keeps failing due to errors about nmslib. I've tried pip installing nmslib versions 1.7.3.6, 1.8, and 2.1.1.

    None of them have worked, though, curiously. I've had a long look around scispacy's GitHub and yours, but nothing I've read has given me any solutions.

    I've also flagged it with scispacy on their GitHub. Anyway, I have no idea what's going on but just thought I'd let you know. Cheers and kind regards, Chris

    opened by Cbezz 5
  • Strict typing is needed: Using wrong input can cause distances to be all one, e.g., with cosinesimil_sparse/HNSW when calling knnQueryBatch on a dense array

    Hey, I'm trying to use nmslib's HNSW with a csr_matrix containing sparse vectors.

    Creating the index works fine, adding the data and setting query time params too:

        items = ["foo is a kind of thing", "bar is another one", "this bar is a real one!", "I prefer to use a foo"] # etc, len=3000
        similar_items_index = nmslib.init(
            space="cosinesimil_sparse",
            method="hnsw",
            data_type=nmslib.DataType.SPARSE_VECTOR,
            dtype=nmslib.DistType.FLOAT,
        )
        vectorizer = TfidfVectorizer(dtype=np.float32, token_pattern=r"\S+")
        embeddings: csr_matrix = vectorizer.fit_transform(items)
        similar_items_index.addDataPointBatch(embeddings)
        similar_items_index.createIndex({"M": 128, "efConstruction": 32, "post": 2}, print_progress=False)
        similar_items_index.setQueryTimeParams({"ef": 512})
    

    But when I search with knnQueryBatch, all the returned distances are equal to 1:

    similar_items_index.knnQueryBatch([query_embedding], 5)[0]
    

    -> Knn results: ids, with distances all set to 1

    Am I missing something in the proper usage of HNSW with sparse vector data?

    Setup for reproduction
    • This uses the text-similarity data from Kaggle, downloaded in /tmp/. Any other text dataset should be fine, as computing similarity scores is not required to see the problem with returned distances.
    
    import csv
    from typing import Dict
    
    import nmslib
    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    CSV_PATH = "/tmp/data/"
    
    
    def main():
        similar_items_index = nmslib.init(
            space="cosinesimil_sparse",
            method="hnsw",
            data_type=nmslib.DataType.SPARSE_VECTOR,
            dtype=nmslib.DistType.FLOAT,
        )
        items = set()
        ids: Dict[str, int] = {}
        rids: Dict[int, str] = {}
        similarities = {}
        for file in [
            f"{CSV_PATH}/similarity-test.csv",
            f"{CSV_PATH}/similarity-train.csv",
        ]:
            with open(file) as f:
                reader = csv.reader(f, delimiter=",", quotechar="|")
                header = next(reader)
                for i, l in enumerate(reader):
                    desc_x = l[header.index("description_x")]
                    desc_y = l[header.index("description_y")]
                    similar = bool(l[header.index("same_security")])
                    id = len(items)
                    if desc_x not in items:
                        items.add(desc_x)
                        ids[desc_x] = id
                        rids[id] = desc_x
                        id_x = id
                        id += 1
                    else:
                        id_x = ids[desc_x]
                    if desc_y not in items:
                        items.add(desc_y)
                        ids[desc_y] = id
                        rids[id] = desc_y
                        id_y = id
                        id += 1
                    else:
                        id_y = ids[desc_y]
                    if similar:
                        similarities[id_x] = id_y
                        similarities[id_y] = id_x
        print(f"Loaded {len(items)}, total {len(similarities)/2} pairs of similar queries.")
        vectorizer = TfidfVectorizer(dtype=np.float32, token_pattern=r"\S+")
        embeddings: csr_matrix = vectorizer.fit_transform(items)
        print("Embedded items, adding datapoints..")
        similar_items_index.addDataPointBatch(embeddings)
        print("Creating index..")
        similar_items_index.createIndex({"M": 128, "efConstruction": 32, "post": 2}, print_progress=False)
        print("Setting index query params..")
        similar_items_index.setQueryTimeParams({"ef": 512})
        print("Searching...")
        score = 0
        total_similar = 0
        for item_id, item in enumerate(items):
            query_embedding = vectorizer.transform([item]).getrow(0).toarray()
            top_50, distances = similar_items_index.knnQueryBatch([query_embedding], 50)[0]
            top_50_texts = [rids[t] for t in top_50]
            try:
                expected = similarities[item_id]
                expected_text = rids[expected]
                if expected:
                    score += 1 if expected in top_50 else 0
            except KeyError:
                continue  # No similar noted on this item.
            total_similar += 1
        print(
            f"After querying {len(items)} of which {total_similar}, we found the similar item in the top50 {score} times."
        )
    
    
    if __name__ == "__main__":
        main()
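
    A hedged sketch of the fix the title suggests: keep the query a sparse csr_matrix (matching DataType.SPARSE_VECTOR) instead of densifying it with toarray():

    # query_embedding stays a 1xN csr_matrix, as the sparse space expects
    query_embedding = vectorizer.transform([item])
    top_50, distances = similar_items_index.knnQueryBatch(query_embedding, k=50)[0]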
    
    opened by PLNech 6
  • More encompassing approach for Mac M1 chips

    On a Mac, platform.processor may return i386 even on a Mac M1. The code below should be more accurate. See this Stack Overflow comment, another Stack Overflow comment, and a Stack Overflow post for some more information / validation that the uname approach is more all-encompassing.

    I was personally running into this problem and the following fix solved it for me.

    This PR is a slightly edited solution to what is contained in https://github.com/nmslib/nmslib/pull/485 with many thanks to @netj for getting this started.
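
    A minimal sketch of the uname-based check described above (illustrative; not the PR's exact diff):

    import platform

    # platform.processor() can report "i386" even on Apple Silicon (e.g., under
    # Rosetta), while platform.uname().machine reports "arm64" on an M1 Mac.
    def is_apple_silicon() -> bool:
        u = platform.uname()
        return u.system == "Darwin" and u.machine == "arm64"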

    opened by JewlsIOB 3
  • Calling setQueryTimeParams results in a SIGSEGV

    Hi there! Trying to perform knnQuery on an indexed csr_matrix, I got the issue reported in #480 from this code:

            model = TfidfVectorizer(dtype=np.float32, token_pattern=r"\S+")
            embeddings = model.fit_transform(corpus_tfidf)
            logger.info(f"Creating vector index from a {len(corpus_tfidf)} corpus embedded as {embeddings.shape}...")
            index = nmslib.init(method="hnsw", space="cosinesimil_sparse", data_type=nmslib.DataType.SPARSE_VECTOR, dtype=nmslib.DistType.FLOAT)
            logger.info("Adding datapoints to index...")
            index.addDataPointBatch(embeddings)
            logger.info("Creating final index...")
            index.createIndex()
    
        logger.info(f"Searching neighbors for first embedding {embeddings[0]}")
            index.knnQuery(embeddings[0])
    

    As described in #480, this results in an IndexError: tuple index out of range.

    When trying to apply the index.setQueryTimeParams({'efSearch': efS, 'algoType': 'old'}) workaround mentioned in another issue, it results in a segmentation fault.

    I can reproduce it with the following minimal example; it looks like the call errors even without arguments:

    index = nmslib.init(method="hnsw", space="cosinesimil_sparse", data_type=nmslib.DataType.SPARSE_VECTOR, dtype=nmslib.DistType.FLOAT)
    print("Setting index queryParams...")
    index.setQueryTimeParams()
    print("Adding datapoints to index...")
    

    ->

    Setting index queryParams...
    Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
    

    Env info

    • python -V -> Python 3.7.11
    • pip freeze | grep nmslib -> nmslib==2.1.1
    opened by PLNech 3
  • NMSLIB doesn't work on Windows 11

    Hello,

    We use nmslib as the default engine for TensorFlow Similarity due to its broad compatibility with various OSes. We got multiple reports, which I was able to confirm, that nmslib doesn't install on Windows 11, potentially related to issue #498.

    Do you have any idea if/when you will be able to take a look at this? With the increased adoption of Win11, it is becoming problematic for us.

    Thanks :)

    opened by ebursztein 15
Releases(v2.1.1)
  • v2.1.1(Feb 3, 2021)

    Note: We unfortunately had deployment issues. As a result, we had to delete several versions between 2.0.6 and 2.1.1. If you installed one of these versions, please delete it and install a more recent version (>=2.1.1).

    The current build focuses on:

    1. Providing more efficient ("optimized") implementations for the spaces negdotprod, l1, and linf (see the sketch after this list).
    2. Binaries for ARM 64 (aarch64).
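
    A hedged sketch of selecting one of these spaces from the Python bindings (the method choice is illustrative):

    import nmslib

    # any of the newly optimized dense-vector spaces can be requested by name
    index = nmslib.init(method='hnsw', space='l1')  # likewise 'linf' or 'negdotprod'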
  • v2.0.6(Apr 16, 2020)

  • v2.0.5(Nov 7, 2019)

    The main objective of this release is to provide binary wheels. For compatibility reasons, we need to stick to basic SSE2 instructions. However, when the Python library is imported, it prints a message suggesting that a more efficient version can be installed from sources (and explains how to do this).
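
    Assuming the standard pip flag for forcing a source build, installing the optimized version from sources would look like this (the import-time message remains the authoritative instruction):

    pip install --no-binary :all: nmslib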

    Furthermore, this release removes a lot of old code, which speeds up compilation by 70%:

    1. Non-performing methods
    2. Double-indices

    This is a step towards a more lightweight NMSLIB library.

  • v1.8.1(Jun 23, 2019)

  • v1.8(Jun 6, 2019)

    This is a clean-up release focusing on several important issues:

    1. Fixing a bug with knnQuery #370
    2. Added the possibility to save/load data efficiently from the Python bindings (and the query server) #356; the Python notebooks are updated accordingly
    3. We now have a bit Jaccard space (many thanks, @gregfriedland)
    4. Upgraded the query server to use a recent Apache Thrift
    5. Importantly, the documentation is reorganized quite a bit:
      1. There is now a single entry point for all the docs.
      2. Most of the docs are now online; only the fairly technical description of search spaces and methods remains in the PDF manual.
  • v1.7.3.6(Oct 4, 2018)

  • v1.7.3.4(Aug 6, 2018)

  • v1.7.3.2(Jul 13, 2018)

  • v1.7.3.1(Jul 9, 2018)

  • v1.7.2(Feb 20, 2018)

    1. Improving concurrency in Python (preventing hanging in a certain situation https://github.com/searchivarius/nmslib/issues/291)
    2. Improving ParallelFor: passing the thread ID and not starting threads in single-thread mode.
  • v1.7(Feb 4, 2018)

  • v1.6(Dec 15, 2016)

    Here is the list of changes for version 1.6 (the manual isn't updated yet):

    We especially thank the following people for the fixes:

    • Bileg Naidan (@bileg)
    • Bob Poekert (@bobpoekert)
    • @orgoro
    1. We simplified the build by excluding the code that required third-party libraries from the core library. In other words, the core library does not have any third-party dependencies (not even Boost). To build the full version of the library, run cmake as follows: cmake . -DWITH_EXTRAS=1
    2. It should now be possible to build on a Mac.
    3. We improved the Python bindings (thanks to @bileg) and their installation process (thanks to @bobpoekert):
      1. We merged our generic and vector bindings into a single module. We upgraded to a more standard installation process via distutils. You can run: python setup.py build and then sudo python setup.py install.
      2. We improved our support for sparse spaces: you can pass data in the form of a SciPy sparse matrix!
      3. There are now batch multi-threaded querying and addition of data.
      4. addDataPoint* functions return the position of an inserted entry. This can be useful if you use the function getDataPoint.
      5. For examples of using the Python API, please see the *.py files in the folder python_bindings.
      6. Note that to execute unit tests you need: python-numpy, python-scipy, and python-pandas.
    4. Because we got rid of Boost, we, unfortunately, do not support command-line options WITHOUT arguments. Instead, you have to pass the values 0 or 1.
    5. However, the utility experiment (experiment.exe) now accepts the option recallOnly. If this option has the argument 1, then the only effectiveness metric computed is recall. This is useful for the evaluation of HNSW, because (for efficiency reasons) HNSW does not return proper distance values (e.g., for L2 it returns a squared distance, not the original one). This makes it impossible to compute effectiveness metrics other than recall (returning wrong distance values would also lead to the experiment terminating with an error message).
    6. Additional spaces:
      1. negdotprod_sparse: negative inner (dot) product. This is a sparse space.
      2. querynorm_negdotprod_sparse: query-normalized inner (dot) product, which is the dot product divided by the query norm.
      3. renyi_diverg: Rényi divergence. It has the parameter alpha (see the formula after this list).
      4. ab_diverg: α-β-divergence. It has two parameters: alpha and beta.
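
      For reference, a standard form of the Rényi divergence with parameter alpha (the library's exact normalization may differ):

      D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \sum_{i} p_i^{\alpha}\, q_i^{\,1-\alpha}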
    7. Additional search methods:
      1. simple_invindx: A classical inverted index with document-at-a-time processing (via a priority queue). It doesn't have parameters, but works only with the sparse space negdotprod_sparse.
      2. falconn: we ported (created a wrapper for) the June 2016 version of the FALCONN library.
        1. Unlike the original implementation, our wrapper works directly with sparse vector spaces as well as with dense vector spaces.
        2. However, our wrapper has to store the data twice, so this method is useful mostly as a benchmark.
        3. Our wrapper directly supports a data centering trick, which can boost performance sometimes.
        4. Most parameters (hash_family, cross_polytope, hyperplane, storage_hash_table, num_hash_bits, num_hash_tables, num_probes, num_rotations, seed, feature_hashing_dimension) merely map to FALCONN parameters.
        5. Setting the additional parameters norm_data and center_data tells us to center and normalize the data. Our implementation of the centering for sparse data (which is unfortunately done before the hashing trick is applied) is horribly inefficient, so we wouldn't recommend using it. Besides, it doesn't seem to improve results. Just in case, the number of sparse dimensions used for centering is controlled by the parameter max_sparse_dim_to_center.
        6. Our FALCONN wrapper would normally use the distance provided by NMSLIB, but you can force the use of FALCONN's distance function implementation by setting use_falconn_dist to 1.
  • v1.5.3(Jul 11, 2016)

  • v1.5.2(Jul 2, 2016)

  • v1.5.1(Jun 1, 2016)

  • v1.5(May 20, 2016)

    1. A new efficient method: a hierarchical (navigable) small-world graph (HNSW), contributed by Yury Malkov (@yurymalkov). Works with g++, Visual Studio, Intel Compiler, but doesn't work with Clang yet.
    2. A query server, which can have clients in C++, Java, Python, and other languages supported by Apache Thrift
    3. Python bindings for vector and non-vector spaces
    4. Improved performance of two core methods SW-graph and NAPP
    5. Better handling of the gold standard data in the benchmarking utility experiment
    6. Updated API that permits search methods to serialize indices
    7. Improved documentation (e.g., we added tuning guidelines for best methods)
This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales

Intro This is the repository for CVPR2021 Dynamic Metric Learning: Towards a Scalable Metric Space to Accommodate Multiple Semantic Scales Vehicle Sam

null 39 Jul 21, 2022
Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

Implementation of temporal pooling methods studied in [ICIP'20] A Comparative Evaluation Of Temporal Pooling Methods For Blind Video Quality Assessment

Zhengzhong Tu 5 Sep 16, 2022
TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

TensorFlow Similarity is a python package focused on making similarity learning quick and easy.

null 912 Jan 8, 2023
Sharpened cosine similarity torch - A Sharpened Cosine Similarity layer for PyTorch

Sharpened Cosine Similarity A layer implementation for PyTorch Install At your c

Brandon Rohrer 203 Nov 30, 2022
Paper: Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification

Cross-View Kernel Similarity Metric Learning Using Pairwise Constraints for Person Re-identification T M Feroz Ali, Subhasis Chaudhuri, ICVGIP-20-21

T M Feroz Ali 3 Jun 17, 2022
Densely Connected Search Space for More Flexible Neural Architecture Search (CVPR2020)

DenseNAS The code of the CVPR2020 paper Densely Connected Search Space for More Flexible Neural Architecture Search. Neural architecture search (NAS)

Jamin Fong 291 Nov 18, 2022
Evaluation and Benchmarking of Speech Super-resolution Methods

Speech Super-resolution Evaluation and Benchmarking What this repo do: A toolbox for the evaluation of speech super-resolution algorithms. Unify the e

Haohe Liu (刘濠赫) 84 Dec 20, 2022
A PyTorch-based open-source framework that provides methods for improving the weakly annotated data and allows researchers to efficiently develop and compare their own methods.

Knodle (Knowledge-supervised Deep Learning Framework) - a new framework for weak supervision with neural networks. It provides a modularization for se

null 93 Nov 6, 2022
Deep Image Search is an AI-based image search engine that includes deep transfer learning feature extraction and tree-based vectorized search.

Deep Image Search - AI-Based Image Search Engine Deep Image Search is an AI-based image search engine that includes deep transfer learning features Ex

null 139 Jan 1, 2023
Space robot - (Course Project) Using the space robot to capture the target satellite that is disabled and spinning, then stabilize and fix it up

Space robot - (Course Project) Using the space robot to capture the target satellite that is disabled and spinning, then stabilize and fix it up

Mingrui Yu 3 Jan 7, 2022
Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

This repository is the official PyTorch implementation of Cascaded Deep Video Deblurring Using Temporal Sharpness Prior and Non-local Spatial-Temporal Similarity

hippopmonkey 4 Dec 11, 2022
aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)

Bayesian Methods for Hackers Using Python and PyMC The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chap

Cameron Davidson-Pilon 25.1k Jan 2, 2023
A variational Bayesian method for similarity learning in non-rigid image registration (CVPR 2022)

A variational Bayesian method for similarity learning in non-rigid image registration We provide the source code and the trained models used in the re

daniel grzech 14 Nov 21, 2022
A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.

A curated list of awesome resources related to Semantic Search🔎 and Semantic Similarity tasks.

null 224 Jan 4, 2023
A deep learning based semantic search platform that computes similarity scores between provided query and documents

semanticsearch This is a deep learning based semantic search platform that computes similarity scores between provided query and documents. Documents

null 1 Nov 30, 2021
FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization

FuseDream This repo contains code for our paper (paper link): FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimizat

XCL 191 Dec 31, 2022
Reference implementation of code generation projects from Facebook AI Research. General toolkit to apply machine learning to code, from dataset creation to model training and evaluation. Comes with pretrained models.

This repository is a toolkit to do machine learning for programming languages. It implements tokenization, dataset preprocessing, model training and m

Facebook Research 408 Jan 1, 2023
An evaluation toolkit for voice conversion models.

Voice-conversion-evaluation An evaluation toolkit for voice conversion models. Sample test pair Generate the metadata for evaluating models. The direc

null 30 Aug 29, 2022