Overview

Benchmarking nearest neighbors


Doing fast searching of nearest neighbors in high dimensional spaces is an increasingly important problem, but so far there have not been many empirical attempts at comparing approaches in an objective way.

This project contains tools to benchmark various implementations of approximate nearest neighbor (ANN) search under different metrics. We have pre-generated datasets (in HDF5 format) and Docker containers for each algorithm, along with a test suite that verifies every algorithm works.

Evaluated

Data sets

We have a number of precomputed data sets for this. All data sets are pre-split into train/test and come with ground truth data in the form of the top 100 neighbors. We store them in HDF5 format:

Dataset        Dimensions  Train size  Test size  Neighbors  Distance   Download
DEEP1B                 96   9,990,000     10,000        100  Angular    HDF5 (3.6GB)
Fashion-MNIST         784      60,000     10,000        100  Euclidean  HDF5 (217MB)
GIST                  960   1,000,000      1,000        100  Euclidean  HDF5 (3.6GB)
GloVe                  25   1,183,514     10,000        100  Angular    HDF5 (121MB)
GloVe                  50   1,183,514     10,000        100  Angular    HDF5 (235MB)
GloVe                 100   1,183,514     10,000        100  Angular    HDF5 (463MB)
GloVe                 200   1,183,514     10,000        100  Angular    HDF5 (918MB)
Kosarak            27,983      74,962        500        100  Jaccard    HDF5 (2.0GB)
MNIST                 784      60,000     10,000        100  Euclidean  HDF5 (217MB)
NYTimes               256     290,000     10,000        100  Angular    HDF5 (301MB)
SIFT                  128   1,000,000     10,000        100  Euclidean  HDF5 (501MB)
Last.fm                65     292,385     50,000        100  Angular    HDF5 (135MB)
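Each file's ground truth is just the exact top-100 neighbors under the stated metric. As a rough illustration (not the project's actual tooling), exact top-k neighbors can be computed by brute force; a pure-Python sketch for the two most common metrics above:

```python
import math

def top_k_neighbors(train, query, k, metric="euclidean"):
    """Exact k nearest neighbors of `query` in `train` by brute force."""
    def dist(a, b):
        if metric == "euclidean":
            return math.dist(a, b)
        # "angular": 1 - cosine similarity
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return 1.0 - dot / (norm(a) * norm(b))
    order = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    return order[:k]

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(top_k_neighbors(train, (0.9, 0.1), 2))  # -> [1, 0]
```

For datasets of the sizes above, brute force in pure Python is far too slow; the project precomputes ground truth with optimized code.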

Results

Interactive plots can be found at http://ann-benchmarks.com. These are all as of December 2021, running all benchmarks on an r5.4xlarge machine on AWS with --parallelism 7:

[Plots for glove-100-angular, sift-128-euclidean, fashion-mnist-784-euclidean, lastfm-64-dot, nytimes-256-angular, and glove-25-angular.]

Install

The only prerequisites are Python (tested with 3.6) and Docker.

  1. Clone the repo.
  2. Run pip install -r requirements.txt.
  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, like 10-30 minutes).

Running

  1. Run python run.py (this can take an extremely long time, potentially days).
  2. Run python plot.py or python create_website.py to plot results.

You can customize the algorithms and datasets if you want to:

  • Check that algos.yaml contains the parameter settings that you want to test
  • To run experiments on a particular dataset, invoke e.g. python run.py --dataset glove-100-angular. See python run.py --help for more information on possible settings. Note that experiments can take a long time.
  • To process the results, either use python plot.py --dataset glove-100-angular or python create_website.py. An example call: python create_website.py --plottype recall/time --latex --scatter --outputdir website/.

Including your algorithm

  1. Add your algorithm into ann_benchmarks/algorithms by providing a small Python wrapper.
  2. Add a Dockerfile for it in install/.
  3. Add it to algos.yaml.
  4. Add it to .github/workflows/benchmarks.yml.

Principles

  • Everyone is welcome to submit pull requests with tweaks and changes to how each library is being used.
  • In particular: if you are the author of any of these libraries, and you think the benchmark can be improved, consider making the improvement and submitting a pull request.
  • This is meant to be an ongoing project and represent the current state.
  • Make everything easy to replicate, including installing and preparing the datasets.
  • Try many different values of parameters for each library and ignore the points that are not on the precision-performance frontier.
  • High-dimensional datasets with approximately 100-1000 dimensions. This is challenging but also realistic. Not more than 1000 dimensions because those problems should probably be solved by doing dimensionality reduction separately.
  • Single queries are used by default. ANN-Benchmarks enforces that only one CPU is saturated during experimentation, i.e., no multi-threading. A batch mode is available that provides all queries to the implementations at once. Add the flag --batch to run.py and plot.py to enable batch mode.
  • Avoid extremely costly index building (more than several hours).
  • Focus on datasets that fit in RAM. For billion-scale benchmarks, see the related big-ann-benchmarks project.
  • We mainly support CPU-based ANN algorithms. GPU support exists for FAISS, but it has to be compiled with GPU support locally and experiments must be run using the flags --local --batch.
  • Do a proper train/test split of index data and query points.
  • Note that set-similarity datasets are treated as sparse: we pass a sorted array of integers to algorithms to represent each user's set.
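The frontier-filtering principle above can be sketched as follows: keep only those (recall, QPS) runs that no other run dominates on both axes (illustrative helper, not part of the repo):

```python
def pareto_frontier(points):
    """Keep only (recall, qps) pairs not dominated by another pair with
    at-least-equal recall and strictly higher QPS."""
    pts = sorted(points, key=lambda p: (-p[0], -p[1]))  # recall desc, qps desc
    frontier, best_qps = [], float("-inf")
    for recall, qps in pts:
        if qps > best_qps:        # beats every higher-recall point on speed
            frontier.append((recall, qps))
            best_qps = qps
    return frontier[::-1]         # ascending recall

runs = [(0.50, 5000), (0.80, 3000), (0.80, 1000), (0.95, 800), (0.99, 200)]
print(pareto_frontier(runs))  # -> [(0.5, 5000), (0.8, 3000), (0.95, 800), (0.99, 200)]
```

The (0.80, 1000) run is dropped because another run reaches the same recall at 3000 QPS.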

Authors

Built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.

Related Publication

The following publication details the design principles behind the benchmarking framework:

M. Aumüller, E. Bernhardsson, A. Faithfull: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems 2020.

Related Projects

Comments
  • use BLAS in brute force (via NumPy)


    Replace slow brute force algo (sklearn) by a direct, fast BLAS call.

    Make sure you have a fast BLAS installed -- a recent ATLAS, OpenBLAS, Intel's MKL, Apple's Accelerate, etc. This can make a huge difference in performance.

    PR not tested at all. @erikbern can you run it on the AWS machine? I just wrote the code, I didn't get a chance to run it (install assumes Debian). Sorry for any typos.

    opened by piskvorky 43
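The trick behind this PR is the algebraic identity ||x - q||^2 = ||x||^2 - 2 x·q + ||q||^2, which turns brute-force search into dot products that NumPy can hand off to BLAS as one matrix multiply. A pure-Python sketch of the identity (the actual speedup comes from computing the middle term as a single BLAS call):

```python
def squared_dists_via_dots(train, query):
    """Squared Euclidean distances via ||x||^2 - 2 x.q + ||q||^2.
    In NumPy the x.q terms become one BLAS matrix-vector product."""
    q_norm = sum(v * v for v in query)
    dists = []
    for x in train:
        x_norm = sum(v * v for v in x)
        dot = sum(a * b for a, b in zip(x, query))
        dists.append(x_norm - 2.0 * dot + q_norm)
    return dists

print(squared_dists_via_dots([(0.0, 0.0), (3.0, 4.0)], (0.0, 0.0)))  # -> [0.0, 25.0]
```

The ||x||^2 terms can also be precomputed once per index, so each query costs one matrix product plus an addition.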
  • Adding FALCONN


    I've added FALCONN (http://falconn-lib.org/) tuned for the GloVe dataset.

    Here are the results for GloVe on the dedicated c4.2xlarge instance: https://gitlab.com/ilyaraz/experiments/blob/master/glove_results/ec2_flat.txt (other numbers are taken from the upstream of ann-benchmarks).

    Here is the comparison of Annoy vs. FALCONN in one point on the same instance (to make sure that the numbers from the above make sense): http://pastebin.com/gpDVFQhm

    opened by ilyaraz 41
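FALCONN is based on locality-sensitive hashing. As a toy illustration of the idea for angular distance (not FALCONN's actual multiprobe scheme), hash each vector by the sign of its projection onto random hyperplanes; nearby directions tend to land in the same bucket:

```python
import random

def hyperplane_hash(planes, v):
    """Sign pattern of v against random hyperplanes; vectors at a small
    angle tend to agree on most bits, so equal hashes mark candidates."""
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0) for p in planes)

random.seed(0)
dim, n_bits = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = (1.0, 0.5, 0.0, 2.0)
opposite = tuple(-x for x in a)
scaled = tuple(2.0 * x for x in a)
print(hyperplane_hash(planes, a) == hyperplane_hash(planes, scaled))  # True: hash ignores magnitude
print(sum(b1 != b2 for b1, b2 in zip(hyperplane_hash(planes, a),
                                     hyperplane_hash(planes, opposite))))  # 8: flipping direction flips every bit
```

A real LSH index builds many such tables and only computes exact distances within matching buckets.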
  • Major refactoring & added features


    Hi Erik,

    As discussed, here are our changes in one very big pull request. A concise description of our work:

    • Moved the algorithm wrappers out of main.py and into separate files (see the ann_benchmarks/algorithms/ folder)
    • Added support for datasets containing different kinds of data point (see ann_benchmarks/data.py)
    • Added support for algorithm implementations running in external processes (see ann_benchmarks/algorithms/subprocess.py), the protocol directory, and the wrappers provided for annoy-hamming, dolphinn, and MIH in the install directory
    • Broke up the results files and created a directory hierarchy for them (see ann_benchmarks/results.py)
    • Made the results files store the complete results rather than precomputed quality metrics
    • Added an interactive HTML5 plot generator (see createwebsite.py)
    • Added new datasets (see install/data-*.sh)
    • Added new quality metrics (see ann_benchmarks/plotting/results.py)
    • Added new algorithms (see install/lib-*.sh)

    The only core change we have made is how we handle the ThreadPool when running the benchmark without the --single flag that forces single-core, single-thread operation. In our code, the time of each query is recorded individually, while the current ann-benchmarks records the time needed to answer all the queries at once. With our approach, multi-threading makes the experiments run faster, but the results are (almost) identical to a single-threaded run. We also added support for a batch mode, in which the algorithm takes care of handling all the queries and may use multi-threading as it sees fit. The batch mode was needed in particular for the FAISS GPU integration.

    The following image shows a comparison between single-core/single-thread (annoy), a thread pool with a single time measurement (annoy-pool), and a thread pool with individual query time measurements (annoy-pool-query-based). We clearly see a difference between the single and individual runs, but they keep the actual performance shape of the algorithm. The thread pool with a single measurement at the beginning and end seems to carry a lot of management overhead; only at very high recall do we see the speedup we would expect from parallelizing queries.

    Looking forward to your thoughts!

    opened by maumueller 31
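The per-query timing change described above can be sketched with the standard library (illustrative, not the repo's actual harness):

```python
import time

def run_queries(search_fn, queries):
    """Record each query's latency individually so percentiles and means can
    be computed, instead of one wall-clock measurement around the whole loop."""
    per_query = []
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        per_query.append(time.perf_counter() - t0)
    return {"total_s": sum(per_query), "mean_s": sum(per_query) / len(per_query),
            "per_query_s": per_query}

stats = run_queries(lambda q: sum(range(1000)), range(50))
print(len(stats["per_query_s"]))  # -> 50
```

Keeping the raw per-query times also allows recomputing aggregate metrics later without rerunning the experiment, which matches the PR's move to store complete results rather than precomputed quality metrics.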
  • Added ONNG results.


    After #92 I was interested to see how ONNG actually performs, and it performs quite well! (Much better than their other variant.) Find the updated result plot files in the PR and on http://ann-benchmarks.com.

    One problem with the plot script in this situation: The change done to PANNG in algos.yaml by https://github.com/erikbern/ann-benchmarks/commit/0b50eb2f149bd0dee343d8c8274b2d3b8d7827b8 made the PANNG result files inaccessible to plot.py, since their names changed. I dislike this tight connection between the algos.yaml file and the result files. The plotting done in create_website.py just enumerates all result files and plots them, which is much more robust. Maybe we should update plot.py to use the same approach?

    opened by maumueller 22
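The enumeration approach suggested at the end could look like this (directory layout and file extension are illustrative assumptions):

```python
import os, tempfile
from pathlib import Path

def collect_result_files(root):
    """Enumerate every result file under `root` rather than deriving names
    from algos.yaml, so renamed configurations are still picked up.
    (The *.hdf5 layout is an illustrative assumption.)"""
    return sorted(p.name for p in Path(root).rglob("*.hdf5"))

# Demo on a throwaway directory tree:
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "annoy"))
    for name in ("annoy/run1.hdf5", "run2.hdf5", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    found = collect_result_files(root)
print(found)  # -> ['run1.hdf5', 'run2.hdf5']
```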
  • Python 3.7 and fresh checkout from master: dependency installation issue


    Hello! I've tried installing dependencies under Python 3.7.10 and got the following output. What Python version is supported / recommended?

        Running setup.py install for numpy ... error
        ERROR: Command errored out with exit status 1:
         command: /Users/dmitrykan/project/ann-benchmarks/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"'; __file__='"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-record-wrmxmqjz/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dmitrykan/project/ann-benchmarks/venv/include/site/python3.7/numpy
             cwd: /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/
        Complete output (195 lines):
        Running from numpy source directory.
        
        Note: if you need reliable uninstall behavior, then install
        with pip instead of using `setup.py install`:
        
          - `pip install .`       (from a git repo or downloaded source
                                   release)
          - `pip install numpy`   (last NumPy release on PyPi)
        
        
        blas_opt_info:
        blas_mkl_info:
          libraries mkl_rt not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        blis_info:
          libraries blis not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        openblas_info:
          libraries openblas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_blas_threads_info:
        Setting PTATLAS=ATLAS
          libraries tatlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_blas_info:
          libraries satlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_blas_threads_info:
        Setting PTATLAS=ATLAS
          libraries ptf77blas,ptcblas,atlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_blas_info:
          libraries f77blas,cblas,atlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
          FOUND:
            extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
            extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
            define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
        
        /bin/sh: svnversion: command not found
        non-existing path in 'numpy/distutils': 'site.cfg'
        /bin/sh: svnversion: command not found
        F2PY Version 2
        lapack_opt_info:
        lapack_mkl_info:
          libraries mkl_rt not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        openblas_lapack_info:
          libraries openblas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_threads_info:
        Setting PTATLAS=ATLAS
          libraries tatlas,tatlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries tatlas,tatlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries tatlas,tatlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
          NOT AVAILABLE
        
        atlas_3_10_info:
          libraries satlas,satlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries satlas,satlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries satlas,satlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_3_10_info'>
          NOT AVAILABLE
        
        atlas_threads_info:
        Setting PTATLAS=ATLAS
          libraries ptf77blas,ptcblas,atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries ptf77blas,ptcblas,atlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries ptf77blas,ptcblas,atlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_threads_info'>
          NOT AVAILABLE
        
        atlas_info:
          libraries f77blas,cblas,atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries f77blas,cblas,atlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries f77blas,cblas,atlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_info'>
          NOT AVAILABLE
        
          FOUND:
            extra_compile_args = ['-msse3']
            extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
            define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
        
        /usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'define_macros'
          warnings.warn(msg)
        running install
        running build
        running config_cc
        unifing config_cc, config, build_clib, build_ext, build commands --compiler options
        running config_fc
        unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
        running build_src
        build_src
        building py_modules sources
        creating build
        creating build/src.macosx-11-x86_64-3.7
        creating build/src.macosx-11-x86_64-3.7/numpy
        creating build/src.macosx-11-x86_64-3.7/numpy/distutils
        building library "npymath" sources
        customize Gnu95FCompiler
        Found executable /usr/local/bin/gfortran
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py", line 392, in <module>
            setup_package()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py", line 384, in setup_package
            setup(**metadata)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/core.py", line 169, in setup
            return old_setup(**new_attr)
          File "/Users/dmitrykan/project/ann-benchmarks/venv/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
            return distutils.core.setup(**attrs)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/core.py", line 148, in setup
            dist.run_commands()
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 966, in run_commands
            self.run_command(cmd)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/install.py", line 62, in run
            r = self.setuptools_run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/install.py", line 36, in setuptools_run
            return distutils_install.run(self)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/install.py", line 545, in run
            self.run_command('build')
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build.py", line 47, in run
            old_build.run(self)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/build.py", line 135, in run
            self.run_command(cmd_name)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 148, in run
            self.build_sources()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 159, in build_sources
            self.build_library_sources(*libname_info)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 294, in build_library_sources
            sources = self.generate_sources(sources, (lib_name, build_info))
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 377, in generate_sources
            source = func(extension, build_dir)
          File "numpy/core/setup.py", line 672, in get_mathlib_info
            st = config_cmd.try_link('int main(void) { return 0;}')
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/config.py", line 243, in try_link
            self._check_compiler()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/config.py", line 81, in _check_compiler
            c_compiler=self.compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 842, in new_fcompiler
            c_compiler=c_compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 816, in get_default_fcompiler
            c_compiler=c_compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 765, in _find_existing_fcompiler
            c.customize(dist)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 521, in customize
            linker_so_flags = self.flag_vars.linker_so
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/environment.py", line 39, in __getattr__
            return self._get_var(name, conf_desc)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/environment.py", line 53, in _get_var
            var = self._hook_handler(name, hook)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 700, in _environment_hook
            return hook()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/gnu.py", line 309, in get_flags_linker_so
            flags = GnuFCompiler.get_flags_linker_so(self)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/gnu.py", line 138, in get_flags_linker_so
            os.environ['MACOSX_DEPLOYMENT_TARGET'] = target
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 686, in __setitem__
            value = self.encodevalue(value)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 756, in encode
            raise TypeError("str expected, not %s" % type(value).__name__)
        TypeError: str expected, not int
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /Users/dmitrykan/project/ann-benchmarks/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"'; __file__='"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-record-wrmxmqjz/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dmitrykan/project/ann-benchmarks/venv/include/site/python3.7/numpy Check the logs for full command output.
    
    opened by DmitryKey 21
  • Adding HNSW, updating SW-graph, increasing the number of queries


    Leo (@searchivarius) and I have added a new algorithm (Hierarchical NSW, HNSW) and updated the parameters and performance of the SW-graph algorithm (both from nmslib). Results of a comparison to FALCONN and Annoy on the same Amazon instance are attached below. A comparison on a Xeon E5-4650 v2 machine can be found in the HNSW preprint http://arxiv.org/abs/1603.09320. Note that SW-graph and HNSW indexes are now saved and reused to strongly reduce testing time.

    As an addition, we have increased the number of queries to 10K (5K might also be OK); see Leo's comments below:

    Erik is most welcome to use any other random_state.

    Last time, Leo was testing using random_state==1. However, it is best to use a new random seed for each major re-evaluation, so that we test on truly blind data.

    Increasing the number of queries or changing the seed won't make truly weak methods much faster. However, in my experience, this can easily result in 10-30% changes in performance, which may affect the relative ranking of methods with similar performance.


    opened by yurymalkov 21
  • Adds Open Distro Elastic Search's KNN plugin support. Closes #174.


    Adding Open Distro for Elasticsearch KNN plugin support by borrowing from the ES setup in elasticsearch and elastiknn. Below is a comparison on fashion-mnist between this opendistroknn and elastiknn, where opendistroknn has ~3X better queries/s at comparable recall.


    opened by stephenleo 18
  • New NearPy configurations, after fixing stuff in NearPy


    I changed two things in NearPy. The really bad one was that no UniqueFilter was used by default. That caused the same indexed vector to appear multiple times in the search candidate set. After computing the distance to the query vector and sorting by it, in many cases the cropped N=10 nearest-neighbour set contained the same vector multiple times. Distance computation was also more expensive because of the duplicates. The other change is that I now store normalized vectors in the index, which speeds up distance computation a lot.

    Both changes are released via pypi in NearPy 0.2.2

    I also ran your benchmark code with a reduced dataset to look for good parameters.

    This example with just 50000 vectors shows that the above fixes helped a lot so far. Especially the missing UniqueFilter created an upper bound on the accuracy before.


    opened by pixelogik 18
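The two fixes can be illustrated together: deduplicate candidate ids before ranking (what a UniqueFilter does), and store normalized vectors so cosine ranking is a plain dot product. A hypothetical sketch, not NearPy's actual code:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def top_k_unique(candidates, query, k):
    """Collapse duplicate candidate ids first (the UniqueFilter fix), then rank
    by dot product on normalized vectors (cosine without re-normalizing)."""
    q = normalize(query)
    unique = {}
    for idx, vec in candidates:
        unique.setdefault(idx, normalize(vec))   # duplicates are dropped here
    ranked = sorted(unique, key=lambda i: -sum(a * b for a, b in zip(unique[i], q)))
    return ranked[:k]

cands = [(0, (1.0, 0.0)), (0, (1.0, 0.0)), (1, (0.0, 1.0)), (2, (1.0, 1.0))]
print(top_k_unique(cands, (1.0, 0.1), 2))  # -> [0, 2]
```

Without deduplication, id 0 would occupy two of the k result slots.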
  • Add support for reusing built data structures


    This commit adds support for reconfiguring an already-built data structure with new query parameters, making it possible to reuse one to perform many different experiments. (This was discussed in issue #41.)

    opened by ale-f 17
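The pattern from this PR, building the index once and only swapping query-time parameters between runs, can be sketched as below (a toy wrapper; the project's real interface may differ):

```python
class ReusableIndex:
    """Build once, reconfigure query-time parameters between experiments
    (toy wrapper; the project's actual interface may differ)."""
    def __init__(self, data):
        self.data = list(data)          # the expensive build happens once
        self.search_budget = 10

    def set_query_arguments(self, search_budget):
        self.search_budget = search_budget   # cheap: no rebuild needed

    def query(self, q, k):
        # toy "search": scan only the first `search_budget` points
        scan = self.data[: self.search_budget]
        order = sorted(range(len(scan)), key=lambda i: abs(scan[i] - q))
        return order[:k]

idx = ReusableIndex(range(100))
idx.set_query_arguments(5)
print(idx.query(42, 1))   # -> [4]  (small budget misses the true neighbor)
idx.set_query_arguments(100)
print(idx.query(42, 1))   # -> [42] (full budget finds it)
```

Sweeping query parameters this way avoids paying the index-build cost once per parameter setting.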
  • Attempt at adding Milvus


    A bit janky, since (a) Milvus is a server process, so I hacked together an entrypoint file to run a daemon, and (b) it was impossible to build from source, but I realized I can just copy the relevant binaries from the "official" Dockerfile and it seems to work.

    Running some benchmarks now, will see what we get

    opened by erikbern 16
  • Updated kgraph parameters, etc.


    • A command line argument --algo was added so one can run just a single algorithm instead of all of them.
    • Updated the kgraph configuration for the two datasets separately.


    opened by aaalgo 16
  • Can we use ann-benchmarks to benchmark the All-Nearest-Neighbors problem?


    The All-Nearest-Neighbors problem (All-NN) asks us to find all pairs of nearest neighbors in a set of n points in d dimensions. Classically, there is a deterministic O(n log n) algorithm, as well as a stochastic O(n sqrt(n)) algorithm.

    I would like to test the All-NN performance of algorithms like Faiss. According to a Google insider, the closed version of Faiss does have a mode that solves All-NN. However, I don't know if the public version does.

    I suspect it may, and I suspect that ann-benchmarks has a mode that can effectively test All-NN performance via its batch mode. However, I don't understand the terminology well enough to determine whether batch mode is equivalent to solving All-NN. Can it?

    opened by maxsu 1
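Batch mode can in principle emulate All-NN by querying the index with the dataset itself and discarding each point's self-match. A brute-force reference sketch of what that computes:

```python
import math

def all_nearest_neighbors(points):
    """For each point, index of its nearest *other* point: what you'd get by
    batch-querying the index with the dataset itself and dropping self-matches."""
    return [min((j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(points[j], p))
            for i, p in enumerate(points)]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
print(all_nearest_neighbors(pts))  # -> [1, 0, 3, 2]
```

With an ANN index, one would instead request k+1 neighbors per point and drop the self-match, at the cost of approximate answers.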
  • Upgrade Ubuntu and Python


    The repo is on Ubuntu 18 and Python 3.6.

    I'm having trouble upgrading the Elastiknn benchmarks because the latest Elasticsearch Ubuntu installer relies on some things that don't seem to be in Ubuntu 18. I've also jumped through some hoops with Python, e.g., intentionally installing older versions of scipy for compatibility with 3.6.

    Any reason not to upgrade? Any dragons to be aware of in case I take a first pass at it?

    opened by alexklibisz 2
  • category filter


    Would it be in the spirit of these benchmarks to add a second benchmark category for ANN in conjunction with categorical filters?

    Most real-world applications of ANN will require category filtering, e.g. when searching for clothes in an e-commerce scenario via ANN one might filter by gender (categorical) or availability (categorical).

    There are several software products which allow combining ANN and category filters, e.g. Apache Solr, Elasticsearch, Vertex AI Matching Engine, Weaviate, and Qdrant. However, they mainly differ from these benchmarks in that they are managed services, or at least services, rather than embeddable libraries.

    In addition to recall vs. queries per second, there should be a view that filters down to a fraction of the data vs. recall vs. queries per second. For the proprietary managed services, a cost dimension might also be useful.

    I have found the following blog articles covering the topic:

    • https://blog.vasnetsov.com/posts/categorical-hnsw/
    • https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
    • https://towardsdatascience.com/using-approximate-nearest-neighbor-search-in-real-world-applications-a75c351445d

    Given that this would be very useful for practical implementations, but also that it significantly complicates the benchmarks, I would be interested in your opinion and/or how I could help with it. It would also be great to know if someone has already done such benchmarks.
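    As a sketch of the simplest approach discussed in those articles (post-filtering with over-fetching), with a brute-force scan standing in for an ANN index; all names and parameters here are illustrative:

```python
import numpy as np

def filtered_search(query, vectors, categories, allowed, k=10, overfetch=4):
    """Post-filtering sketch: over-fetch candidates from a nearest-neighbor
    search (brute force here, standing in for an ANN index), then keep only
    hits whose category is in the allowed set."""
    d = np.linalg.norm(vectors - query, axis=1)
    candidates = np.argsort(d)[: k * overfetch]   # over-fetch k * overfetch hits
    hits = [int(i) for i in candidates if categories[i] in allowed]
    return hits[:k]

vecs = np.array([[0.0], [1.0], [2.0], [3.0]])
cats = ["shoe", "shirt", "shoe", "shirt"]
print(filtered_search(np.array([0.0]), vecs, cats, {"shirt"}, k=2))  # → [1, 3]
```

    Post-filtering loses recall when the allowed categories are rare among the candidates, which is exactly why the proposed "fraction of the data vs. recall vs. QPS" view would be informative.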

    opened by gtsoukas 3
  • Relative error metric

    Relative error metric

    I was wondering if there is a way of defining what a “reasonable” relative error is, something similar to recall 0.9, 0.95, 0.99. Furthermore, how much is this metric affected by the distance metric? If you have any experience with this metric, I'll be happy to hear about it. Thanks in advance, Guy
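    For reference, one common definition (a sketch, not necessarily the exact formula used by these benchmarks) takes the ratio of each returned neighbor's distance to the true nearest neighbor's distance, minus one, averaged over queries:

```python
import numpy as np

def mean_relative_error(approx_dists, true_dists, eps=1e-12):
    """Mean relative error of approximate neighbor distances against
    ground-truth distances: mean of d_approx / d_true - 1 over queries."""
    ratio = np.asarray(approx_dists) / (np.asarray(true_dists) + eps)
    return float(np.mean(ratio - 1.0))

# An exact result scores 0; a neighbor 5% too far scores 0.05.
print(mean_relative_error([1.0, 2.1], [1.0, 2.0]))  # ≈ 0.025
```

    The distance metric clearly matters here: angular distances are bounded while Euclidean distances are not, so the same numeric error can mean quite different things across metrics.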

    opened by GuyAv46 0
  • Evaluate Elasticsearch approximate kNN API

    Evaluate Elasticsearch approximate kNN API

    This was recently released in Elasticsearch 8.0+ and the API is documented here

    Elasticsearch is distinct in both implementation and product from OpenSearch (the Amazon-maintained fork), and I expect that the two APIs will have different results. Both are HNSW-based.

    Thanks for maintaining this repo!
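    For context, a minimal request body for the Elasticsearch 8.x approximate kNN search, sent as `POST /my-index/_search` (index name, field name, and vector values are placeholders):

```json
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, -0.45, 0.91],
    "k": 10,
    "num_candidates": 100
  }
}
```

    `num_candidates` plays roughly the same accuracy/speed role as HNSW's `ef` search parameter, so it would presumably be the knob a benchmark run sweeps.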

    opened by rw-access 1
Owner
Erik Bernhardsson
Working on some weird ideas for data infra. Ex-CTO at better.com, likes to open source stuff sometimes and write random blog posts.