Overview

Benchmarking nearest neighbors


Doing fast searching of nearest neighbors in high dimensional spaces is an increasingly important problem, but so far there have not been many empirical attempts at comparing approaches in an objective way.

This project contains tools to benchmark various implementations of approximate nearest neighbor (ANN) search under different metrics. We have pre-generated datasets (in HDF5 format) and Docker containers for each algorithm, along with a test suite that verifies every algorithm works.

Evaluated

Data sets

We have a number of precomputed data sets for this. All data sets are pre-split into train/test and come with ground truth data in the form of the top 100 neighbors. We store them in HDF5 format:

Dataset        Dimensions  Train size  Test size  Neighbors  Distance   Download
DEEP1B                 96   9,990,000     10,000        100  Angular    HDF5 (3.6GB)
Fashion-MNIST         784      60,000     10,000        100  Euclidean  HDF5 (217MB)
GIST                  960   1,000,000      1,000        100  Euclidean  HDF5 (3.6GB)
GloVe                  25   1,183,514     10,000        100  Angular    HDF5 (121MB)
GloVe                  50   1,183,514     10,000        100  Angular    HDF5 (235MB)
GloVe                 100   1,183,514     10,000        100  Angular    HDF5 (463MB)
GloVe                 200   1,183,514     10,000        100  Angular    HDF5 (918MB)
Kosarak            27,983      74,962        500        100  Jaccard    HDF5 (2.0GB)
MNIST                 784      60,000     10,000        100  Euclidean  HDF5 (217MB)
NYTimes               256     290,000     10,000        100  Angular    HDF5 (301MB)
SIFT                  128   1,000,000     10,000        100  Euclidean  HDF5 (501MB)
Last.fm                65     292,385     50,000        100  Angular    HDF5 (135MB)
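Each file's ground truth is just the exact top-100 neighbors under the stated metric. As a rough illustration (not the project's actual tooling), exact top-k neighbors can be computed by brute force; a pure-Python sketch for the two most common metrics above:

```python
import math

def top_k_neighbors(train, query, k, metric="euclidean"):
    """Exact k nearest neighbors of `query` in `train` by brute force."""
    def dist(a, b):
        if metric == "euclidean":
            return math.dist(a, b)
        # "angular": 1 - cosine similarity
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: math.sqrt(sum(x * x for x in v))
        return 1.0 - dot / (norm(a) * norm(b))
    order = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    return order[:k]

train = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
print(top_k_neighbors(train, (0.9, 0.1), 2))  # -> [1, 0]
```

For datasets of the sizes above, brute force in pure Python is far too slow; the project precomputes ground truth with optimized code.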

Results

Interactive plots can be found at http://ann-benchmarks.com. These are all as of December 2021, running all benchmarks on an r5.4xlarge machine on AWS with --parallelism 7:

[Plots for glove-100-angular, sift-128-euclidean, fashion-mnist-784-euclidean, lastfm-64-dot, nytimes-256-angular, and glove-25-angular.]

Install

The only prerequisites are Python (tested with 3.6) and Docker.

  1. Clone the repo.
  2. Run pip install -r requirements.txt.
  3. Run python install.py to build all the libraries inside Docker containers (this can take a while, like 10-30 minutes).

Running

  1. Run python run.py (this can take an extremely long time, potentially days).
  2. Run python plot.py or python create_website.py to plot results.

You can customize the algorithms and datasets if you want to:

  • Check that algos.yaml contains the parameter settings that you want to test
  • To run experiments on a particular dataset, invoke e.g. python run.py --dataset glove-100-angular. See python run.py --help for more information on possible settings. Note that experiments can take a long time.
  • To process the results, either use python plot.py --dataset glove-100-angular or python create_website.py. An example call: python create_website.py --plottype recall/time --latex --scatter --outputdir website/.

Including your algorithm

  1. Add your algorithm into ann_benchmarks/algorithms by providing a small Python wrapper.
  2. Add a Dockerfile for it in install/.
  3. Add it to algos.yaml.
  4. Add it to .github/workflows/benchmarks.yml.

Principles

  • Everyone is welcome to submit pull requests with tweaks and changes to how each library is being used.
  • In particular: if you are the author of any of these libraries, and you think the benchmark can be improved, consider making the improvement and submitting a pull request.
  • This is meant to be an ongoing project and represent the current state.
  • Make everything easy to replicate, including installing and preparing the datasets.
  • Try many different values of parameters for each library and ignore the points that are not on the precision-performance frontier.
  • High-dimensional datasets with approximately 100-1000 dimensions. This is challenging but also realistic. Not more than 1000 dimensions because those problems should probably be solved by doing dimensionality reduction separately.
  • Single queries are used by default. ANN-Benchmarks enforces that only one CPU is saturated during experimentation, i.e., no multi-threading. A batch mode is available that provides all queries to the implementations at once. Add the flag --batch to run.py and plot.py to enable batch mode.
  • Avoid extremely costly index building (more than several hours).
  • Focus on datasets that fit in RAM. For billion-scale benchmarks, see the related big-ann-benchmarks project.
  • We mainly support CPU-based ANN algorithms. GPU support exists for FAISS, but it has to be compiled with GPU support locally and experiments must be run using the flags --local --batch.
  • Do a proper train/test split of index data and query points.
  • Note that set-similarity datasets are treated as sparse: we pass a sorted array of integers to algorithms to represent each user's set.
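The frontier-filtering principle above can be sketched as follows: keep only those (recall, QPS) runs that no other run dominates on both axes (illustrative helper, not part of the repo):

```python
def pareto_frontier(points):
    """Keep only (recall, qps) pairs not dominated by another pair with
    at-least-equal recall and strictly higher QPS."""
    pts = sorted(points, key=lambda p: (-p[0], -p[1]))  # recall desc, qps desc
    frontier, best_qps = [], float("-inf")
    for recall, qps in pts:
        if qps > best_qps:        # beats every higher-recall point on speed
            frontier.append((recall, qps))
            best_qps = qps
    return frontier[::-1]         # ascending recall

runs = [(0.50, 5000), (0.80, 3000), (0.80, 1000), (0.95, 800), (0.99, 200)]
print(pareto_frontier(runs))  # -> [(0.5, 5000), (0.8, 3000), (0.95, 800), (0.99, 200)]
```

The (0.80, 1000) run is dropped because another run reaches the same recall at 3000 QPS.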

Authors

Built by Erik Bernhardsson with significant contributions from Martin Aumüller and Alexander Faithfull.

Related Publication

The following publication details the design principles behind the benchmarking framework:

M. Aumüller, E. Bernhardsson, A. Faithfull: ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms. Information Systems 2020.

Related Projects

Comments
  • use BLAS in brute force (via NumPy)


    Replace slow brute force algo (sklearn) by a direct, fast BLAS call.

    Make sure you have a fast BLAS installed -- a recent ATLAS, OpenBLAS, Intel's MKL, Apple's Accelerate, etc. This can make a huge difference in performance.

    PR not tested at all. @erikbern can you run it on the AWS machine? I just wrote the code, I didn't get a chance to run it (install assumes Debian). Sorry for any typos.

    opened by piskvorky 43
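The trick behind this PR is the algebraic identity ||x - q||^2 = ||x||^2 - 2 x·q + ||q||^2, which turns brute-force search into dot products that NumPy can hand off to BLAS as one matrix multiply. A pure-Python sketch of the identity (the actual speedup comes from computing the middle term as a single BLAS call):

```python
def squared_dists_via_dots(train, query):
    """Squared Euclidean distances via ||x||^2 - 2 x.q + ||q||^2.
    In NumPy the x.q terms become one BLAS matrix-vector product."""
    q_norm = sum(v * v for v in query)
    dists = []
    for x in train:
        x_norm = sum(v * v for v in x)
        dot = sum(a * b for a, b in zip(x, query))
        dists.append(x_norm - 2.0 * dot + q_norm)
    return dists

print(squared_dists_via_dots([(0.0, 0.0), (3.0, 4.0)], (0.0, 0.0)))  # -> [0.0, 25.0]
```

The ||x||^2 terms can also be precomputed once per index, so each query costs one matrix product plus an addition.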
  • Adding FALCONN


    I've added FALCONN (http://falconn-lib.org/) tuned for the GloVe dataset.

    Here are the results for GloVe on the dedicated c4.2xlarge instance: https://gitlab.com/ilyaraz/experiments/blob/master/glove_results/ec2_flat.txt (other numbers are taken from the upstream of ann-benchmarks).

    Here is the comparison of Annoy vs. FALCONN in one point on the same instance (to make sure that the numbers from the above make sense): http://pastebin.com/gpDVFQhm

    opened by ilyaraz 41
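FALCONN is based on locality-sensitive hashing. As a toy illustration of the idea for angular distance (not FALCONN's actual multiprobe scheme), hash each vector by the sign of its projection onto random hyperplanes; nearby directions tend to land in the same bucket:

```python
import random

def hyperplane_hash(planes, v):
    """Sign pattern of v against random hyperplanes; vectors at a small
    angle tend to agree on most bits, so equal hashes mark candidates."""
    return tuple(int(sum(p_i * v_i for p_i, v_i in zip(p, v)) >= 0) for p in planes)

random.seed(0)
dim, n_bits = 4, 8
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

a = (1.0, 0.5, 0.0, 2.0)
opposite = tuple(-x for x in a)
scaled = tuple(2.0 * x for x in a)
print(hyperplane_hash(planes, a) == hyperplane_hash(planes, scaled))  # True: hash ignores magnitude
print(sum(b1 != b2 for b1, b2 in zip(hyperplane_hash(planes, a),
                                     hyperplane_hash(planes, opposite))))  # 8: flipping direction flips every bit
```

A real LSH index builds many such tables and only computes exact distances within matching buckets.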
  • Major refactoring & added features


    Hi Erik,

    As discussed, here are our changes in one very big pull request. A concise description of our work:

    • Moved the algorithm wrappers out of main.py and into separate files (see the ann_benchmarks/algorithms/ folder)
    • Added support for datasets containing different kinds of data point (see ann_benchmarks/data.py)
    • Added support for algorithm implementations running in external processes (see ann_benchmarks/algorithms/subprocess.py), the protocol directory, and the wrappers provided for annoy-hamming, dolphinn, and MIH in the install directory
    • Broke up the results files and created a directory hierarchy for them (see ann_benchmarks/results.py)
    • Made the results files store the complete results rather than precomputed quality metrics
    • Added an interactive HTML5 plot generator (see createwebsite.py)
    • Added new datasets (see install/data-*.sh)
    • Added new quality metrics (see ann_benchmarks/plotting/results.py)
    • Added new algorithms (see install/lib-*.sh)

    The only core change we have made is how we handle the ThreadPool when running the benchmark without the --single flag that forces single-core, single-thread operation. In our code, the time of each query is recorded individually, while the current ann-benchmarks records the time needed to answer all the queries at once. With our approach, multi-threading makes the experiments run faster, but the results are (almost) identical to a single-threaded run. We also added support for a batch mode, in which the algorithm takes care of handling all the queries and may use multi-threading as it sees fit. The batch mode was needed in particular for the FAISS GPU integration.

    The following image shows a comparison between single-core/single-thread (annoy), a thread pool with a single time measurement (annoy-pool), and a thread pool with individual query time measurements (annoy-pool-query-based). We clearly see a difference between the single and individual runs, but they keep the actual performance shape of the algorithm. The thread pool with a single measurement at the beginning and end seems to carry a lot of management overhead; only at very high recall do we see the speedup we would expect from parallelizing queries.

    Looking forward to your thoughts!

    opened by maumueller 31
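The per-query timing change described above can be sketched with the standard library (illustrative, not the repo's actual harness):

```python
import time

def run_queries(search_fn, queries):
    """Record each query's latency individually so percentiles and means can
    be computed, instead of one wall-clock measurement around the whole loop."""
    per_query = []
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)
        per_query.append(time.perf_counter() - t0)
    return {"total_s": sum(per_query), "mean_s": sum(per_query) / len(per_query),
            "per_query_s": per_query}

stats = run_queries(lambda q: sum(range(1000)), range(50))
print(len(stats["per_query_s"]))  # -> 50
```

Keeping the raw per-query times also allows recomputing aggregate metrics later without rerunning the experiment, which matches the PR's move to store complete results rather than precomputed quality metrics.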
  • Added ONNG results.


    After #92 I was interested to see how ONNG actually performs, and it performs quite well! (Much better than their other variant.) Find the updated result plot files in the PR and on http://ann-benchmarks.com.

    One problem with the plot script in this situation: The change done to PANNG in algos.yaml by https://github.com/erikbern/ann-benchmarks/commit/0b50eb2f149bd0dee343d8c8274b2d3b8d7827b8 made the PANNG result files inaccessible to plot.py, since their names changed. I dislike this tight connection between the algos.yaml file and the result files. The plotting done in create_website.py just enumerates all result files and plots them, which is much more robust. Maybe we should update plot.py to use the same approach?

    opened by maumueller 22
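The enumeration approach suggested at the end could look like this (directory layout and file extension are illustrative assumptions):

```python
import os, tempfile
from pathlib import Path

def collect_result_files(root):
    """Enumerate every result file under `root` rather than deriving names
    from algos.yaml, so renamed configurations are still picked up.
    (The *.hdf5 layout is an illustrative assumption.)"""
    return sorted(p.name for p in Path(root).rglob("*.hdf5"))

# Demo on a throwaway directory tree:
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "annoy"))
    for name in ("annoy/run1.hdf5", "run2.hdf5", "notes.txt"):
        open(os.path.join(root, name), "w").close()
    found = collect_result_files(root)
print(found)  # -> ['run1.hdf5', 'run2.hdf5']
```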
  • Python 3.7 and fresh checkout from master: dependency installation issue


    Hello! I've tried installing dependencies under Python 3.7.10 and got the following output. What Python version is supported / recommended?

        Running setup.py install for numpy ... error
        ERROR: Command errored out with exit status 1:
         command: /Users/dmitrykan/project/ann-benchmarks/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"'; __file__='"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-record-wrmxmqjz/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dmitrykan/project/ann-benchmarks/venv/include/site/python3.7/numpy
             cwd: /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/
        Complete output (195 lines):
        Running from numpy source directory.
        
        Note: if you need reliable uninstall behavior, then install
        with pip instead of using `setup.py install`:
        
          - `pip install .`       (from a git repo or downloaded source
                                   release)
          - `pip install numpy`   (last NumPy release on PyPi)
        
        
        blas_opt_info:
        blas_mkl_info:
          libraries mkl_rt not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        blis_info:
          libraries blis not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        openblas_info:
          libraries openblas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_blas_threads_info:
        Setting PTATLAS=ATLAS
          libraries tatlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_blas_info:
          libraries satlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_blas_threads_info:
        Setting PTATLAS=ATLAS
          libraries ptf77blas,ptcblas,atlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_blas_info:
          libraries f77blas,cblas,atlas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
          FOUND:
            extra_compile_args = ['-msse3', '-I/System/Library/Frameworks/vecLib.framework/Headers']
            extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
            define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
        
        /bin/sh: svnversion: command not found
        non-existing path in 'numpy/distutils': 'site.cfg'
        /bin/sh: svnversion: command not found
        F2PY Version 2
        lapack_opt_info:
        lapack_mkl_info:
          libraries mkl_rt not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        openblas_lapack_info:
          libraries openblas not found in ['/Users/dmitrykan/project/ann-benchmarks/venv/lib', '/usr/local/lib', '/usr/lib']
          NOT AVAILABLE
        
        atlas_3_10_threads_info:
        Setting PTATLAS=ATLAS
          libraries tatlas,tatlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries tatlas,tatlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries tatlas,tatlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_3_10_threads_info'>
          NOT AVAILABLE
        
        atlas_3_10_info:
          libraries satlas,satlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries satlas,satlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries satlas,satlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_3_10_info'>
          NOT AVAILABLE
        
        atlas_threads_info:
        Setting PTATLAS=ATLAS
          libraries ptf77blas,ptcblas,atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries ptf77blas,ptcblas,atlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries ptf77blas,ptcblas,atlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_threads_info'>
          NOT AVAILABLE
        
        atlas_info:
          libraries f77blas,cblas,atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries lapack_atlas not found in /Users/dmitrykan/project/ann-benchmarks/venv/lib
          libraries f77blas,cblas,atlas not found in /usr/local/lib
          libraries lapack_atlas not found in /usr/local/lib
          libraries f77blas,cblas,atlas not found in /usr/lib
          libraries lapack_atlas not found in /usr/lib
        <class 'numpy.distutils.system_info.atlas_info'>
          NOT AVAILABLE
        
          FOUND:
            extra_compile_args = ['-msse3']
            extra_link_args = ['-Wl,-framework', '-Wl,Accelerate']
            define_macros = [('NO_ATLAS_INFO', 3), ('HAVE_CBLAS', None)]
        
        /usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py:274: UserWarning: Unknown distribution option: 'define_macros'
          warnings.warn(msg)
        running install
        running build
        running config_cc
        unifing config_cc, config, build_clib, build_ext, build commands --compiler options
        running config_fc
        unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
        running build_src
        build_src
        building py_modules sources
        creating build
        creating build/src.macosx-11-x86_64-3.7
        creating build/src.macosx-11-x86_64-3.7/numpy
        creating build/src.macosx-11-x86_64-3.7/numpy/distutils
        building library "npymath" sources
        customize Gnu95FCompiler
        Found executable /usr/local/bin/gfortran
        Traceback (most recent call last):
          File "<string>", line 1, in <module>
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py", line 392, in <module>
            setup_package()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py", line 384, in setup_package
            setup(**metadata)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/core.py", line 169, in setup
            return old_setup(**new_attr)
          File "/Users/dmitrykan/project/ann-benchmarks/venv/lib/python3.7/site-packages/setuptools/__init__.py", line 153, in setup
            return distutils.core.setup(**attrs)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/core.py", line 148, in setup
            dist.run_commands()
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 966, in run_commands
            self.run_command(cmd)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/install.py", line 62, in run
            r = self.setuptools_run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/install.py", line 36, in setuptools_run
            return distutils_install.run(self)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/install.py", line 545, in run
            self.run_command('build')
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build.py", line 47, in run
            old_build.run(self)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/build.py", line 135, in run
            self.run_command(cmd_name)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/cmd.py", line 313, in run_command
            self.distribution.run_command(command)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/dist.py", line 985, in run_command
            cmd_obj.run()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 148, in run
            self.build_sources()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 159, in build_sources
            self.build_library_sources(*libname_info)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 294, in build_library_sources
            sources = self.generate_sources(sources, (lib_name, build_info))
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/build_src.py", line 377, in generate_sources
            source = func(extension, build_dir)
          File "numpy/core/setup.py", line 672, in get_mathlib_info
            st = config_cmd.try_link('int main(void) { return 0;}')
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/distutils/command/config.py", line 243, in try_link
            self._check_compiler()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/command/config.py", line 81, in _check_compiler
            c_compiler=self.compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 842, in new_fcompiler
            c_compiler=c_compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 816, in get_default_fcompiler
            c_compiler=c_compiler)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 765, in _find_existing_fcompiler
            c.customize(dist)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 521, in customize
            linker_so_flags = self.flag_vars.linker_so
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/environment.py", line 39, in __getattr__
            return self._get_var(name, conf_desc)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/environment.py", line 53, in _get_var
            var = self._hook_handler(name, hook)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/__init__.py", line 700, in _environment_hook
            return hook()
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/gnu.py", line 309, in get_flags_linker_so
            flags = GnuFCompiler.get_flags_linker_so(self)
          File "/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/numpy/distutils/fcompiler/gnu.py", line 138, in get_flags_linker_so
            os.environ['MACOSX_DEPLOYMENT_TARGET'] = target
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 686, in __setitem__
            value = self.encodevalue(value)
          File "/usr/local/Cellar/[email protected]/3.7.10_2/Frameworks/Python.framework/Versions/3.7/lib/python3.7/os.py", line 756, in encode
            raise TypeError("str expected, not %s" % type(value).__name__)
        TypeError: str expected, not int
        ----------------------------------------
    ERROR: Command errored out with exit status 1: /Users/dmitrykan/project/ann-benchmarks/venv/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"'; __file__='"'"'/private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-install-l_rkp2wp/numpy_7ba5c87712f044d5af4cf340e9f6bc24/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/2l/f04dpd917vx50cyl8fftcxzr0000gn/T/pip-record-wrmxmqjz/install-record.txt --single-version-externally-managed --compile --install-headers /Users/dmitrykan/project/ann-benchmarks/venv/include/site/python3.7/numpy Check the logs for full command output.
    
    opened by DmitryKey 21
  • Adding HNSW, updating SW-graph, increasing the number of queries


    Leo (@searchivarius) and I have added a new algorithm (Hierarchical NSW, HNSW) and updated the parameters and performance of the SW-graph algorithm (both from nmslib). Results of a comparison to FALCONN and Annoy on the same Amazon instance are attached below. A comparison on a Xeon E5-4650 v2 machine can be found in the HNSW preprint http://arxiv.org/abs/1603.09320. Note that SW-graph and HNSW indexes are now saved and reused to strongly reduce testing time.

    As an addition, we have increased the number of queries to 10K (5K might also be OK); see Leo's comments below:

    Erik is most welcome to use any other random_state.

    Last time, Leo was testing using random_state==1. However, it is best to use a new random seed for each major re-evaluation, so that we test on truly blind data.

    Increasing the number of queries or changing the seed won't make truly weak methods much faster. However, in my experience, this can easily result in 10-30% changes in performance, which may affect the relative ranking of methods with similar performance.


    opened by yurymalkov 21
  • Adds Open Distro Elastic Search's KNN plugin support. Closes #174.


    Adding Open Distro for Elasticsearch KNN plugin support by borrowing from the ES setup in elasticsearch and elastiknn. Below is a comparison on fashion-mnist between this opendistroknn and elastiknn, where opendistroknn has ~3X better queries/s at comparable recall.


    opened by stephenleo 18
  • New NearPy configurations, after fixing stuff in NearPy


    I changed two things in NearPy. The really bad one was that no UniqueFilter was used by default. That caused the same indexed vector to appear multiple times in the search candidate set. After computing the distance to the query vector and sorting by it, in many cases the cropped N=10 nearest-neighbour set contained the same vector multiple times. Distance computation was also more expensive because of the duplicates. The other change is that I now store normalized vectors in the index, which speeds up distance computation a lot.

    Both changes are released via pypi in NearPy 0.2.2

    I also ran your benchmark code with a reduced dataset to look for good parameters.

    This example with just 50000 vectors shows that the above fixes helped a lot so far. Especially the missing UniqueFilter created an upper bound on the accuracy before.


    opened by pixelogik 18
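The two fixes can be illustrated together: deduplicate candidate ids before ranking (what a UniqueFilter does), and store normalized vectors so cosine ranking is a plain dot product. A hypothetical sketch, not NearPy's actual code:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def top_k_unique(candidates, query, k):
    """Collapse duplicate candidate ids first (the UniqueFilter fix), then rank
    by dot product on normalized vectors (cosine without re-normalizing)."""
    q = normalize(query)
    unique = {}
    for idx, vec in candidates:
        unique.setdefault(idx, normalize(vec))   # duplicates are dropped here
    ranked = sorted(unique, key=lambda i: -sum(a * b for a, b in zip(unique[i], q)))
    return ranked[:k]

cands = [(0, (1.0, 0.0)), (0, (1.0, 0.0)), (1, (0.0, 1.0)), (2, (1.0, 1.0))]
print(top_k_unique(cands, (1.0, 0.1), 2))  # -> [0, 2]
```

Without deduplication, id 0 would occupy two of the k result slots.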
  • Add support for reusing built data structures


    This commit adds support for reconfiguring an already-built data structure with new query parameters, making it possible to reuse one to perform many different experiments. (This was discussed in issue #41.)

    opened by ale-f 17
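The pattern from this PR, building the index once and only swapping query-time parameters between runs, can be sketched as below (a toy wrapper; the project's real interface may differ):

```python
class ReusableIndex:
    """Build once, reconfigure query-time parameters between experiments
    (toy wrapper; the project's actual interface may differ)."""
    def __init__(self, data):
        self.data = list(data)          # the expensive build happens once
        self.search_budget = 10

    def set_query_arguments(self, search_budget):
        self.search_budget = search_budget   # cheap: no rebuild needed

    def query(self, q, k):
        # toy "search": scan only the first `search_budget` points
        scan = self.data[: self.search_budget]
        order = sorted(range(len(scan)), key=lambda i: abs(scan[i] - q))
        return order[:k]

idx = ReusableIndex(range(100))
idx.set_query_arguments(5)
print(idx.query(42, 1))   # -> [4]  (small budget misses the true neighbor)
idx.set_query_arguments(100)
print(idx.query(42, 1))   # -> [42] (full budget finds it)
```

Sweeping query parameters this way avoids paying the index-build cost once per parameter setting.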
  • Attempt at adding Milvus


    A bit janky, since (a) Milvus is a server process, so I hacked together an entrypoint file to run a daemon, and (b) it was impossible to build from source, but I realized I can just copy the relevant binaries from the "official" Dockerfile and it seems to work.

    Running some benchmarks now, will see what we get

    opened by erikbern 16
  • Updated kgraph parameters, etc.


    • A command line argument --algo was added so one can run just a single algorithm instead of all of them.
    • Updated the kgraph configuration for the two datasets separately.


    opened by aaalgo 16
  • Can we use ann-benchmarks to benchmark the All-Nearest-Neighbors problem?


    The All-Nearest-Neighbors problem (All-NN) asks us to find all pairs of nearest neighbors in a set of n points in d dimensions. Classically, there is a deterministic O(n log n) algorithm, as well as a stochastic O(n sqrt(n)) algorithm.

    I would like to test the All-NN performance of algorithms like Faiss. According to a Google insider, the closed version of Faiss does have a mode that solves All-NN. However, I don't know if the public version does.

    I suspect it may, and I suspect that ann-benchmarks has a mode that can effectively test All-NN performance via its batch mode. However, I don't understand the terminology well enough to determine whether batch mode is equivalent to solving All-NN. Can it?

    opened by maxsu 1
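Batch mode can in principle emulate All-NN by querying the index with the dataset itself and discarding each point's self-match. A brute-force reference sketch of what that computes:

```python
import math

def all_nearest_neighbors(points):
    """For each point, index of its nearest *other* point: what you'd get by
    batch-querying the index with the dataset itself and dropping self-matches."""
    return [min((j for j in range(len(points)) if j != i),
                key=lambda j: math.dist(points[j], p))
            for i, p in enumerate(points)]

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.2, 5.0)]
print(all_nearest_neighbors(pts))  # -> [1, 0, 3, 2]
```

With an ANN index, one would instead request k+1 neighbors per point and drop the self-match, at the cost of approximate answers.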
  • Upgrade Ubuntu and Python


    The repo is on Ubuntu 18 and Python 3.6.

    I'm having trouble upgrading the Elastiknn benchmarks because the latest Elasticsearch Ubuntu installer relies on some things that don't seem to be in Ubuntu 18. I've also jumped through some hoops with Python, e.g., intentionally installing older versions of scipy for compatibility with 3.6.

    Any reason not to upgrade? Any dragons to be aware of in case I take a first pass at it?

    opened by alexklibisz 2
  • category filter


    Would it be in the spirit of these benchmarks to add a second benchmark category for ANN in conjunction with categorical filters?

    Most real-world applications of ANN will require category filtering, e.g. when searching for clothes in an e-commerce scenario via ANN one might filter by gender (categorical) or availability (categorical).

    There are several software products which allow combining ANN and category filters, e.g. Apache Solr, Elasticsearch, Vertex AI Matching Engine, Weaviate, and Qdrant. However, they mainly differ from these benchmarks in that they are managed services, or at least services, rather than embeddable libraries.

    In addition to recall vs. queries per second, there should be a view that filters down to a fraction of the data vs. recall vs. queries per second. For the proprietary managed services, a cost dimension might also be useful.

    I have found the following blog articles covering the topic:

    • https://blog.vasnetsov.com/posts/categorical-hnsw/
    • https://towardsdatascience.com/effects-of-filtered-hnsw-searches-on-recall-and-latency-434becf8041c
    • https://towardsdatascience.com/using-approximate-nearest-neighbor-search-in-real-world-applications-a75c351445d

    Given that this would be very useful for practical implementations, but also that it significantly complicates the benchmarks, I would be interested in your opinion and/or how I could help with it. It would also be great to know if someone has already done such benchmarks.
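    As a sketch of the simplest approach discussed in those articles (post-filtering with over-fetching), with a brute-force scan standing in for an ANN index; all names and parameters here are illustrative:

```python
import numpy as np

def filtered_search(query, vectors, categories, allowed, k=10, overfetch=4):
    """Post-filtering sketch: over-fetch candidates from a nearest-neighbor
    search (brute force here, standing in for an ANN index), then keep only
    hits whose category is in the allowed set."""
    d = np.linalg.norm(vectors - query, axis=1)
    candidates = np.argsort(d)[: k * overfetch]   # over-fetch k * overfetch hits
    hits = [int(i) for i in candidates if categories[i] in allowed]
    return hits[:k]

vecs = np.array([[0.0], [1.0], [2.0], [3.0]])
cats = ["shoe", "shirt", "shoe", "shirt"]
print(filtered_search(np.array([0.0]), vecs, cats, {"shirt"}, k=2))  # → [1, 3]
```

    Post-filtering loses recall when the allowed categories are rare among the candidates, which is exactly why the proposed "fraction of the data vs. recall vs. QPS" view would be informative.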

    opened by gtsoukas 3
  • Relative error metric

    Relative error metric

    I was wondering if there is a way of defining what a “reasonable” relative error is, something similar to recall 0.9, 0.95, 0.99. Furthermore, how much is this metric affected by the distance metric? If you have any experience with this metric, I'll be happy to hear about it. Thanks in advance, Guy
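    For reference, one common definition (a sketch, not necessarily the exact formula used by these benchmarks) takes the ratio of each returned neighbor's distance to the true nearest neighbor's distance, minus one, averaged over queries:

```python
import numpy as np

def mean_relative_error(approx_dists, true_dists, eps=1e-12):
    """Mean relative error of approximate neighbor distances against
    ground-truth distances: mean of d_approx / d_true - 1 over queries."""
    ratio = np.asarray(approx_dists) / (np.asarray(true_dists) + eps)
    return float(np.mean(ratio - 1.0))

# An exact result scores 0; a neighbor 5% too far scores 0.05.
print(mean_relative_error([1.0, 2.1], [1.0, 2.0]))  # ≈ 0.025
```

    The distance metric clearly matters here: angular distances are bounded while Euclidean distances are not, so the same numeric error can mean quite different things across metrics.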

    opened by GuyAv46 0
  • Evaluate Elasticsearch approximate kNN API

    Evaluate Elasticsearch approximate kNN API

    This was recently released in Elasticsearch 8.0+ and the API is documented here

    Elasticsearch is distinct in both implementation and product from OpenSearch (the Amazon-maintained fork), and I expect that the two APIs will have different results. Both are HNSW-based.

    Thanks for maintaining this repo!
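    For context, a minimal request body for the Elasticsearch 8.x approximate kNN search, sent as `POST /my-index/_search` (index name, field name, and vector values are placeholders):

```json
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.12, -0.45, 0.91],
    "k": 10,
    "num_candidates": 100
  }
}
```

    `num_candidates` plays roughly the same accuracy/speed role as HNSW's `ef` search parameter, so it would presumably be the knob a benchmark run sweeps.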

    opened by rw-access 1
Owner
Erik Bernhardsson
Working on some weird ideas for data infra. Ex-CTO at better.com, likes to open source stuff sometimes and write random blog posts.