Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Overview

Annoy


Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Install

To install, simply do pip install --user annoy to pull down the latest version from PyPI.

For the C++ version, just clone the repo and #include "annoylib.h".

Background

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries (see below), but one feature really sets Annoy apart: it can use static files as indexes. In particular, this means you can share an index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing about Annoy is that it tries to minimize its memory footprint, so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPUs, you only need to build the index once. You can also pass around and distribute static files to use in production environments, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Summary of features

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular you cannot add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7
  • Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)

Python code example

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

Right now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other ids, you will have to keep track of a map yourself.

Full Python API

  • AnnoyIndex(f, metric) returns a new index that's read-write and stores vectors of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
  • a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
  • a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
  • a.save(fn, prefault=False) saves the index to disk and loads it (see next function). After saving, no more items can be added.
  • a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
  • a.unload() unloads.
  • a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes, which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between accuracy and speed. If you set include_distances to True, it will return a 2-element tuple of two lists: the first containing the items and the second containing the corresponding distances.
  • a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) same but query by vector v.
  • a.get_item_vector(i) returns the vector for item i that was previously added.
  • a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
  • a.get_n_items() returns the number of items in the index.
  • a.get_n_trees() returns the number of trees in the index.
  • a.on_disk_build(fn) prepares Annoy to build the index in the specified file instead of RAM (execute before adding items; no need to save after build)
  • a.set_seed(seed) initializes the random number generator with the given seed. It is only used for building the trees, i.e. you only need to pass it before adding the items. It has no effect after calling a.build(n_trees) or a.load(fn).

Notes:

  • There's no bounds checking performed on the values so be careful.
  • Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to sqrt(2(1-cos(u,v)))

The C++ API is very similar: just #include "annoylib.h" to get access to it.

Tradeoffs

There are just two main parameters needed to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k.

  • n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
  • search_k is provided at runtime and affects the search performance. A larger value will give more accurate results, but will take longer to return.

If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa. Basically it's recommended to set n_trees as large as possible given the amount of memory you can afford, and it's recommended to set search_k as large as possible given the time constraints you have for the queries.

You can also accept slower search times in favour of reduced loading times, memory usage, and disk IO. On supported platforms the index is prefaulted during load and save, causing the file to be pre-emptively read from disk into memory. If you set prefault to False, pages of the mmapped index are instead read from disk and cached in memory on-demand, as necessary for a search to complete. This can significantly increase early search times but may be better suited for systems with low memory compared to index size, when few queries are executed against a loaded index, and/or when large areas of the index are unlikely to be relevant to search queries.

How does it work

It works by using random projections and building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times to get a forest of trees. k has to be tuned to your needs, by looking at the tradeoff you have between precision and performance.
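
A toy sketch of a single split, in Python rather than Annoy's C++: sample two points and partition by which of the pair each point is closer to, which gives the same partition as the hyperplane equidistant from the pair.

```python
import random

# Pick two points and split the rest by which of the pair each point is
# closer to -- the same partition the equidistant hyperplane would give.
def split(points):
    p, q = random.sample(points, 2)
    dist2 = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    left = [x for x in points if dist2(x, p) <= dist2(x, q)]
    right = [x for x in points if dist2(x, p) > dist2(x, q)]
    return left, right

random.seed(3)
pts = [[random.gauss(0, 1) for _ in range(2)] for _ in range(100)]
left, right = split(pts)  # recurse on each side to build a full tree
```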

Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.

Dot Product distance (contributed by Peter Sobot) reduces the provided vectors from dot (or "inner-product") space to a more query-friendly cosine space using a method by Bachrach et al., at Microsoft Research, published in 2014.

More info

ANN benchmarks

Source code

It's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)

The code should support Windows, thanks to Qiang Kou and Timothy Riley.

To run the tests, execute python setup.py nosetests. The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.

Discuss

Feel free to post any questions or comments to the annoy-user group. I'm @fulhack on Twitter.

Comments
  • Failures in _annoy_test.py

    Hello!

    I ran tests from _annoy_test.py and got 7 failures:

    FAIL: test_get_nns_by_item (_annoy_test.AngularIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item
        self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2])
    AssertionError: Lists differ: [0, 1] != [0, 1, 2]

    Second list contains 1 additional elements. First extra element 2: 2

    - [0, 1]
    + [0, 1, 2]
    ?      +++

    FAIL: test_get_nns_by_vector (_annoy_test.AngularIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector
        self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2])
    AssertionError: Lists differ: [0, 1] != [0, 1, 2]

    Second list contains 1 additional elements. First extra element 2: 2

    - [0, 1]
    + [0, 1, 2]
    ?      +++

    FAIL: test_large_index (_annoy_test.AngularIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index
        self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1])
    AssertionError: Lists differ: [5, 4] != [504, 505]

    First differing element 0: 5 504

    - [5, 4]
    + [504, 505]

    FAIL: test_large_index (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index
        self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j])
    AssertionError: Lists differ: [228, 229] != [535, 534]

    First differing element 0: 228 535

    - [228, 229]
    + [535, 534]

    FAIL: test_precision_10 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10
        self.assertTrue(self.precision(10) == 1.0)
    AssertionError: False is not true

    FAIL: test_precision_100 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100
        self.assertTrue(self.precision(100) >= 0.99)
    AssertionError: False is not true

    FAIL: test_precision_1000 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last):
      File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000
        self.assertTrue(self.precision(1000) >= 0.99)
    AssertionError: False is not true


    Ran 14 tests in 2.622s

    FAILED (failures=7)

    Process finished with exit code 7

    opened by dina-iofe 38
  • Does windows support now?

    I am working on a Win10 desktop and encountering the same issue: C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2.

    opened by cadobe 31
  • OSError: Index size is not a multiple of vector size

    I used Annoy to build an index where the number of words is 5844240 and the vector size is 200. It raises an error: OSError: Index size is not a multiple of vector size. Please help me.

    opened by dtMndas 28
  • Windows support?

    The readme suggests that pip install annoy should work on Windows, but I get the following error:

    c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
    

    My googling tells me that unistd.h is a Linux-only file.

    Is there an easy fix for this?

    Thanks for the awesome package!

    And in case the full error message is useful:

    C:\Anaconda>pip install annoy
    Collecting annoy
      Downloading annoy-1.5.2.tar.gz (625kB)
        100% |################################| 626kB 409kB/s
    Building wheels for collected packages: annoy
      Running setup.py bdist_wheel for annoy
      Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\set
    up.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:
    
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-2.7
      creating build\lib.win-amd64-2.7\annoy
      copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
      running build_ext
      building 'annoy.annoylib' extension
      creating build\temp.win-amd64-2.7
      creating build\temp.win-amd64-2.7\Release
      creating build\temp.win-amd64-2.7\Release\src
      C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inclu
    de -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
      cl : Command line warning D9002 : ignoring unknown option '-O3'
      cl : Command line warning D9002 : ignoring unknown option '-march=native'
      cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
      annoymodule.cc
      C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unw
    ind semantics are not enabled. Specify /EHsc
      c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or d
    irectory
      error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
    
      ----------------------------------------
      Failed building wheel for annoy
    Failed to build annoy
    Installing collected packages: annoy
      Running setup.py install for annoy
        Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9
    \\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\
    s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        running build_ext
        building 'annoy.annoylib' extension
        C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inc
    lude -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
        cl : Command line warning D9002 : ignoring unknown option '-O3'
        cl : Command line warning D9002 : ignoring unknown option '-march=native'
        cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
        annoymodule.cc
        C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but u
    nwind semantics are not enabled. Specify /EHsc
        c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or
     directory
        error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
    
        ----------------------------------------
    Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(
    compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip
    -5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-j
    intu9\annoy
    
    opened by sergeyf 28
  • On disk index build

    Index build can now be performed directly on disk. The resulting index does not need to fit into memory anymore. This fixes #85 and probably fixes #250.

    TODO:

    • [x] Add support for on_disk_build to LUA module
    • [x] Add support for on_disk_build to Go module
    • [x] Fix Windows support (there is no ftruncate on Windows)
    • [ ] Fix WSL compatibility (ftruncate to cut off extra allocated memory fails)
    • [ ] Check return value of ftruncate
    opened by ReneHollander 25
  • First lookup for an ID is slow

    I'm seeing a problem when using the library where calling get_nns_by_vector is very slow on first execution for a given value (15s for an index with ~2m vectors of length 200), and then much faster in subsequent calls (subsecond). I've also seen similar behavior in get_item_vector, but I'm not sure if it's related. I've set prefault to true when loading the index from disk.

    The strangest part is that it only occurs on some computers, and I'm not able to repro on my local machine to get details on what is going on. If I load the same index on my computer, the execution time is always subsecond. I'm continuing to look into this to see if I can find a reliable method of reproducing.

    Has anyone seen performance patterns like this? Do you have thoughts on what the problem may be? I'm not sure that it's Annoy specifically, but would be good to know if it is.

    My current thoughts are that it has to do with the MMAPing and that the index is not all saved into RAM and the disk lookups are slow on some machines. I'm not very familiar with MMAPing and would appreciate outside thoughts on whether that's reasonable and/or how to verify if it's the problem.

    opened by scottbreyfogle 24
  • Add dot product as a distance metric.

    This PR:

    • adds a DotProduct metric, using the negative of the dot product between two vectors as a pseudo-distance metric.
    • makes this metric available under the string name "dot" in the standard Annoy interface
    • adds tests for this new metric, similar to the existing tests for the Euclidean metric.

    You might ask - why add dot when we already have angular, which is basically just dot but divided by magnitude? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between the user and item pair is a prediction of affinity between that user and item. (Here's a great example from @jfkirk.)

    enhancement 
    opened by psobot 22
  • Windows 7 x64 And Anaconda v5.0.1 - Unable to Install annoy

    Using latest (as of yesterday) Anaconda install using python 3.6.4, When I attempt to install via: pip install annoy

    I get the following error:

    (Seq2Seq) C:\Users\lrichards>pip install annoy
    Collecting annoy
      Using cached annoy-1.11.1.tar.gz
    Building wheels for collected packages: annoy
      Running setup.py bdist_wheel for annoy ... error
      Complete output from command C:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" bdist_wheel -d C:\Users\LRICHA~1\AppData\Local\Temp\tmp25o7_qvdpip-wheel- --python-tag cp36:
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.6
      creating build\lib.win-amd64-3.6\annoy
      copying annoy\__init__.py -> build\lib.win-amd64-3.6\annoy
      running build_ext
      building 'annoy.annoylib' extension
      creating build\temp.win-amd64-3.6
      creating build\temp.win-amd64-3.6\Release
      creating build\temp.win-amd64-3.6\Release\src
      C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj
      annoymodule.cc
      c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib.h(19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
      error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe' failed with exit status 2

      Failed building wheel for annoy
      Running setup.py clean for annoy
    Failed to build annoy
    Installing collected packages: annoy
      Running setup.py install for annoy ... error
      Complete output from command C:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\LRICHA~1\AppData\Local\Temp\pip-s36zmsav-record\install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.6
        creating build\lib.win-amd64-3.6\annoy
        copying annoy\__init__.py -> build\lib.win-amd64-3.6\annoy
        running build_ext
        building 'annoy.annoylib' extension
        creating build\temp.win-amd64-3.6
        creating build\temp.win-amd64-3.6\Release
        creating build\temp.win-amd64-3.6\Release\src
        C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj
        annoymodule.cc
        c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib.h(19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory
        error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe' failed with exit status 2

        ----------------------------------------

    Command "C:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;__file__='C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\LRICHA~1\AppData\Local\Temp\pip-s36zmsav-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\

    There is no annoy package available via Conda for Windows and no package listed in Anaconda Navigator.

    How can I get this lib installed under Windows?

    opened by wergeld 22
  • Windows build issue

    Hi,

    I'm trying to build with mingw32 on windows (Anaconda python v3.6) and getting errors that look like they're in the python rather than the C:

    C:\annoy>python setup.py build --compiler=mingw32
    running build
    running build_py
    running build_ext
    Traceback (most recent call last):
      File "setup.py", line 69, in <module>
        setup_requires=['nose>=1.0']
      File "C:\Users\Ian\Anaconda3\lib\distutils\core.py", line 148, in setup
        dist.run_commands()
      File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 955, in run_commands
        self.run_command(cmd)
      File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "C:\Users\Ian\Anaconda3\lib\distutils\command\build.py", line 135, in run
        self.run_command(cmd_name)
      File "C:\Users\Ian\Anaconda3\lib\distutils\cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command
        cmd_obj.run()
      File "C:\Users\Ian\Anaconda3\lib\site-packages\setuptools\command\build_ext.py", line 75, in run
        _build_ext.run(self)
      File "C:\Users\Ian\Anaconda3\lib\site-packages\Cython\Distutils\old_build_ext.py", line 185, in run
        _build_ext.build_ext.run(self)
      File "C:\Users\Ian\Anaconda3\lib\distutils\command\build_ext.py", line 308, in run
        force=self.force)
      File "C:\Users\Ian\Anaconda3\lib\distutils\ccompiler.py", line 1031, in new_compiler
        return klass(None, dry_run, force)
      File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 282, in __init__
        CygwinCCompiler.__init__(self, verbose, dry_run, force)
      File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 126, in __init__
        if self.ld_version >= "2.10.90":
    TypeError: '>=' not supported between instances of 'NoneType' and 'str'

    Any ideas what's causing it ?

    opened by MonkeyChap 22
  • Rewrite existing file issue

    Hello! I'm using annoy 1.15.2 on python 3.7.3. Here is the issue:

    >>>from annoy import AnnoyIndex
    >>>a=AnnoyIndex(1)
    >>>a.add_item(1,[1])
    >>>a.add_item(2,[2])
    >>>a.build(1)
    True
    >>>a.save("temp.ann")
    True
    >>>b=AnnoyIndex(1)
    >>>b.add_item(3,[3])
    >>>b.add_item(4,[4])
    >>>b.build(1)
    True
    >>>b.save("temp.ann")
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    OSError: [Errno 22] Invalid argument
    

    Could you tell why I get the error?

    opened by natural-chop 21
  • Multithreaded build

    Issue: https://github.com/spotify/annoy/issues/70

    This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04) all C++, Python and Lua tests seem to pass (I've re-run them multiple times). The only thing remaining is fixing the Go build and I would appreciate some help with this :smile:

    I've run some speed tests on a glove-angular-25 index with n_trees=32. Reference time on master (without threads) was 49.9s. Here are the results for multiple threads:

    | Num. threads | Time (s)    |
    |--------------|-------------|
    | 1            | 51.28759694 |
    | 2            | 26.675179   |
    | 4            | 14.82837749 |
    | 8            | 8.897883892 |
    | 16           | 7.353658915 |

    If you have any questions or concerns feel free to comment.

    opened by novoselrok 20
  • Decrease memory usage by decreasing the number of trees, or decreasing dimensionality?

    I have about ~1m vectors I would like to search through, each with 1000 dimensions. Ideally, I would like to keep the size of the index under 500mb. I believe I have two levers to pull for allowing the index to fit in memory:

    1. Reduce the number of trees. For example, in my testing, reducing the # of trees from 70 to 10 cut the memory usage in ~half.
    2. Reduce dimensionality. For example, reducing dimensionality from 1000 to 500 also cuts the memory usage in ~half. I could do something like TruncatedSVD (sklearn docs) to do this (but lmk if there's a better method).

    Of course, I could also do some combination of these two. What are the tradeoffs between both methods of reducing memory usage?

    opened by gusgordon 0
  • Dot index recall issue on last.fm dataset

    Hi!

    I had pretty low recall for dot index on my data. Then I decided to test the lastfm benchmark manually.

    Then I realized that this benchmark is actually for an angular index with manually computed 65th component -- sqrt(max_norm^2 - ||x||^2). I've reproduced results for an angular index, but for a dot index (with the 65th being deleted) recall was again pretty low!

    There's the notebook with calculations.

    What's going on in the notebook:

    1. Calculate exact dot products using brute force, then calculate top candidates
    2. Construct an angular index with the additional component sqrt(max_norm^2 - ||x||^2), compute recalls for various search_k's
    3. Construct a dot index without the additional component sqrt(max_norm^2 - ||x||^2), again compute recalls for various search_k's

    The results are as follows (10 trees):

    | search_k                             | 1000  | 10000 | 50000 |
    |--------------------------------------|-------|-------|-------|
    | Angular index recall (65 components) | 0.563 | 0.736 | 0.783 |
    | Dot index recall (64 components)     | 0.003 | 0.034 | 0.175 |

    But, as I understand, results should be at least comparable.

    Could there be a bug? If my results are correct, I would be glad to contribute some fixes. However, so far I don't see what could be wrong in the source code of the dot index.

    opened by pkorobov 2
  • How realistic is serializing the internal C++ AnnoyIndex state?

    Basically what I would want is to run something like this:

    import os
    import random
    
    from annoy import AnnoyIndex
    
    num_rows = 10000
    num_trees = 10
    num_dims = 512
    
    try:
        os.remove("annoy_idx.ann")
    except FileNotFoundError:
        pass
    
    annoy_idx = AnnoyIndex(num_dims, "angular")
    annoy_idx.on_disk_build("annoy_idx.ann")
    for idx in range(num_rows):
        vector = [random.gauss(0, 1) for _ in range(num_dims)]
        annoy_idx.add_item(idx, vector)
    
    annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for
    

    and then (after that program is done and exited) I would like to continue appending data like something like this (this adds 10,000 new rows with indices 10,000, ..., 19,999):

    import os
    import random
    
    from annoy import AnnoyIndex
    
    num_rows = 10000
    num_trees = 10
    num_dims = 512
    
    try:
        os.remove("annoy_idx.ann")
    except FileNotFoundError:
        pass
    
    annoy_idx = AnnoyIndex(num_dims, "angular")
    annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for
    for idx in range(num_rows):
        vector = [random.gauss(0, 1) for _ in range(num_dims)]
    annoy_idx.add_item(idx + num_rows, vector) # XXX - Note the increase in idx variable
    

    So basically what I want is for there to be a serialize/deserialize ability so that I can continue the flow. It seems to me like the protected data here would need to be serialized:

    https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885

    In my case it seems to basically come down to serializing the node here:

    https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463

    So my question is the following:

    How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws with my thinking if I decided to go this route?
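    For context, the workaround I'm currently considering is to persist the raw vectors myself and re-add them before build(). A minimal stdlib-only sketch (save_vectors/load_vectors are my own hypothetical helpers; the Annoy calls are left as comments since that part is the already-supported flow):

    ```python
    import os
    import pickle
    import tempfile

    def save_vectors(path, vectors):
        """Persist {item_id: vector} so a later run can re-add them to a fresh index."""
        with open(path, "wb") as fh:
            pickle.dump(vectors, fh)

    def load_vectors(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)

    state_path = os.path.join(tempfile.mkdtemp(), "vectors.pkl")

    # "First run": gather vectors (stand-in data here) and persist them.
    vectors = {idx: [0.0] * 512 for idx in range(3)}
    save_vectors(state_path, vectors)

    # "Second run": reload, append new rows, then re-add everything and build once.
    vectors = load_vectors(state_path)
    vectors[len(vectors)] = [1.0] * 512
    # annoy_idx = AnnoyIndex(512, "angular")
    # for idx, vec in vectors.items():
    #     annoy_idx.add_item(idx, vec)
    # annoy_idx.build(num_trees)
    ```

    Re-adding every vector is O(n) per run, but the expensive build() still happens only once, so this avoids serializing the internal state at the cost of storing the raw vectors twice.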

    Thanks for any help!

    opened by ApproximateIdentity 0
  • Annoy build index failing in docker image built from Github Actions

    Hi!

    We integrated an Annoy angular index into our FastAPI Python service, which runs on Azure k8s service.

    An index is built from a pandas dataframe every time the service starts. The index build code is shown in the snippet below.

    The data from which we build the index has 10k rows, each with a float vector of size 9.

    n_trees = 10
    f = 9  # vector dimensionality
    ANNOY_INDEX = AnnoyIndex(f, "angular")
    for i, row in dfMl.iterrows():
      wineId = int(row[:1].values[0])
      wineValues = row[1:].values
      ANNOY_INDEX.add_item(wineId, wineValues)
      
    ANNOY_INDEX.build(n_trees, n_jobs=1)
    

    Locally on Apple M1 machine, the service works fine, and when we build a docker image locally and push it to ACR from which we deploy the image to k8s, everything works too.

    docker buildx build --platform linux/amd64 -t apiservice:1.0.0 .
    docker push apiservice:1.0.0
    

    The issue appears when we try to build the image with the GitHub Actions pipeline (docker build and push). The build does not fail, and we are able to push the image built in GitHub Actions to our container registry. But when we try to deploy this image to k8s, the pod fails with the status CrashLoopBackOff.

    In the GitHub Actions pipeline we specify the platform as linux/amd64, the same as when building the image locally.

    From the pod description, we get exit code 132, which indicates the process died on an unsupported processor instruction.

    Last State:     Terminated                                                                                                                                                                                                                     
       Reason:       Error                                                                                                                                                                                                                         
       Exit Code:    132
    
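    For reference, exit codes above 128 encode the fatal signal as code - 128, which can be verified with Python's standard library; 132 maps to SIGILL (illegal instruction), consistent with a binary compiled for CPU features (e.g. AVX extensions present on the build host) that the k8s node's processor lacks:

    ```python
    import signal

    exit_code = 132
    sig = signal.Signals(exit_code - 128)
    print(sig.name)  # SIGILL: the process executed a CPU instruction its host doesn't support
    ```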

    We added logging around this and found that the service fails at index.build().

    Did someone encounter a similar issue or could help us figure this one out?

    Thanks!

    opened by williambrach 4
  • cmake: missing install instruction

    Hi there,

    Typically, to use a C++ library externally with CMake, the project needs an install() section. I am trying to use this library with cpm, and the library headers are not installed to the expected location.

    I see there are some directives to copy things over but I can't make this work. I will try to create a reproducible example.

    opened by bsergean 0
Releases(v1.17.1)
Owner
Spotify
Weighted K Nearest Neighbors (kNN) algorithm implemented on python from scratch.

kNN_From_Scratch I implemented the k nearest neighbors (kNN) classification algorithm on python. This algorithm is used to predict the classes of new

null 1 Dec 14, 2021
Optimal space decomposition based-product quantization for approximate nearest neighbor search

Optimal space decomposition based-product quantization for approximate nearest neighbor search Abstract Product quantization(PQ) is an effective neare

Mylove 1 Nov 19, 2021
Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval This repo provides personal implementation of paper Approximate Ne

John 8 Oct 7, 2022
The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory

Simple-DMA a simple Dual Memory Architecture for classifications. based on the paper Dual-Memory Deep Learning Architectures for Lifelong Learning of

null 1 Jan 27, 2022
This is the official PyTorch implementation for "Mesa: A Memory-saving Training Framework for Transformers".

Mesa: A Memory-saving Training Framework for Transformers This is the official PyTorch implementation for Mesa: A Memory-saving Training Framework for

Zhuang AI Group 105 Dec 6, 2022
In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

In-Place Activated BatchNorm In-Place Activated BatchNorm for Memory-Optimized Training of DNNs In-Place Activated BatchNorm (InPlace-ABN) is a novel

null 1.3k Dec 29, 2022
[ICSE2020] MemLock: Memory Usage Guided Fuzzing

MemLock: Memory Usage Guided Fuzzing This repository provides the tool and the evaluation subjects for the paper "MemLock: Memory Usage Guided Fuzzing

Cheng Wen 54 Jan 7, 2023
Segcache: a memory-efficient and scalable in-memory key-value cache for small objects

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects This repo contains the code of Segcache described in the followi

TheSys Group @ CMU CS 78 Jan 7, 2023
Code for "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" paper

UNICORN Webpage | Paper | BibTex PyTorch implementation of "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" pap

null 118 Jan 6, 2023
PyTorch Code of "Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics"

Memory In Memory Networks It is based on the paper Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spati

Yang Li 12 May 30, 2022
Episodic-memory - Ego4D Episodic Memory Benchmark

Ego4D Episodic Memory Benchmark EGO4D is the world's largest egocentric (first p

null 3 Feb 18, 2022
Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

Phil Wang 180 Jan 5, 2023
Simple tools for logging and visualizing, loading and training

TNT TNT is a library providing powerful dataloading, logging and visualization utilities for Python. It is closely integrated with PyTorch and is desi

null 1.5k Jan 2, 2023
Calling Julia from Python - an experiment on data loading

Calling Julia from Python - an experiment on data loading See the slides. TLDR After reading Patrick's blog post, we decided to try to replace C++ wit

Abel Siqueira 8 Jun 7, 2022
Python code for loading the Aschaffenburg Pose Dataset.

Aschaffenburg Pose Dataset (APD) This repository contains Python code for loading and filtering the Aschaffenburg Pose Dataset. The dataset itself and

null 1 Nov 26, 2021
A gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor.

OpenHands OpenHands is a gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor. Currently the system can iden

Paul Treanor 12 Jan 10, 2022
Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning

structshot Code and data for paper "Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning", Yi Yang and Arz

ASAPP Research 47 Dec 27, 2022