Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Overview

Annoy

Annoy example
https://img.shields.io/travis/spotify/annoy/master.svg?style=flat https://ci.appveyor.com/api/projects/status/github/spotify/annoy?svg=true&pendingText=windows%20-%20Pending&passingText=windows%20-%20OK&failingText=windows%20-%20Failing https://img.shields.io/pypi/v/annoy.svg?style=flat

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Install

To install, simply do pip install --user annoy to pull down the latest version from PyPI.

For the C++ version, just clone the repo and #include "annoylib.h".

Background

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing of Annoy is that it tries to minimize memory footprint so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPU's, you only need to build the index once. You can also pass around and distribute static files to use in production environment, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Summary of features

  • Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
  • Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
  • Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
  • Small memory usage
  • Lets you share memory between multiple processes
  • Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
  • Native Python support, tested with 2.7, 3.6, and 3.7.
  • Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)

Python code example

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

Right now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other id's, you will have to keep track of a map yourself.

Full Python API

  • AnnoyIndex(f, metric) returns a new index that's read-write and stores vector of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
  • a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
  • a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
  • a.save(fn, prefault=False) saves the index to disk and loads it (see next function). After saving, no more items can be added.
  • a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
  • a.unload() unloads.
  • a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between better accuracy and speed. If you set include_distances to True, it will return a 2 element tuple with two lists in it: the second one containing all corresponding distances.
  • a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) same but query by vector v.
  • a.get_item_vector(i) returns the vector for item i that was previously added.
  • a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
  • a.get_n_items() returns the number of items in the index.
  • a.get_n_trees() returns the number of trees in the index.
  • a.on_disk_build(fn) prepares annoy to build the index in the specified file instead of RAM (execute before adding items, no need to save after build)
  • a.set_seed(seed) will initialize the random number generator with the given seed. Only used for building up the tree, i. e. only necessary to pass this before adding the items. Will have no effect after calling a.build(n_trees) or a.load(fn).

Notes:

  • There's no bounds checking performed on the values so be careful.
  • Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to sqrt(2(1-cos(u,v)))

The C++ API is very similar: just #include "annoylib.h" to get access to it.

Tradeoffs

There are just two main parameters needed to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k.

  • n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
  • search_k is provided in runtime and affects the search performance. A larger value will give more accurate results, but will take longer time to return.

If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa. Basically it's recommended to set n_trees as large as possible given the amount of memory you can afford, and it's recommended to set search_k as large as possible given the time constraints you have for the queries.

You can also accept slower search times in favour of reduced loading times, memory usage, and disk IO. On supported platforms the index is prefaulted during load and save, causing the file to be pre-emptively read from disk into memory. If you set prefault to False, pages of the mmapped index are instead read from disk and cached in memory on-demand, as necessary for a search to complete. This can significantly increase early search times but may be better suited for systems with low memory compared to index size, when few queries are executed against a loaded index, and/or when large areas of the index are unlikely to be relevant to search queries.

How does it work

Using random projections and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times so that we get a forest of trees. k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.

Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.

Dot Product distance (contributed by Peter Sobot) reduces the provided vectors from dot (or "inner-product") space to a more query-friendly cosine space using a method by Bachrach et al., at Microsoft Research, published in 2014.

More info

ANN benchmarks

Source code

It's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)

The code should support Windows, thanks to Qiang Kou and Timothy Riley.

To run the tests, execute python setup.py nosetests. The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.

Discuss

Feel free to post any questions or comments to the annoy-user group. I'm @fulhack on Twitter.

Comments
  • Failures in _annoy_test.py

    Failures in _annoy_test.py

    Hello!

    I ran tests from _annoy_test.py and got 7 failures:

    FAIL: test_get_nns_by_item (_annoy_test.AngularIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2]) AssertionError: Lists differ: [0, 1] != [0, 1, 2]

    Second list contains 1 additional elements. First extra element 2: 2

    • [0, 1]
    • [0, 1, 2] ? +++

    FAIL: test_get_nns_by_vector (_annoy_test.AngularIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2]) AssertionError: Lists differ: [0, 1] != [0, 1, 2]

    Second list contains 1 additional elements. First extra element 2: 2

    • [0, 1]
    • [0, 1, 2] ? +++

    FAIL: test_large_index (_annoy_test.AngularIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1]) AssertionError: Lists differ: [5, 4] != [504, 505]

    First differing element 0: 5 504

    • [5, 4]
    • [504, 505]

    FAIL: test_large_index (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j]) AssertionError: Lists differ: [228, 229] != [535, 534]

    First differing element 0: 228 535

    • [228, 229]'ve added test/
    • [535, 534]

    FAIL: test_precision_10 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10 self.assertTrue(self.precision(10) == 1.0) AssertionError: False is not true

    FAIL: test_precision_100 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100 self.assertTrue(self.precision(100) >= 0.99) AssertionError: False is not true

    FAIL: test_precision_1000 (_annoy_test.EuclideanIndexTest)

    Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000 self.assertTrue(self.precision(1000) >= 0.99) AssertionError: False is not true


    Ran 14 tests in 2.622s

    FAILED (failures=7)

    Process finished with exit code 7

    opened by dina-iofe 38
  • Does windows support now?

    Does windows support now?

    I am working on a win10 desktop, encountering the same issue that C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2.

    opened by cadobe 31
  • OSError: Index size is not a multiple of vector size

    OSError: Index size is not a multiple of vector size

    i use annoy to build ,which the number of words is 5844240 and the vector size is 200. it raise a error: 【OSError: Index size is not a multiple of vector size】。please help me

    opened by dtMndas 28
  • Windows support?

    Windows support?

    The readme suggests that pip install annoy should work on Windows, but I get the following error:

    c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory
    

    My googling tells me that unistd.h is a Linux-only file.

    Is there an easy fix for this?

    Thanks for the awesome package!

    And in case the full error message is useful:

    C:\Anaconda>pip install annoy
    Collecting annoy
      Downloading annoy-1.5.2.tar.gz (625kB)
        100% |################################| 626kB 409kB/s
    Building wheels for collected packages: annoy
      Running setup.py bdist_wheel for annoy
      Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\set
    up.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:
    
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-2.7
      creating build\lib.win-amd64-2.7\annoy
      copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
      running build_ext
      building 'annoy.annoylib' extension
      creating build\temp.win-amd64-2.7
      creating build\temp.win-amd64-2.7\Release
      creating build\temp.win-amd64-2.7\Release\src
      C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inclu
    de -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
      cl : Command line warning D9002 : ignoring unknown option '-O3'
      cl : Command line warning D9002 : ignoring unknown option '-march=native'
      cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
      annoymodule.cc
      C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unw
    ind semantics are not enabled. Specify /EHsc
      c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or d
    irectory
      error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
    
      ----------------------------------------
      Failed building wheel for annoy
    Failed to build annoy
    Installing collected packages: annoy
      Running setup.py install for annoy
        Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9
    \\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\
    s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        running build_ext
        building 'annoy.annoylib' extension
        C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inc
    lude -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
        cl : Command line warning D9002 : ignoring unknown option '-O3'
        cl : Command line warning D9002 : ignoring unknown option '-march=native'
        cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
        annoymodule.cc
        C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but u
    nwind semantics are not enabled. Specify /EHsc
        c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or
     directory
        error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2
    
        ----------------------------------------
    Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(
    compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip
    -5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-j
    intu9\annoy
    
    opened by sergeyf 28
  • On disk index build

    On disk index build

    Index build can now be performed directly on disk. The resulting index does not need to fit into memory anymore. This fixes #85 and probably fixes #250.

    TODO:

    • [x] Add support for on_disk_build to LUA module
    • [x] Add support for on_disk_build to Go module
    • [x] Fix Windows support (there is not ftruncate on windows)
    • [ ] Fix WSL compatability (ftruncate to cut off extra allocated memory fails)
    • [ ] Check return value of ftruncate
    opened by ReneHollander 25
  • First lookup for an ID is slow

    First lookup for an ID is slow

    I'm seeing a problem when using the library where calling get_nns_by_vector is very slow on first execution for a given value (15s for an index with ~2m vectors of length 200), and then much faster in subsequent calls (subsecond). I've also seen similar behavior in get_item_vector, but I'm not sure if it's related. I've set prefault to true when loading the index from disk.

    The strangest part is that it only occurs on some computers, and I'm not able to repro on my local machine to get details on what is going on. If I load the same index on my computer, the execution time is always subsecond. I'm continuing to look into this to see if I can find a reliable method of reproducing.

    Has anyone seen performance patterns like this? Do you have thoughts on what the problem may be? I'm not sure that it's Annoy specifically, but would be good to know if it is.

    My current thoughts are that it has to do with the MMAPing and that the index is not all saved into RAM and the disk lookups are slow on some machines. I'm not very familiar with MMAPing and would appreciate outside thoughts on whether that's reasonable and/or how to verify if it's the problem.

    opened by scottbreyfogle 24
  • Add dot product as a distance metric.

    Add dot product as a distance metric.

    This PR:

    • adds a DotProduct metric, using the negative of the dot product between two vectors as a pseudo-distance metric.
    • makes this metric available under the string name "dot" in the standard Annoy interface
    • adds tests for this new metric, similar to the existing tests for the Euclidean metric.

    You might ask - why add dot when we already have angular, which is basically just dot but divided by magnitude? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between the user and item pair is a prediction of affinity between that user and item. (Here's a great example from @jfkirk.)

    enhancement 
    opened by psobot 22
  • Windows 7 x64 And Anaconda v5.0.1 - Unable to Install annoy

    Windows 7 x64 And Anaconda v5.0.1 - Unable to Install annoy

    Using latest (as of yesterday) Anaconda install using python 3.6.4, When I attempt to install via: pip install annoy

    I get the following error:

    (Seq2Seq) C:\Users\lrichards>pip install annoy Collecting annoy Using cached annoy-1.11.1.tar.gz Building wheels for collected packages: annoy Running setup.py bdist_wheel for annoy ... error Complete output from command C:\Users\lrichards\AppData\Local\Continuum\anaco da3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;file='C:\Us rs\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=geta tr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.clos ();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\LRICHA~1\AppD ta\Local\Temp\tmp25o7_qvdpip-wheel- --python-tag cp36: running bdist_wheel running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\annoy copying annoy_init_.py -> build\lib.win-amd64-3.6\annoy running build_ext building 'annoy.annoylib' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\src C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nologo Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3
    nvs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Inclu e" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /Tps c/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj annoymodule.cc c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib.h 19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or d rectory error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bi \cl.exe' failed with exit status 2


    Failed building wheel for annoy Running setup.py clean for annoy Failed to build annoy Installing collected packages: annoy Running setup.py install for annoy ... error Complete output from command C:\Users\lrichards\AppData\Local\Continuum\ana onda3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;file='C:\ sers\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=ge attr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.cl se();exec(compile(code, file, 'exec'))" install --record C:\Users\LRICHA~1
    ppData\Local\Temp\pip-s36zmsav-record\install-record.txt --single-version-exter ally-managed --compile: running install running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\annoy copying annoy_init_.py -> build\lib.win-amd64-3.6\annoy running build_ext building 'annoy.annoylib' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\src C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nolog /Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda \envs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\en s\Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Inc ude" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /T src/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj annoymodule.cc c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib h(19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\ in\cl.exe' failed with exit status 2

    ----------------------------------------
    

    Command "C:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\pyth n.exe -u -c "import setuptools, tokenize;file='C:\Users\LRICHA~1\AppData \Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=getattr(tokenize, 'open', pen)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\LRICHA~1\AppData\Local\Temp\pip-s 6zmsav-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1q 4j5\annoy\

    There is no annoy package available via Conda for Windows and no package listed in Anaconda Navigator.

    How can I get this lib installed under Window?

    opened by wergeld 22
  • Windows build issue

    Windows build issue

    Hi,

    I'm trying to build with mingw32 on windows (Anaconda python v3.6) and getting errors that look like they're in the python rather than the C:

    C:\annoy>python setup.py build --compiler=mingw32 running build running build_py running build_ext Traceback (most recent call last): File "setup.py", line 69, in setup_requires=['nose>=1.0'] File "C:\Users\Ian\Anaconda3\lib\distutils\core.py", line 148, in setup dist.run_commands() File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 955, in run_commands self.run_command(cmd) File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "C:\Users\Ian\Anaconda3\lib\distutils\command\build.py", line 135, in run self.run_command(cmd_name) File "C:\Users\Ian\Anaconda3\lib\distutils\cmd.py", line 313, in run_command self.distribution.run_command(command) File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "C:\Users\Ian\Anaconda3\lib\site-packages\setuptools\command\build_ext.py", line 75, in run _build_ext.run(self) File "C:\Users\Ian\Anaconda3\lib\site-packages\Cython\Distutils\old_build_ext.py", line 185, in run _build_ext.build_ext.run(self) File "C:\Users\Ian\Anaconda3\lib\distutils\command\build_ext.py", line 308, in run force=self.force) File "C:\Users\Ian\Anaconda3\lib\distutils\ccompiler.py", line 1031, in new_compiler return klass(None, dry_run, force) File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 282, in init CygwinCCompiler.init (self, verbose, dry_run, force) File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 126, in init if self.ld_version >= "2.10.90": TypeError: '>=' not supported between instances of 'NoneType' and 'str'

    Any ideas what's causing it ?

    opened by MonkeyChap 22
  • Rewrite existing file issue

    Rewrite existing file issue

    Hello! I'm using annoy 1.15.2 on python 3.7.3. Here is the issue:

    >>>from annoy import AnnoyIndex
    >>>a=AnnoyIndex(1)
    >>>a.add_item(1,[1])
    >>>a.add_item(2,[2])
    >>>a.build(1)
    True
    >>>a.save("temp.ann")
    True
    >>>b=AnnoyIndex(1)
    >>>b.add_item(3,[3])
    >>>b.add_item(4,[4])
    >>>b.build(1)
    True
    >>>b.save("temp.ann")
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
    OSError: [Errno 22] Invalid argument
    

    Could you tell why I get the error?

    opened by natural-chop 21
  • Multithreaded build

    Multithreaded build

    Issue: https://github.com/spotify/annoy/issues/70

    This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04) all C++, Python and Lua tests seem to pass (I've re-ran them multiple times). The only thing remaining is fixing the Go build and I would appreciate some help with this :smile:

    I've ran some speed tests on glove-angular-25 index with n_trees=32. Reference time on master (without threads) was 49.9s. Here are results for multiple threads: | Num. threads | Time (s) | |--------------|-------------| | 1 | 51.28759694 | | 2 | 26.675179 | | 4 | 14.82837749 | | 8 | 8.897883892 | | 16 | 7.353658915 |

    If you have any questions or concerns feel free to comment.

    opened by novoselrok 20
  • Decrease memory usage by decreasing the number of trees, or decreasing dimensionality?

    Decrease memory usage by decreasing the number of trees, or decreasing dimensionality?

    I have about ~1m vectors I would like to search through, each with 1000 dimensions. Ideally, I would like to keep the size of the index under 500mb. I believe I have two levers to pull for allowing the index to fit in memory:

    1. Reduce the number of trees. For example, in my testing, reducing the # of trees from 70 to 10 cut the memory usage in ~half.
    2. Reduce dimensionality. For example, reducing dimensionality from 1000 to 500 to also cut the memory usage in ~half. I could do something like TruncatedSVD (sklearn docs) to do this (but lmk if there's a better method).

    Of course, I could also do some combination of these two. What are the tradeoffs between both methods of reducing memory usage?

    opened by gusgordon 0
  • Dot index recall issue on last.fm dataset

    Dot index recall issue on last.fm dataset

    Hi!

    I had pretty low recall for dot index on my data. Then I decided to test the lastfm benchmark manually.

    Then I realized that this benchmark is actually for an angular index with manually computed 65th component -- sqrt(max_norm^2 - ||x||^2). I've reproduced results for an angular index, but for a dot index (with the 65th being deleted) recall was again pretty low!

    There's the notebook with calculations.

    What's going on in the notebook:

    1. Calculate exact dot products using brute force, then calculate top candidates
    2. Construct an angular index with the additional component sqrt(max_norm^2 - ||x||^2), compute recalls for various search_k's
    3. Construct a dot index without the additional component sqrt(max_norm^2 - ||x||^2), again compute recalls for various search_k's

    The results are as follows (10 trees):

    | search_k | 1000 | 10000 | 50000 | |----------------------------|------------|-----------|-----------| | Angular index recall (65 components) | 0.563 | 0.736 | 0.783 | | Dot index recall (64 components) | 0.003 | 0.034 | 0.175 |

    But, as I understand, results should be at least comparable.

    May there be a bug? If my results are correct, I would be glad to contribute with some fixes. However, by now I don't see what can be wrong in the source code of dot index.

    opened by pkorobov 2
  • How realistic is serializing the internal C++ AnnoyIndex state?

    How realistic is serializing the internal C++ AnnoyIndex state?

    Basically what I would want is to run something like this:

    import os
    import random
    
    from annoy import AnnoyIndex
    
    num_rows = 10000
    num_trees = 10
    num_dims = 512
    
    try:
        os.remove("annoy_idx.ann")
    except FileNotFoundError:
        pass
    
    annoy_idx = AnnoyIndex(num_dims, "angular")
    annoy_idx.on_disk_build("annoy_idx.ann")
    for idx in range(num_rows):
        vector = [random.gauss(0, 1) for _ in range(num_dims)]
        annoy_idx.add_item(idx, vector)
    
    annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for
    

    and then (after that program is done and exited) I would like to continue appending data like something like this (this adds 10,000 new rows with indices 10,000, ..., 19,999):

    import os
    import random
    
    from annoy import AnnoyIndex
    
    num_rows = 10000
    num_trees = 10
    num_dims = 512
    
    try:
        os.remove("annoy_idx.ann")
    except FileNotFoundError:
        pass
    
    annoy_idx = AnnoyIndex(num_dims, "angular")
    annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for
    for idx in range(num_rows):
        vector = [random.gauss(0, 1) for _ in range(num_dims)]
        annoy_idx.add_item(idx +num_rows, vector) # XXX - Note the increase in idx variable
    

    So basically what I want is for there to be a serialize/deserialize ability so that I can continue the flow. It seems to me like the protected data here would need to be serialized:

    https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885

    In my case it seems to basically serializing the node here:

    https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463

    So my question is the following:

    How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws with my thinking if I decided to go this route?

    Thanks for any help!

    opened by ApproximateIdentity 0
  • Annoy build index failing in docker image built from Github Actions

    Annoy build index failing in docker image built from Github Actions

    Hi!

    We tried to implement Annoy angular index into our FastAPI python service which runs on Azure k8s service.

    An index is being built every time from the pandas dataframe when the service starts. The index build code is snipped below.

    Data from which we build the index has 10k rows and a float vector of size 9.

    n_trees = 10
    f = 9
    ANNOY_INDEX = AnnoyIndex(9, "angular")
    for i, row in dfMl.iterrows():
      wineId = int(row[:1].values[0])
      wineValues = row[1:].values
      ANNOY_INDEX.add_item(wineId, wineValues)
      
    ANNOY_INDEX.build(n_trees, n_jobs=1)
    

    Locally on Apple M1 machine, the service works fine, and when we build a docker image locally and push it to ACR from which we deploy the image to k8s, everything works too.

    docker buildx build --platform linux/amd64 -t apiservice:1.0.0 .
    docker push apiservice:1.0.0
    

    The issue appears when we try to build the image with the GitHub Actions pipeline (docker push and build). The build will not fail and we are able to deploy image built in GitHub actions to our container registry. But when we try to deploy this image into k8s, the pod fails with the status CrashLoopBackOff.

    In the GitHub actions pipeline, we specify the platform as linux/amd64 same as when we are building the image locally.

    From the pod description, we can get this exit code - 132 which refers to an unsupported processor instruction.

    Last State:     Terminated                                                                                                                                                                                                                     
       Reason:       Error                                                                                                                                                                                                                         
       Exit Code:    132
    

    We tried to log this issue and found out the service will fail at index.build().

    Did someone encounter a similar issue or could help us figure this one out?

    Thanks!

    opened by williambrach 4
  • cmake: missing install instruction

    cmake: missing install instruction

    Hi there,

    Typically to be able to use a C++ library with cmake externally, there needs to be an 'install' section. I am trying to use this library with cpm and the library headers are not installed to the expected location.

    I see there are some directives to copy things over but I can't make this work. I will try to create a reproducible example.

    opened by bsergean 0
Releases(v1.17.1)
Owner
Spotify
Spotify
A TensorFlow recommendation algorithm and framework in Python.

TensorRec A TensorFlow recommendation algorithm and framework in Python. NOTE: TensorRec is not under active development TensorRec will not be receivi

James Kirk 1.2k Jan 4, 2023
Fast Python Collaborative Filtering for Implicit Feedback Datasets

Implicit Fast Python Collaborative Filtering for Implicit Datasets. This project provides fast Python implementations of several different popular rec

Ben Frederickson 3k Dec 31, 2022
A Python implementation of LightFM, a hybrid recommendation algorithm.

LightFM Build status Linux OSX (OpenMP disabled) Windows (OpenMP disabled) LightFM is a Python implementation of a number of popular recommendation al

Lyst 4.2k Jan 2, 2023
EXEMPLO DE SISTEMA ESPECIALISTA PARA RECOMENDAR SERIADOS EM PYTHON

exemplo-de-sistema-especialista EXEMPLO DE SISTEMA ESPECIALISTA PARA RECOMENDAR SERIADOS EM PYTHON Resumo O objetivo de auxiliar o usuário na escolha

Josue Lopes 3 Aug 31, 2021
QRec: A Python Framework for quick implementation of recommender systems (TensorFlow Based)

QRec is a Python framework for recommender systems (Supported by Python 3.7.4 and Tensorflow 1.14+) in which a number of influential and newly state-of-the-art recommendation models are implemented. QRec has a lightweight architecture and provides user-friendly interfaces. It can facilitate model implementation and evaluation.

Yu 1.4k Dec 27, 2022
Books Recommendation With Python

Books-Recommendation Business Problem During the last few decades, with the rise

Çağrı Karadeniz 7 Mar 12, 2022
Persine is an automated tool to study and reverse-engineer algorithmic recommendation systems.

Persine, the Persona Engine Persine is an automated tool to study and reverse-engineer algorithmic recommendation systems. It has a simple interface a

Jonathan Soma 87 Nov 29, 2022
E-Commerce recommender demo with real-time data and a graph database

?? E-Commerce recommender demo ?? This is a simple stream setup that uses Memgraph to ingest real-time data from a simulated online store. Data is str

g-despot 3 Feb 23, 2022
An open source movie recommendation WebApp build by movie buffs and mathematicians that uses cosine similarity on the backend.

Movie Pundit Find your next flick by asking the (almost) all-knowing Movie Pundit Jump to Project Source » View Demo · Report Bug · Request Feature Ta

Kapil Pramod Deshmukh 8 May 28, 2022
Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given quer

Spotify 10.6k Jan 1, 2023
GPU implementation of $k$-Nearest Neighbors and Shared-Nearest Neighbors

GPU implementation of kNN and SNN GPU implementation of $k$-Nearest Neighbors and Shared-Nearest Neighbors Supported by numba cuda and faiss library E

Hyeon Jeon 7 Nov 23, 2022
Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memories using approximate nearest neighbors, in Pytorch

Memorizing Transformers - Pytorch Implementation of Memorizing Transformers (ICLR 2022), attention net augmented with indexing and retrieval of memori

Phil Wang 364 Jan 6, 2023
Weighted K Nearest Neighbors (kNN) algorithm implemented on python from scratch.

kNN_From_Scratch I implemented the k nearest neighbors (kNN) classification algorithm on python. This algorithm is used to predict the classes of new

null 1 Dec 14, 2021
My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

kNN-vs-RFR My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data In many areas, rental bikes have been launched to

null 1 Oct 28, 2021
Implementation of K-Nearest Neighbors Algorithm Using PySpark

KNN With Spark Implementation of KNN using PySpark. The KNN was used on two separate datasets (https://archive.ics.uci.edu/ml/datasets/iris and https:

Zachary Petroff 4 Dec 30, 2022
Optimal space decomposition based-product quantization for approximate nearest neighbor search

Optimal space decomposition based-product quantization for approximate nearest neighbor search Abstract Product quantization(PQ) is an effective neare

Mylove 1 Nov 19, 2021
Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval This repo provides personal implementation of paper Approximate Ne

John 8 Oct 7, 2022
Visual disk-usage analyser for docker images

whaler What? A command-line tool for visually investigating the disk usage of docker images Why? Large images are slow to move and expensive to store.

Treebeard Technologies 194 Sep 1, 2022