Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Spotify

Last update: Jan 4, 2023

Related tags

Deep Learning python golang c-plus-plus lua nearest-neighbor-search locality-sensitive-hashing approximate-nearest-neighbor-search

Overview

Annoy

https://ci.appveyor.com/api/projects/status/github/spotify/annoy?svg=true&pendingText=windows%20-%20Pending&passingText=windows%20-%20OK&failingText=windows%20-%20Failing

https://img.shields.io/pypi/v/annoy.svg?style=flat

Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings to search for points in space that are close to a given query point. It also creates large read-only file-based data structures that are mmapped into memory so that many processes may share the same data.

Install

To install, simply do pip install --user annoy to pull down the latest version from PyPI.

For the C++ version, just clone the repo and #include "annoylib.h".

Background

There are some other libraries to do nearest neighbor search. Annoy is almost as fast as the fastest libraries, (see below), but there is actually another feature that really sets Annoy apart: it has the ability to use static files as indexes. In particular, this means you can share index across processes. Annoy also decouples creating indexes from loading them, so you can pass around indexes as files and map them into memory quickly. Another nice thing of Annoy is that it tries to minimize memory footprint so the indexes are quite small.

Why is this useful? If you want to find nearest neighbors and you have many CPU's, you only need to build the index once. You can also pass around and distribute static files to use in production environment, in Hadoop jobs, etc. Any process will be able to load (mmap) the index into memory and will be able to do lookups immediately.

We use it at Spotify for music recommendations. After running matrix factorization algorithms, every user/item can be represented as a vector in f-dimensional space. This library helps us search for similar users/items. We have many millions of tracks in a high-dimensional space, so memory usage is a prime concern.

Annoy was built by Erik Bernhardsson in a couple of afternoons during Hack Week.

Summary of features

Euclidean distance, Manhattan distance, cosine distance, Hamming distance, or Dot (Inner) Product distance
Cosine distance is equivalent to Euclidean distance of normalized vectors = sqrt(2-2*cos(u, v))
Works better if you don't have too many dimensions (like <100) but seems to perform surprisingly well even up to 1,000 dimensions
Small memory usage
Lets you share memory between multiple processes
Index creation is separate from lookup (in particular you can not add more items once the tree has been created)
Native Python support, tested with 2.7, 3.6, and 3.7.
Build index on disk to enable indexing big datasets that won't fit into memory (contributed by Rene Hollander)

Python code example

from annoy import AnnoyIndex
import random

f = 40
t = AnnoyIndex(f, 'angular')  # Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

# ...

u = AnnoyIndex(f, 'angular')
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors

Right now it only accepts integers as identifiers for items. Note that it will allocate memory for max(id)+1 items because it assumes your items are numbered 0 … n-1. If you need other id's, you will have to keep track of a map yourself.

Full Python API

AnnoyIndex(f, metric) returns a new index that's read-write and stores vector of f dimensions. Metric can be "angular", "euclidean", "manhattan", "hamming", or "dot".
a.add_item(i, v) adds item i (any nonnegative integer) with vector v. Note that it will allocate memory for max(i)+1 items.
a.build(n_trees, n_jobs=-1) builds a forest of n_trees trees. More trees gives higher precision when querying. After calling build, no more items can be added. n_jobs specifies the number of threads used to build the trees. n_jobs=-1 uses all available CPU cores.
a.save(fn, prefault=False) saves the index to disk and loads it (see next function). After saving, no more items can be added.
a.load(fn, prefault=False) loads (mmaps) an index from disk. If prefault is set to True, it will pre-read the entire file into memory (using mmap with MAP_POPULATE). Default is False.
a.unload() unloads.
a.get_nns_by_item(i, n, search_k=-1, include_distances=False) returns the n closest items. During the query it will inspect up to search_k nodes which defaults to n_trees * n if not provided. search_k gives you a run-time tradeoff between better accuracy and speed. If you set include_distances to True, it will return a 2 element tuple with two lists in it: the second one containing all corresponding distances.
a.get_nns_by_vector(v, n, search_k=-1, include_distances=False) same but query by vector v.
a.get_item_vector(i) returns the vector for item i that was previously added.
a.get_distance(i, j) returns the distance between items i and j. NOTE: this used to return the squared distance, but has been changed as of Aug 2016.
a.get_n_items() returns the number of items in the index.
a.get_n_trees() returns the number of trees in the index.
a.on_disk_build(fn) prepares annoy to build the index in the specified file instead of RAM (execute before adding items, no need to save after build)
a.set_seed(seed) will initialize the random number generator with the given seed. Only used for building up the tree, i. e. only necessary to pass this before adding the items. Will have no effect after calling a.build(n_trees) or a.load(fn).

Notes:

There's no bounds checking performed on the values so be careful.
Annoy uses Euclidean distance of normalized vectors for its angular distance, which for two vectors u,v is equal to sqrt(2(1-cos(u,v)))

The C++ API is very similar: just #include "annoylib.h" to get access to it.

Tradeoffs

There are just two main parameters needed to tune Annoy: the number of trees n_trees and the number of nodes to inspect during searching search_k.

n_trees is provided during build time and affects the build time and the index size. A larger value will give more accurate results, but larger indexes.
search_k is provided in runtime and affects the search performance. A larger value will give more accurate results, but will take longer time to return.

If search_k is not provided, it will default to n * n_trees where n is the number of approximate nearest neighbors. Otherwise, search_k and n_trees are roughly independent, i.e. the value of n_trees will not affect search time if search_k is held constant and vice versa. Basically it's recommended to set n_trees as large as possible given the amount of memory you can afford, and it's recommended to set search_k as large as possible given the time constraints you have for the queries.

You can also accept slower search times in favour of reduced loading times, memory usage, and disk IO. On supported platforms the index is prefaulted during load and save, causing the file to be pre-emptively read from disk into memory. If you set prefault to False, pages of the mmapped index are instead read from disk and cached in memory on-demand, as necessary for a search to complete. This can significantly increase early search times but may be better suited for systems with low memory compared to index size, when few queries are executed against a loaded index, and/or when large areas of the index are unlikely to be relevant to search queries.

How does it work

Using random projections and by building up a tree. At every intermediate node in the tree, a random hyperplane is chosen, which divides the space into two subspaces. This hyperplane is chosen by sampling two points from the subset and taking the hyperplane equidistant from them.

We do this k times so that we get a forest of trees. k has to be tuned to your need, by looking at what tradeoff you have between precision and performance.

Hamming distance (contributed by Martin Aumüller) packs the data into 64-bit integers under the hood and uses built-in bit count primitives so it could be quite fast. All splits are axis-aligned.

Dot Product distance (contributed by Peter Sobot) reduces the provided vectors from dot (or "inner-product") space to a more query-friendly cosine space using a method by Bachrach et al., at Microsoft Research, published in 2014.

More info

Dirk Eddelbuettel provides an R version of Annoy.
Andy Sloane provides a Java version of Annoy although currently limited to cosine and read-only.
Pishen Tsai provides a Scala wrapper of Annoy which uses JNA to call the C++ library of Annoy.
Atsushi Tatsuma provides Ruby bindings for Annoy.
There is experimental support for Go provided by Taneli Leppä.
Boris Nagaev wrote Lua bindings.
During part of Spotify Hack Week 2016 (and a bit afterward), Jim Kang wrote Node bindings for Annoy.
Min-Seok Kim built a Scala version of Annoy.
Presentation from New York Machine Learning meetup about Annoy
Annoy is available as a conda package on Linux, OS X, and Windows.
ann-benchmarks is a benchmark for several approximate nearest neighbor libraries. Annoy seems to be fairly competitive, especially at higher precisions:

Source code

It's all written in C++ with a handful of ugly optimizations for performance and memory usage. You have been warned :)

The code should support Windows, thanks to Qiang Kou and Timothy Riley.

To run the tests, execute python setup.py nosetests. The test suite includes a big real world dataset that is downloaded from the internet, so it will take a few minutes to execute.

Discuss

Feel free to post any questions or comments to the annoy-user group. I'm @fulhack on Twitter.

Comments

Failures in _annoy_test.py
Hello!

I ran tests from _annoy_test.py and got 7 failures:

FAIL: test_get_nns_by_item (_annoy_test.AngularIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 41, in test_get_nns_by_item self.assertEquals(i.get_nns_by_item(0, 3), [0,1,2]) AssertionError: Lists differ: [0, 1] != [0, 1, 2]

Second list contains 1 additional elements. First extra element 2: 2

[0, 1]

[0, 1, 2] ? +++

FAIL: test_get_nns_by_vector (_annoy_test.AngularIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 31, in test_get_nns_by_vector self.assertEquals(i.get_nns_by_vector([3,2,1], 3), [0,1,2]) AssertionError: Lists differ: [0, 1] != [0, 1, 2]

Second list contains 1 additional elements. First extra element 2: 2

[0, 1]

[0, 1, 2] ? +++

FAIL: test_large_index (_annoy_test.AngularIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 93, in test_large_index self.assertEquals(i.get_nns_by_item(j, 2), [j, j+1]) AssertionError: Lists differ: [5, 4] != [504, 505]

First differing element 0: 5 504

[5, 4]

[504, 505]

FAIL: test_large_index (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 130, in test_large_index self.assertEquals(i.get_nns_by_item(j+1, 2), [j+1, j]) AssertionError: Lists differ: [228, 229] != [535, 534]

First differing element 0: 228 535

[228, 229]'ve added test/

[535, 534]

FAIL: test_precision_10 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 154, in test_precision_10 self.assertTrue(self.precision(10) == 1.0) AssertionError: False is not true

FAIL: test_precision_100 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 157, in test_precision_100 self.assertTrue(self.precision(100) >= 0.99) AssertionError: False is not true

FAIL: test_precision_1000 (_annoy_test.EuclideanIndexTest)

Traceback (most recent call last): File "/home/dina/Downloads/annoy-master/test/_annoy_test.py", line 160, in test_precision_1000 self.assertTrue(self.precision(1000) >= 0.99) AssertionError: False is not true

Ran 14 tests in 2.622s

FAILED (failures=7)

Process finished with exit code 7
opened by dina-iofe 38
Does windows support now?

I am working on a win10 desktop, encountering the same issue that C:\Users\v-xusha\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2.

opened by cadobe 31
OSError: Index size is not a multiple of vector size

i use annoy to build ,which the number of words is 5844240 and the vector size is 200. it raise a error: 【OSError: Index size is not a multiple of vector size】。please help me

opened by dtMndas 28

Windows support?

The readme suggests that pip install annoy should work on Windows, but I get the following error:

c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or directory

My googling tells me that unistd.h is a Linux-only file.

Is there an easy fix for this?

Thanks for the awesome package!

And in case the full error message is useful:

C:\Anaconda>pip install annoy
Collecting annoy
  Downloading annoy-1.5.2.tar.gz (625kB)
    100% |################################| 626kB 409kB/s
Building wheels for collected packages: annoy
  Running setup.py bdist_wheel for annoy
  Complete output from command C:\Anaconda\python.exe -c "import setuptools;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\set
up.py';exec(compile(open(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" bdist_wheel -d c:\users\s\appdata\local\temp\tmpwwwitjpip-wheel-:

  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-2.7
  creating build\lib.win-amd64-2.7\annoy
  copying annoy\__init__.py -> build\lib.win-amd64-2.7\annoy
  running build_ext
  building 'annoy.annoylib' extension
  creating build\temp.win-amd64-2.7
  creating build\temp.win-amd64-2.7\Release
  creating build\temp.win-amd64-2.7\Release\src
  C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inclu
de -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
  cl : Command line warning D9002 : ignoring unknown option '-O3'
  cl : Command line warning D9002 : ignoring unknown option '-march=native'
  cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
  annoymodule.cc
  C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but unw
ind semantics are not enabled. Specify /EHsc
  c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or d
irectory
  error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

  ----------------------------------------
  Failed building wheel for annoy
Failed to build annoy
Installing collected packages: annoy
  Running setup.py install for annoy
    Complete output from command C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9
\\annoy\\setup.py';exec(compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\
s\appdata\local\temp\pip-5cyqoc-record\install-record.txt --single-version-externally-managed --compile:
    running install
    running build
    running build_py
    running build_ext
    building 'annoy.annoylib' extension
    C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -IC:\Anaconda\inc
lude -IC:\Anaconda\PC /Tpsrc/annoymodule.cc /Fobuild\temp.win-amd64-2.7\Release\src/annoymodule.obj -O3 -march=native -ffast-math
    cl : Command line warning D9002 : ignoring unknown option '-O3'
    cl : Command line warning D9002 : ignoring unknown option '-march=native'
    cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
    annoymodule.cc
    C:\Program Files (x86)\Common Files\Microsoft\Visual C++ for Python\9.0\VC\Include\xlocale(342) : warning C4530: C++ exception handler used, but u
nwind semantics are not enabled. Specify /EHsc
    c:\users\s\appdata\local\temp\pip-build-jintu9\annoy\src\annoylib.h(21) : fatal error C1083: Cannot open include file: 'unistd.h': No such file or
 directory
    error: command 'C:\\Program Files (x86)\\Common Files\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

    ----------------------------------------
Command "C:\Anaconda\python.exe -c "import setuptools, tokenize;__file__='c:\\users\\s\\appdata\\local\\temp\\pip-build-jintu9\\annoy\\setup.py';exec(
compile(getattr(tokenize, 'open', open)(__file__).read().replace('\r\n', '\n'), __file__, 'exec'))" install --record c:\users\s\appdata\local\temp\pip
-5cyqoc-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\s\appdata\local\temp\pip-build-j
intu9\annoy

opened by sergeyf 28

On disk index build
Index build can now be performed directly on disk. The resulting index does not need to fit into memory anymore. This fixes #85 and probably fixes #250.

TODO:

[x] Add support for on_disk_build to LUA module

[x] Add support for on_disk_build to Go module

[x] Fix Windows support (there is not ftruncate on windows)

[ ] Fix WSL compatability (ftruncate to cut off extra allocated memory fails)

[ ] Check return value of ftruncate
opened by ReneHollander 25
First lookup for an ID is slow

I'm seeing a problem when using the library where calling get_nns_by_vector is very slow on first execution for a given value (15s for an index with ~2m vectors of length 200), and then much faster in subsequent calls (subsecond). I've also seen similar behavior in get_item_vector, but I'm not sure if it's related. I've set prefault to true when loading the index from disk.

The strangest part is that it only occurs on some computers, and I'm not able to repro on my local machine to get details on what is going on. If I load the same index on my computer, the execution time is always subsecond. I'm continuing to look into this to see if I can find a reliable method of reproducing.

Has anyone seen performance patterns like this? Do you have thoughts on what the problem may be? I'm not sure that it's Annoy specifically, but would be good to know if it is.

My current thoughts are that it has to do with the MMAPing and that the index is not all saved into RAM and the disk lookups are slow on some machines. I'm not very familiar with MMAPing and would appreciate outside thoughts on whether that's reasonable and/or how to verify if it's the problem.

opened by scottbreyfogle 24
Add dot product as a distance metric.
This PR:

adds a DotProduct metric, using the negative of the dot product between two vectors as a pseudo-distance metric.

makes this metric available under the string name "dot" in the standard Annoy interface

adds tests for this new metric, similar to the existing tests for the Euclidean metric.

You might ask - why add dot when we already have angular, which is basically just dot but divided by magnitude? Well, for some recommender systems (including some matrix factorization collaborative filters), the dot product between the user and item pair is a prediction of affinity between that user and item. (Here's a great example from @jfkirk.)
enhancement
opened by psobot 22
Windows 7 x64 And Anaconda v5.0.1 - Unable to Install annoy
Using latest (as of yesterday) Anaconda install using python 3.6.4, When I attempt to install via: pip install annoy

I get the following error:

(Seq2Seq) C:\Users\lrichards>pip install annoy Collecting annoy Using cached annoy-1.11.1.tar.gz Building wheels for collected packages: annoy Running setup.py bdist_wheel for annoy ... error Complete output from command C:\Users\lrichards\AppData\Local\Continuum\anaco da3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;file='C:\Us rs\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=geta tr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.clos ();exec(compile(code, file, 'exec'))" bdist_wheel -d C:\Users\LRICHA~1\AppD ta\Local\Temp\tmp25o7_qvdpip-wheel- --python-tag cp36: running bdist_wheel running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\annoy copying annoy_init_.py -> build\lib.win-amd64-3.6\annoy running build_ext building 'annoy.annoylib' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\src C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nologo Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3
nvs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Inclu e" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /Tps c/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj annoymodule.cc c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib.h 19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or d rectory error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bi \cl.exe' failed with exit status 2

Failed building wheel for annoy Running setup.py clean for annoy Failed to build annoy Installing collected packages: annoy Running setup.py install for annoy ... error Complete output from command C:\Users\lrichards\AppData\Local\Continuum\ana onda3\envs\Seq2Seq\python.exe -u -c "import setuptools, tokenize;file='C:\ sers\LRICHA~1\AppData\Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=ge attr(tokenize, 'open', open)(file);code=f.read().replace('\r\n', '\n');f.cl se();exec(compile(code, file, 'exec'))" install --record C:\Users\LRICHA~1
ppData\Local\Temp\pip-s36zmsav-record\install-record.txt --single-version-exter ally-managed --compile: running install running build running build_py creating build creating build\lib.win-amd64-3.6 creating build\lib.win-amd64-3.6\annoy copying annoy_init_.py -> build\lib.win-amd64-3.6\annoy running build_ext building 'annoy.annoylib' extension creating build\temp.win-amd64-3.6 creating build\temp.win-amd64-3.6\Release creating build\temp.win-amd64-3.6\Release\src C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\bin\cl.exe /c /nolog /Ox /W3 /GL /DNDEBUG /MT -IC:\Users\lrichards\AppData\Local\Continuum\anaconda \envs\Seq2Seq\include -IC:\Users\lrichards\AppData\Local\Continuum\anaconda3\en s\Seq2Seq\include "-IC:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\Inc ude" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.6.1\include\um" /EHsc /T src/annoymodule.cc /Fobuild\temp.win-amd64-3.6\Release\src/annoymodule.obj annoymodule.cc c:\users\lrichards\appdata\local\temp\pip-build-xg1qh4j5\annoy\src\annoylib h(19) : fatal error C1083: Cannot open include file: 'stdio.h': No such file or directory error: command 'C:\Program Files (x86)\Microsoft Visual Studio 12.0\VC\ in\cl.exe' failed with exit status 2

----------------------------------------

Command "C:\Users\lrichards\AppData\Local\Continuum\anaconda3\envs\Seq2Seq\pyth n.exe -u -c "import setuptools, tokenize;file='C:\Users\LRICHA~1\AppData \Local\Temp\pip-build-xg1qh4j5\annoy\setup.py';f=getattr(tokenize, 'open', pen)(file);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, file, 'exec'))" install --record C:\Users\LRICHA~1\AppData\Local\Temp\pip-s 6zmsav-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\LRICHA~1\AppData\Local\Temp\pip-build-xg1q 4j5\annoy\

There is no annoy package available via Conda for Windows and no package listed in Anaconda Navigator.

How can I get this lib installed under Window?
opened by wergeld 22
Windows build issue

Hi,

I'm trying to build with mingw32 on windows (Anaconda python v3.6) and getting errors that look like they're in the python rather than the C:

C:\annoy>python setup.py build --compiler=mingw32 running build running build_py running build_ext Traceback (most recent call last): File "setup.py", line 69, in setup_requires=['nose>=1.0'] File "C:\Users\Ian\Anaconda3\lib\distutils\core.py", line 148, in setup dist.run_commands() File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 955, in run_commands self.run_command(cmd) File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "C:\Users\Ian\Anaconda3\lib\distutils\command\build.py", line 135, in run self.run_command(cmd_name) File "C:\Users\Ian\Anaconda3\lib\distutils\cmd.py", line 313, in run_command self.distribution.run_command(command) File "C:\Users\Ian\Anaconda3\lib\distutils\dist.py", line 974, in run_command cmd_obj.run() File "C:\Users\Ian\Anaconda3\lib\site-packages\setuptools\command\build_ext.py", line 75, in run _build_ext.run(self) File "C:\Users\Ian\Anaconda3\lib\site-packages\Cython\Distutils\old_build_ext.py", line 185, in run _build_ext.build_ext.run(self) File "C:\Users\Ian\Anaconda3\lib\distutils\command\build_ext.py", line 308, in run force=self.force) File "C:\Users\Ian\Anaconda3\lib\distutils\ccompiler.py", line 1031, in new_compiler return klass(None, dry_run, force) File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 282, in init CygwinCCompiler.init (self, verbose, dry_run, force) File "C:\Users\Ian\Anaconda3\lib\distutils\cygwinccompiler.py", line 126, in init if self.ld_version >= "2.10.90": TypeError: '>=' not supported between instances of 'NoneType' and 'str'

Any ideas what's causing it ?

opened by MonkeyChap 22

Rewrite existing file issue

Hello! I'm using annoy 1.15.2 on python 3.7.3. Here is the issue:

>>>from annoy import AnnoyIndex
>>>a=AnnoyIndex(1)
>>>a.add_item(1,[1])
>>>a.add_item(2,[2])
>>>a.build(1)
True
>>>a.save("temp.ann")
True
>>>b=AnnoyIndex(1)
>>>b.add_item(3,[3])
>>>b.add_item(4,[4])
>>>b.build(1)
True
>>>b.save("temp.ann")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
OSError: [Errno 22] Invalid argument

Could you tell why I get the error?

opened by natural-chop 21

Multithreaded build

Issue: https://github.com/spotify/annoy/issues/70

This is my work-in-progress branch for supporting multithreaded builds in Annoy. Locally (Ubuntu 18.04) all C++, Python and Lua tests seem to pass (I've re-ran them multiple times). The only thing remaining is fixing the Go build and I would appreciate some help with this :smile:

I've ran some speed tests on glove-angular-25 index with n_trees=32. Reference time on master (without threads) was 49.9s. Here are results for multiple threads: | Num. threads | Time (s) | |--------------|-------------| | 1 | 51.28759694 | | 2 | 26.675179 | | 4 | 14.82837749 | | 8 | 8.897883892 | | 16 | 7.353658915 |

If you have any questions or concerns feel free to comment.

opened by novoselrok 20
Decrease memory usage by decreasing the number of trees, or decreasing dimensionality?
I have about ~1m vectors I would like to search through, each with 1000 dimensions. Ideally, I would like to keep the size of the index under 500mb. I believe I have two levers to pull for allowing the index to fit in memory:

Reduce the number of trees. For example, in my testing, reducing the # of trees from 70 to 10 cut the memory usage in ~half.

Reduce dimensionality. For example, reducing dimensionality from 1000 to 500 to also cut the memory usage in ~half. I could do something like TruncatedSVD (sklearn docs) to do this (but lmk if there's a better method).

Of course, I could also do some combination of these two. What are the tradeoffs between both methods of reducing memory usage?
opened by gusgordon 0
Dot index recall issue on last.fm dataset
Hi!

I had pretty low recall for dot index on my data. Then I decided to test the lastfm benchmark manually.

Then I realized that this benchmark is actually for an angular index with manually computed 65th component -- sqrt(max_norm^2 - ||x||^2). I've reproduced results for an angular index, but for a dot index (with the 65th being deleted) recall was again pretty low!

There's the notebook with calculations.

What's going on in the notebook:

Calculate exact dot products using brute force, then calculate top candidates

Construct an angular index with the additional component sqrt(max_norm^2 - ||x||^2), compute recalls for various search_k's

Construct a dot index without the additional component sqrt(max_norm^2 - ||x||^2), again compute recalls for various search_k's

The results are as follows (10 trees):

| search_k | 1000 | 10000 | 50000 | |----------------------------|------------|-----------|-----------| | Angular index recall (65 components) | 0.563 | 0.736 | 0.783 | | Dot index recall (64 components) | 0.003 | 0.034 | 0.175 |

But, as I understand, results should be at least comparable.

May there be a bug? If my results are correct, I would be glad to contribute with some fixes. However, by now I don't see what can be wrong in the source code of dot index.
opened by pkorobov 2
How realistic is serializing the internal C++ AnnoyIndex state?
Basically what I would want is to run something like this:

import os import random from annoy import AnnoyIndex num_rows = 10000 num_trees = 10 num_dims = 512 try: os.remove("annoy_idx.ann") except FileNotFoundError: pass annoy_idx = AnnoyIndex(num_dims, "angular") annoy_idx.on_disk_build("annoy_idx.ann") for idx in range(num_rows): vector = [random.gauss(0, 1) for _ in range(num_dims)] annoy_idx.add_item(idx, vector) annoy_idx.serialize("annoy_idx.state") # XXX - This is the magic I'm looking for

and then (after that program is done and exited) I would like to continue appending data like something like this (this adds 10,000 new rows with indices 10,000, ..., 19,999):

import os import random from annoy import AnnoyIndex num_rows = 10000 num_trees = 10 num_dims = 512 try: os.remove("annoy_idx.ann") except FileNotFoundError: pass annoy_idx = AnnoyIndex(num_dims, "angular") annoy_idx.deserialize("annoy_idx.state") # XXX - This is the magic I'm looking for for idx in range(num_rows): vector = [random.gauss(0, 1) for _ in range(num_dims)] annoy_idx.add_item(idx +num_rows, vector) # XXX - Note the increase in idx variable

So basically what I want is for there to be a serialize/deserialize ability so that I can continue the flow. It seems to me like the protected data here would need to be serialized:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L847-L885

In my case it seems to basically serializing the node here:

https://github.com/spotify/annoy/blob/master/src/annoylib.h#L442-L463

So my question is the following:

How realistic is this? More specifically, assuming that I am able to successfully serialize/deserialize the state, does it seem like this would play well with the mmap in the on_disk_build() step? This is maybe too general a question, but basically my point is: is this totally crazy? Are there obvious flaws with my thinking if I decided to go this route?

Thanks for any help!
opened by ApproximateIdentity 0
Annoy build index failing in docker image built from Github Actions
Hi!

We tried to implement Annoy angular index into our FastAPI python service which runs on Azure k8s service.

An index is being built every time from the pandas dataframe when the service starts. The index build code is snipped below.

Data from which we build the index has 10k rows and a float vector of size 9.

n_trees = 10 f = 9 ANNOY_INDEX = AnnoyIndex(9, "angular") for i, row in dfMl.iterrows(): wineId = int(row[:1].values[0]) wineValues = row[1:].values ANNOY_INDEX.add_item(wineId, wineValues) ANNOY_INDEX.build(n_trees, n_jobs=1)

Locally on Apple M1 machine, the service works fine, and when we build a docker image locally and push it to ACR from which we deploy the image to k8s, everything works too.

docker buildx build --platform linux/amd64 -t apiservice:1.0.0 . docker push apiservice:1.0.0

The issue appears when we try to build the image with the GitHub Actions pipeline (docker push and build). The build will not fail and we are able to deploy image built in GitHub actions to our container registry. But when we try to deploy this image into k8s, the pod fails with the status CrashLoopBackOff.

In the GitHub actions pipeline, we specify the platform as linux/amd64 same as when we are building the image locally.

From the pod description, we can get this exit code - 132 which refers to an unsupported processor instruction.

Last State: Terminated Reason: Error Exit Code: 132

We tried to log this issue and found out the service will fail at index.build().

Did someone encounter a similar issue or could help us figure this one out?

Thanks!
opened by williambrach 4
cmake: missing install instruction

Hi there,

Typically to be able to use a C++ library with cmake externally, there needs to be an 'install' section. I am trying to use this library with cpm and the library headers are not installed to the expected location.

I see there are some directives to copy things over but I can't make this work. I will try to create a reproducible example.

opened by bsergean 0

Releases(v1.17.1)

v1.17.1(Aug 8, 2022)

Add -fpermissive plus some other minor things
Source code(tar.gz)
Source code(zip)
v1.17.0(Sep 18, 2020)
Multithreaded building by @novoselrok

Misc updates of README to link to Ruby bindings and clean up docs

Better tree splitting, try multiple times ot find more balanced ones

Fix building on disk and saving large files for Windows

Misc compilation warnings cleanup

Source code(tar.gz)
Source code(zip)
v1.16.3(Dec 26, 2019)
mostly fixing windows issues with large files

some error message improvements as well

doing more stack allocations instead of heap

fixes to a few tests that were failing on other platforms

Source code(tar.gz)
Source code(zip)
v1.16.2(Oct 11, 2019)

Something is busted with version 1.16.1 where I think I must have accidentally uploaded it a long time ago.

Bumping straight to 1.16.2

This is same content as 1.16.1
Source code(tar.gz)
Source code(zip)
v1.16.1(Oct 11, 2019)

Mostly minor things to catch user errors and/or explain things better. No functionality change really, except preventing some questionable usage patterns that lead to lots of complexity.

#430 – AVX issues in certain GCC version #427 – Prevent saving an index that hasn't been built #425 – Prevent building an index that has already been built #426 – Handle lseek failures #419 – ftruncate issues #410 – swig update #405 – Use std::sort

The remaining changes were to documentation/test and don't change any functionality
Source code(tar.gz)
Source code(zip)
1.16(Jul 9, 2019)

Source code(tar.gz)
Source code(zip)
v1.15.2(Apr 17, 2019)
Fix #379 error handling in save()

Fix compile issue in linux

Fix popcount #366

Some more misc

Source code(tar.gz)
Source code(zip)
1.15(Apr 17, 2019)

Source code(tar.gz)
Source code(zip)
v1.15.1(Feb 22, 2019)

See #349
Source code(tar.gz)
Source code(zip)
v1.15.0(Dec 10, 2018)
Changes:

#274 – on-disk build

Fix for xcode: #334

Minor fix when saving an index to the same file as a current index: #335

Source code(tar.gz)
Source code(zip)
v1.14.0(Nov 4, 2018)
Fixes to the Euclidean distance function (avoid catastrophic cancellation)

Don't MAP_POPULATE by default

Source code(tar.gz)
Source code(zip)
v1.13.0(Aug 31, 2018)

Main (only?) change is that dot products are now supported thanks to @psobot!
Source code(tar.gz)
Source code(zip)
v1.12.0(May 7, 2018)

Finally squashed a remaining issue with holes (see #295) and together with a few other holes-related fixes it felt worthy of bumping the minor version number.
Source code(tar.gz)
Source code(zip)
v1.11.5(Mar 31, 2018)

Fixes #279 and #261
Source code(tar.gz)
Source code(zip)
v1.11.4(Feb 6, 2018)

Includes #276
Source code(tar.gz)
Source code(zip)
v1.11.1(Jan 20, 2018)

This version features AVX instructions thanks to @ReneHollander
Source code(tar.gz)
Source code(zip)
v1.10.0(Nov 13, 2017)
Two big updates

Windows support now official – with CI pipeline to prove it. Only tested on Python 3.6 so far

Experimental Hamming distances

Source code(tar.gz)
Source code(zip)
v1.9.5(Sep 30, 2017)
Fixes a problem with holes in the index reported by @yonromai in #223

Source code(tar.gz)
Source code(zip)

Owner

Spotify

GitHub

Weighted K Nearest Neighbors (kNN) algorithm implemented on python from scratch.

kNN_From_Scratch I implemented the k nearest neighbors (kNN) classification algorithm on python. This algorithm is used to predict the classes of new

1 Dec 14, 2021

TorchPQ is a python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using Product Quantization (PQ) algorithm.

Efficient implementations of Product Quantization and its variants using Pytorch and CUDA

146 Dec 28, 2022

Optimal space decomposition based-product quantization for approximate nearest neighbor search

Optimal space decomposition based-product quantization for approximate nearest neighbor search Abstract Product quantization(PQ) is an effective neare

1 Nov 19, 2021

Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval This repo provides personal implementation of paper Approximate Ne

8 Oct 7, 2022

The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory

Simple-DMA a simple Dual Memory Architecture for classifications. based on the paper Dual-Memory Deep Learning Architectures for Lifelong Learning of

1 Jan 27, 2022

This is the official PyTorch implementation for "Mesa: A Memory-saving Training Framework for Transformers".

Mesa: A Memory-saving Training Framework for Transformers This is the official PyTorch implementation for Mesa: A Memory-saving Training Framework for

105 Dec 6, 2022

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

In-Place Activated BatchNorm In-Place Activated BatchNorm for Memory-Optimized Training of DNNs In-Place Activated BatchNorm (InPlace-ABN) is a novel

1.3k Dec 29, 2022

[ICSE2020] MemLock: Memory Usage Guided Fuzzing

MemLock: Memory Usage Guided Fuzzing This repository provides the tool and the evaluation subjects for the paper "MemLock: Memory Usage Guided Fuzzing

54 Jan 7, 2023

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects This repo contains the code of Segcache described in the followi

78 Jan 7, 2023

Code for "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" paper

UNICORN ?? Webpage | Paper | BibTex PyTorch implementation of "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" pap

118 Jan 6, 2023

PyTorch Code of "Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics"

Memory In Memory Networks It is based on the paper Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spati

12 May 30, 2022

Episodic-memory - Ego4D Episodic Memory Benchmark

Ego4D Episodic Memory Benchmark EGO4D is the world's largest egocentric (first p

3 Feb 18, 2022

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Memory Efficient Attention Pytorch Implementation of a memory efficient multi-head attention as proposed in the paper, Self-attention Does Not Need O(

180 Jan 5, 2023

Simple tools for logging and visualizing, loading and training

TNT TNT is a library providing powerful dataloading, logging and visualization utilities for Python. It is closely integrated with PyTorch and is desi

1.5k Jan 2, 2023

Calling Julia from Python - an experiment on data loading

Calling Julia from Python - an experiment on data loading See the slides. TLDR After reading Patrick's blog post, we decided to try to replace C++ wit

8 Jun 7, 2022

Python code for loading the Aschaffenburg Pose Dataset.

Aschaffenburg Pose Dataset (APD) This repository contains Python code for loading and filtering the Aschaffenburg Pose Dataset. The dataset itself and

1 Nov 26, 2021

pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

527 Dec 4, 2022

A gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor.

OpenHands OpenHands is a gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor. Currently the system can iden

12 Jan 10, 2022

Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning

structshot Code and data for paper "Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning", Yi Yang and Arz

47 Dec 27, 2022

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Related tags

Overview

Annoy

Install

Background

Summary of features

Python code example

Full Python API

Tradeoffs

How does it work

More info

Source code

Discuss

Comments

FAIL: test_get_nns_by_item (_annoy_test.AngularIndexTest)

FAIL: test_get_nns_by_vector (_annoy_test.AngularIndexTest)

FAIL: test_large_index (_annoy_test.AngularIndexTest)

FAIL: test_large_index (_annoy_test.EuclideanIndexTest)

FAIL: test_precision_10 (_annoy_test.EuclideanIndexTest)

FAIL: test_precision_100 (_annoy_test.EuclideanIndexTest)

FAIL: test_precision_1000 (_annoy_test.EuclideanIndexTest)

Releases(v1.17.1)

v1.17.1(Aug 8, 2022)

v1.17.0(Sep 18, 2020)

v1.16.3(Dec 26, 2019)

v1.16.2(Oct 11, 2019)

v1.16.1(Oct 11, 2019)

1.16(Jul 9, 2019)

v1.15.2(Apr 17, 2019)

1.15(Apr 17, 2019)

v1.15.1(Feb 22, 2019)

v1.15.0(Dec 10, 2018)

v1.14.0(Nov 4, 2018)

v1.13.0(Aug 31, 2018)

v1.12.0(May 7, 2018)

v1.11.5(Mar 31, 2018)

v1.11.4(Feb 6, 2018)

v1.11.1(Jan 20, 2018)

v1.10.0(Nov 13, 2017)

v1.9.5(Sep 30, 2017)

Owner

Spotify

Weighted K Nearest Neighbors (kNN) algorithm implemented on python from scratch.

TorchPQ is a python library for Approximate Nearest Neighbor Search (ANNS) and Maximum Inner Product Search (MIPS) on GPU using Product Quantization (PQ) algorithm.

Optimal space decomposition based-product quantization for approximate nearest neighbor search

Personal implementation of paper "Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval"

The Dual Memory is build from a simple CNN for the deep memory and Linear Regression fro the fast Memory

This is the official PyTorch implementation for "Mesa: A Memory-saving Training Framework for Transformers".

In-Place Activated BatchNorm for Memory-Optimized Training of DNNs

[ICSE2020] MemLock: Memory Usage Guided Fuzzing

Segcache: a memory-efficient and scalable in-memory key-value cache for small objects

Code for "Share With Thy Neighbors: Single-View Reconstruction by Cross-Instance Consistency" paper

PyTorch Code of "Memory In Memory: A Predictive Neural Network for Learning Higher-Order Non-Stationarity from Spatiotemporal Dynamics"

Episodic-memory - Ego4D Episodic Memory Benchmark

Implementation of a memory efficient multi-head attention as proposed in the paper, "Self-attention Does Not Need O(n²) Memory"

Simple tools for logging and visualizing, loading and training

Calling Julia from Python - an experiment on data loading

Python code for loading the Aschaffenburg Pose Dataset.

pyhsmm - library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and explicit-duration Hidden semi-Markov Models (HSMMs), focusing on the Bayesian Nonparametric extensions, the HDP-HMM and HDP-HSMM, mostly with weak-limit approximations.

A gesture recognition system powered by OpenPose, k-nearest neighbours, and local outlier factor.

Simple and Effective Few-Shot Named Entity Recognition with Structured Nearest Neighbor Learning