Framework for evaluating ANNS algorithms on billion scale datasets.

Harsha Vardhan Simhadri

Last update: Dec 24, 2022

Overview

Billion-Scale ANN

http://big-ann-benchmarks.com/

Install

The only prerequisites are Python (tested with 3.6) and Docker. Newer versions of Python also work, but probably require an updated requirements.txt on the host. (Suggestion: copy requirements.txt to requirements${PYTHON_VERSION}.txt and remove all fixed versions. The original requirements.txt has to be kept for the Docker containers.)

Clone the repo.
Run pip install -r requirements.txt (Use requirements_py38.txt if you have Python 3.8.)
Install docker by following instructions here. You might also want to follow the post-install steps for running docker in non-root user mode.
Run python install.py to build all the libraries inside Docker containers.

Storing Data

The framework assumes that all data is stored in data/. Please use a symlink if your datasets and indices are supposed to be stored somewhere else. The location of the linked folder matters a great deal for SSD-based search performance in T2. A local SSD such as the one found on Azure Ls-series VMs is better than remote disks, even premium ones. See T1/T2 for more details.
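As a minimal sketch of the symlink setup described above, the following Python helper links data/ inside a checkout to a directory on another disk. The helper name link_data_dir and any paths passed to it are hypothetical and not part of the framework; running ln -s from a shell achieves the same thing.

```python
from pathlib import Path


def link_data_dir(storage_dir, repo_dir):
    """Create <repo_dir>/data as a symlink to storage_dir (hypothetical helper)."""
    storage = Path(storage_dir).resolve()
    # Make sure the target directory exists on the fast disk.
    storage.mkdir(parents=True, exist_ok=True)
    link = Path(repo_dir) / "data"
    # Refuse to overwrite an existing data/ directory or link.
    if link.is_symlink() or link.exists():
        raise FileExistsError(f"{link} already exists; move or remove it first")
    link.symlink_to(storage, target_is_directory=True)
    return link
```

For example, link_data_dir("/mnt/nvme/bigann", ".") would make data/ point at a local SSD mount, where /mnt/nvme/bigann is only an illustrative path.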

Data sets

See http://big-ann-benchmarks.com/ for details on the different datasets.

Dataset Preparation

Before running experiments, datasets have to be downloaded. All preparation can be carried out by calling

python create_dataset.py --dataset [bigann-1B | deep-1B | text2image-1B | ssnpp-1B | msturing-1B | msspacev-1B]

Note that downloading the datasets can potentially take many hours.

For local testing, there are smaller random datasets, random-xs and random-range-xs. Furthermore, most datasets have 1M, 10M and 100M versions; run python create_dataset.py -h to get an overview.

Running the benchmark

Run python run.py --dataset $DS --algorithm $ALGO, where DS is the dataset you are running on and ALGO is the name of the algorithm. (Use python run.py --list-algorithms to get an overview.) python run.py -h lists further options.

The parameters used by the implementation to build and query the index can be found in algos.yaml.

Running the track 1 baseline

After running the installation, we can evaluate the baseline as follows.

for DS in bigann-1B  deep-1B  text2image-1B  ssnpp-1B  msturing-1B  msspacev-1B;
do
    python run.py --dataset $DS --algorithm faiss-t1;
done

On a 28-core Xeon E5-2690 v4 that provided 100MB/s downloads, carrying out the baseline experiments took roughly 7 days.

To evaluate the results, run

sudo chmod -R 777 results/
python data_export.py --output res.csv
python3.8 eval/show_operating_points.py --algorithm faiss-t1 --threshold 10000

Including your algorithm and Evaluating the Results

See Track T1/T2 for more details on evaluation for Tracks T1 and T2.

See Track T3 for more details on evaluation for Track T3.

Credits

This project is a version of ann-benchmarks by Erik Bernhardsson and contributors targeting billion-scale datasets.

Comments

Request code review of T3 integration

@maumueller Per our email conversation, can you validate my various code changes to accommodate T3? I've tested locally on 4/6 datasets and it looks good so far.

This is currently how I will likely be instructing T3 participants: https://github.com/harsha-simhadri/big-ann-benchmarks/blob/gw/T3/t3/README.md

opened by sourcesync 15
Support for non-python implementations

Hi, I have a question about using this framework w/ a non-python ANN implementation.

It looks like this is mostly a fork of ann-benchmarks. So the only option for using outside of Python is to hack together a client/server setup, as has been done for a few algos in ann-benchmarks. This obviously handicaps and complicates non-python implementations, as it introduces costs of context-switching, serialization, and data transfer among processes.

I asked and was told early on by project organizers that the big-ann challenge would support non-python implementations:

So I'm wondering if there has been progress here, or any idea of how it might work?

It seems like it wouldn't be terribly difficult to refactor the code so that the containers executed by runner.py can have any entrypoint, e.g., a program in another language. The interface between runner and algorithm would then simply be some standard file format for inputs and nearest neighbor results. If that sounds like a good idea I can try to implement it. Otherwise maybe we can use this ticket for discussing alternatives.

Thanks -Alex

opened by alexklibisz 14
a proposed recall-with-ties impl with related unit tests
@harsha-simhadri @maumueller @

Per our discussion, here is a new recall-with-ties approach with a bunch of recall related unit tests.

Please review if you get a chance, since it is fairly different from the existing implementation and could really use a second pair of eyes.

Highlights include:

retains the set intersection method to determine number of recalls, but changes how the true_ids set is presented to it

basically it groups ids in true_ids by grouping "close" values in true_dists

"close"ness is determined by the absolute value difference of consecutive distances in true_dists ( threshold is <= 1e-6 )

it uses the first element in a consecutive group as one of the subtraction operands

it also tracks the tie condition and surfaces the count of queries in the query set that have ties

recall_tests.py attempts to provide a comprehensive set of unit tests for both versions of recall (without and with ties)

there is a run script for these tests ( tests/tests.sh ) which could be incorporated into a github action

Lowlights include:

admittedly this recall-with-ties implementation could be faster using vectorization

there are some slow tests in recall_tests.py which may slow down github push validation further if incorporated as-is as a branch push action
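
The grouping idea in the highlights can be sketched as follows. This is an illustration under stated assumptions, not the implementation from this pull request: the function name recall_with_ties and its signature are hypothetical, the ground truth is assumed to be sorted by ascending distance, and ids in the tie group that straddles rank k are credited only up to the number of slots that group occupies in the top k.

```python
def recall_with_ties(true_ids, true_dists, run_ids, k, eps=1e-6):
    """Return (hits, has_tie) for one query (hypothetical helper).

    true_ids / true_dists: ground truth sorted by ascending distance,
    ideally longer than k so ties beyond rank k are visible.
    run_ids: ids returned by the algorithm under test.
    """
    # Find the first element of the group that contains rank k-1.
    # A new group starts whenever a distance differs from the group's
    # first element by more than eps.
    start = 0
    for i in range(1, k):
        if abs(true_dists[i] - true_dists[start]) > eps:
            start = i
    # Extend that group past rank k while distances stay within eps
    # of its first element.
    end = k
    while end < len(true_dists) and abs(true_dists[end] - true_dists[start]) <= eps:
        end += 1

    returned = set(run_ids[:k])
    strict = set(true_ids[:start])    # ids that must be present
    tied = set(true_ids[start:end])   # ids that are interchangeable
    open_slots = k - start            # top-k slots filled by the tie group
    hits = len(returned & strict) + min(len(returned & tied), open_slots)
    return hits, end > k
```

With true_dists = [0.1, 0.2, 0.3, 0.3, 0.3] and k = 3, any one of the last three ids earns full credit for the third slot, and the second return value reports that the query has a tie.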
opened by sourcesync 9
T2/kwai thu

Kuaishou Technology Billion-Scale ANN Challenge Track2: a joint team from Kuaishou Technology and Tsinghua University. This submission is for Track 2. It only optimizes the search process, currently for three datasets: msturing-1B, bigann-1B and msspacev-1B. This is the first commit; there may be further modifications before the deadline.

opened by qiaoyuKs 8
HttpANN algorithm to support language-agnostic implementations (Re: Issue #20)
Thanks @maumueller, @gosha1128 and others for the fruitful discussion over in #20. I think I've arrived at an implementation that could fit the purpose of language-agnostic (big) ANN.

The HttpANN algorithm is designed to make HTTP calls to a server. The server executes all indexing and querying, thus enabling language-agnostic ANN implementations with minimal overhead. The only requirements for the server are:

It should implement the JSON-over-HTTP API documented below (copied from httpann.py). Note that this is a 1:1 copy of the BaseANN Python Class API.

It should be able to read the vector dataset in the standard binary format used by this competition.

It could in theory even run remotely, although the intended use-case is that the server runs in the same container.

The overhead for data transfer and serialization is minimal. The server only needs to parse the 10k JSON-encoded query vectors and encode the resulting 10k lists of neighbors.

I also included an example implementation which uses scikit-learn. It's too slow for the large datasets, but it works on the smaller random-xs and random-range-xs. So it should be good enough to demonstrate that this algorithm works.

Here is the API that a server must implement:

| Method | Route | Request Body | Expected Status | Response Body |
| ------ | ----- | ------------ | --------------- | ------------- |
| POST | /init | dictionary of constructor arguments, e.g., {"metric": "euclidean", "dimension": 99 } | 200 | { } |
| POST | /load_index | { "dataset": <dataset name, e.g. "bigann-10m"> } | 200 | { "load_index": } |
| POST | /set_query_arguments | dictionary of query arguments | 200 | { } |
| POST | /query | { "X": , "k": } | 200 | { } |
| POST | /range_query | { "X": , "radius": } | 200 | { } |
| POST | /get_results | { } | 200 | { "get_results": } |
| POST | /get_additional | { } | 200 | { "get_additional": } |
| POST | /get_range_results | { } | 200 | { "get_range_results": <list of three 1-dimensional lists (lims, I, D)> } |
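
To make the request flow concrete, here is a toy server and client that exercise three of the routes above using only the Python standard library. It is a sketch, not the httpann.py code: the class ToyANNHandler, the helper post, the three hard-coded vectors, and the shapes assumed for "X" (a list of query vectors) and for the "get_results" value (one list of neighbor ids per query) are all assumptions made for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ToyANNHandler(BaseHTTPRequestHandler):
    # Shared state kept on the class for simplicity (assumption: one client).
    state = {"data": [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]], "results": []}

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        reply = {}
        if self.path == "/query":
            data = self.state["data"]
            results = []
            for q in body["X"]:
                # Brute-force squared Euclidean distance to every stored vector.
                dists = [sum((a - b) ** 2 for a, b in zip(q, v)) for v in data]
                order = sorted(range(len(data)), key=dists.__getitem__)
                results.append(order[: body["k"]])
            self.state["results"] = results
        elif self.path == "/get_results":
            reply = {"get_results": self.state["results"]}
        # Every other route (e.g. /init) is acknowledged with an empty object.
        payload = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, *args):
        pass  # keep the demo quiet


def post(port, route, body):
    """POST a JSON body to the local server and decode the JSON reply."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("POST", route, body=json.dumps(body),
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()


def demo():
    server = HTTPServer(("127.0.0.1", 0), ToyANNHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_port
        post(port, "/init", {"metric": "euclidean", "dimension": 2})
        post(port, "/query", {"X": [[0.9, 0.9]], "k": 2})
        return post(port, "/get_results", {})["get_results"]
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the ids of the two stored vectors nearest to the query, ordered by distance. A real server would replace the brute-force loop with its own index and would load the dataset from the binary files in /load_index.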
opened by alexklibisz 8
Count ties using extended ground truth list

Added functionality to give credit to tied candidates. This uses the fact that GT file has top 100 NNs computed while k=10 is what is evaluated. So if top 8 through top 15 candidates are tied for a query, we can use GT to credit any entries from there. This will not work if there are more than 100 ties, unless we recompute GT with k>100. Using this function, the difference in recall for msspacev-1B for diskann is as follows

With ties

diskann-t2,DiskANN,msspacev-1B,10,873.3891725887347,4128.170671305772,1000000.0,57414268.0,65737.32512600713,132.8194160185564,9124.044242052121,0.9774116523400191
diskann-t2,DiskANN,msspacev-1B,10,815.8064319257669,4391.558241233456,1000000.0,57414268.0,70377.31715901,142.64251603220086,9773.534383954155,0.9786396507026879
diskann-t2,DiskANN,msspacev-1B,10,954.1772918666221,3863.1387024150636,1000000.0,57414268.0,60171.48855815104,122.95968072042571,8459.167928776094,0.9760779096738983
diskann-t2,DiskANN,msspacev-1B,10,1123.4640002144592,3333.160165097558,1000000.0,57414268.0,51104.67980196974,103.29038750170555,7087.164108336744,0.9722915813890026
diskann-t2,DiskANN,msspacev-1B,10,1535.2989986923885,2538.5193682630647,1000000.0,57414268.0,37396.14762264525,73.94146541137945,5179.051030154182,0.9634329376449721
diskann-t2,DiskANN,msspacev-1B,10,1720.4498608048532,2273.3868058398143,1000000.0,57414268.0,33371.6601151868,64.20122117614955,4642.449686178196,0.9587835993996453
diskann-t2,DiskANN,msspacev-1B,10,1233.6608183762005,3069.4611474962476,1000000.0,57414268.0,46539.751562809,93.52844862873516,6449.107449856733,0.9701869286396507
diskann-t2,DiskANN,msspacev-1B,10,1979.1985895820392,2009.0835243553008,1000000.0,57414268.0,29008.846460488112,54.51992086232774,4017.1128735161687,0.9519238641015144
diskann-t2,DiskANN,msspacev-1B,10,1034.378157726947,3598.717198799291,1000000.0,57414268.0,55506.071518532684,113.13927548096602,7694.289875835721,0.9742870787283394
diskann-t2,DiskANN,msspacev-1B,10,1371.7472009562052,2804.5663255560103,1000000.0,57414268.0,41854.846111570834,83.73499113112294,5834.428264428981,0.9675808432255424

Without ties

diskann-t2,DiskANN,msspacev-1B,10,873.3891725887347,4128.170671305772,1000000.0,57414268.0,65737.32512600713,132.8194160185564,9124.044242052121,0.914753035884841
diskann-t2,DiskANN,msspacev-1B,10,815.8064319257669,4391.558241233456,1000000.0,57414268.0,70377.31715901,142.64251603220086,9773.534383954155,0.9160526674853322
diskann-t2,DiskANN,msspacev-1B,10,954.1772918666221,3863.1387024150636,1000000.0,57414268.0,60171.48855815104,122.95968072042571,8459.167928776094,0.9133817710465275
diskann-t2,DiskANN,msspacev-1B,10,1123.4640002144592,3333.160165097558,1000000.0,57414268.0,51104.67980196974,103.29038750170555,7087.164108336744,0.9097182425978987
diskann-t2,DiskANN,msspacev-1B,10,1535.2989986923885,2538.5193682630647,1000000.0,57414268.0,37396.14762264525,73.94146541137945,5179.051030154182,0.9009551098376314
diskann-t2,DiskANN,msspacev-1B,10,1720.4498608048532,2273.3868058398143,1000000.0,57414268.0,33371.6601151868,64.20122117614955,4642.449686178196,0.8958520944194296
diskann-t2,DiskANN,msspacev-1B,10,1233.6608183762005,3069.4611474962476,1000000.0,57414268.0,46539.751562809,93.52844862873516,6449.107449856733,0.9076204120616728
diskann-t2,DiskANN,msspacev-1B,10,1979.1985895820392,2009.0835243553008,1000000.0,57414268.0,29008.846460488112,54.51992086232774,4017.1128735161687,0.8887331150225133
diskann-t2,DiskANN,msspacev-1B,10,1034.378157726947,3598.717198799291,1000000.0,57414268.0,55506.071518532684,113.13927548096602,7694.289875835721,0.91110997407559
diskann-t2,DiskANN,msspacev-1B,10,1371.7472009562052,2804.5663255560103,1000000.0,57414268.0,41854.846111570834,83.73499113112294,5834.428264428981,0.9046800382043936

opened by harsha-simhadri 6

track1_baseline_faiss/baseline_faiss.py runs out of memory for 100M vectors on F32s_v2 with 64G RAM

Hello!

Thanks for providing the scripts for running baselines. The following one liner:

python -u track1_baseline_faiss/baseline_faiss.py --dataset bigann-100M \
    --indexkey OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr \
    --maxtrain 100000000 \
    --two_level_clustering \
    --build \
    --add_splits 30 \
    --indexfile data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex \
    --quantizer_efConstruction 200 \
    --quantizer_add_efSearch 80

produces output on F32s_v2 with 64G RAM:

args= Namespace(M0=-1, add_bs=100000, add_splits=30, autotune_max=[], autotune_range=[], basedir=None, build=True, buildthreads=-1, by_residual=-1, clustering_niter=-1, dataset='bigann-100M', indexfile='data/track1_baseline_faiss/deep-100M.IVF1M_2level_PQ64x4fsr.faissindex', indexkey='OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr', inter=True, k=10, maxRAM=-1, maxtrain=100000000, min_test_duration=3.0, n_autotune=500, no_precomputed_tables=False, pairwise_quantization='', parallel_mode=-1, prepare=False, quantizer_add_efSearch=80, quantizer_efConstruction=200, query_bs=-1, radius=96237, search=False, searchparams=['autotune'], searchthreads=-1, stop_at_split=-1, train_on_gpu=False, two_level_clustering=True)
nb processors 32
model name	: Intel(R) Xeon(R) Platinum 8272CL CPU @ 2.60GHz
Dataset BigANNDataset in dimension 128, with distance euclidean, search_type knn, size: Q 10000 B 100000000
build index, key= OPQ64_128,IVF1048576_HNSW32,PQ64x4fsr
Build-time number of threads: 32
metric type 1
Update add-time parameters
   update quantizer efSearch= 16 -> 80
  update quantizer efConstruction= 40 -> 200
getting first 100000000 dataset vectors for training
train, size (100000000, 128)
  Forcing OPQ training PQ to PQ4
  training vector transform
  transform trainset
Killed

Can you please explain what could be wrong? Is the expectation to allocate 10% of data for training?
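
As a back-of-the-envelope check rather than a diagnosis, the sizes in the log can be turned into memory estimates. The sketch assumes the training vectors are held as float32 and that the "transform trainset" step writes its output to a second array of the same shape (OPQ64_128 keeps 128 output dimensions); the helper name float32_matrix_gib is hypothetical.

```python
def float32_matrix_gib(n_vectors, dim):
    """Size in GiB of an (n_vectors, dim) float32 array."""
    return n_vectors * dim * 4 / 2**30


# "train, size (100000000, 128)" from the log above.
train_gib = float32_matrix_gib(100_000_000, 128)
# Assumption: the transformed training set is a second array of the same shape.
both_gib = 2 * train_gib

print(f"training set:         {train_gib:.2f} GiB")
print(f"with transformed copy: {both_gib:.2f} GiB")
```

Under these assumptions the training set alone takes roughly 48 GiB and the two arrays together roughly 95 GiB, which would exceed 64 GB of RAM. A smaller --maxtrain value would shrink both figures proportionally.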

opened by DmitryKey 6

faiss segfaults on f32 instances

I cannot reproduce it on h8 or e8 instances, but on f32v2 instances faiss will segfault with some parameter settings. E.g., set up everything to run msturing-1B and carry out

params="
nprobe=128,quantizer_efSearch=128
nprobe=64,quantizer_efSearch=512
nprobe=128,quantizer_efSearch=256
nprobe=128,quantizer_efSearch=512
nprobe=256,quantizer_efSearch=256
nprobe=256,quantizer_efSearch=512
"

python  track1_baseline_faiss/baseline_faiss.py \
           --dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex \
              --search --searchparams $params

results in

azureuser@test:~/big-ann-benchmarks$ bash test.sh
nb processors 32
model name      : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
Dataset MSTuringANNS in dimension 100, with distance euclidean, search_type knn, size: Q 100000 B 1000000000
reading data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex
imbalance_factor= 1.5638867719477003
index size on disk:  41360658380
current RSS: 44945760256
precomputed tables size: 0
Search threads: 32
Optimize for intersection @  10
Running evaluation on 6 searchparams
parameters                                   inter@ 10 time(ms/q)   nb distances %quantization #runs
nprobe=128,quantizer_efSearch=128        test.sh: line 12:  8954 Killed                  python track1_baseline_faiss/baseline_faiss.py --dataset msturing-1B --indexfile data/msturing-1B.IVF1M_2level_PQ64x4fsr.faissindex --search --searchparams $params

Any thoughts Matthijs? (Once you are back from vacation)

opened by maumueller 6

Any Plans for supporting ScaNN?

Hi, I wanted to know whether there are any plans to add benchmarks for ScaNN. I am not sure whether ScaNN benchmarks on large datasets exist anywhere, so I was curious about that. Thanks!

opened by vamossagar12 4
Bug? when running evaluate

Hi ! Thanks for providing the scripts for evaluating results.

I found that when running python data_export.py --output res.csv, the line power_capture.detect_power_benchmarks(metrics, res) exhausts the generator res, so the next line, for i, (properties, run) in enumerate(res):, yields nothing and nothing is written to res.csv.

I'm still studying the code, and not sure if this is a bug...
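
The behavior described here is standard for Python generators and can be reproduced in isolation. The sketch below does not use the repository's code; load_results and count_items are hypothetical stand-ins for the result loader and for any function that iterates over its argument.

```python
def load_results():
    """Stand-in for a loader that yields (properties, run) pairs lazily."""
    for i in range(3):
        yield ({"run": i}, f"run-{i}")


def count_items(results):
    """Stand-in for any function that iterates over its argument."""
    return sum(1 for _ in results)


res = load_results()
first_pass = count_items(res)    # consumes the generator
second_pass = count_items(res)   # nothing is left to iterate over

# Materializing the generator once lets several consumers share the data.
res_list = list(load_results())
list_first = count_items(res_list)
list_second = count_items(res_list)

print(first_pass, second_pass, list_first, list_second)
```

If the first consumer only needs to inspect the results, converting the generator to a list before passing it on (or creating a fresh generator for the second loop) avoids the empty second pass, at the cost of holding all results in memory.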

opened by TokyoWolFrog 4
faiss T3 range search on SSNPP crash

@maumueller Alright, I tried the index strategy "OPQ32_128,IVF1048576_HNSW32,PQ32" on SSNPP and got the exception below. Note that I'm now defaulting to CPU on build_index for this dataset since the quantizer class doesn't support range search.

I will next try to set quantizer_on_gpu_add=False and train_on_gpu=False for build_index(). The default was True for both.

...
Training PQ slice 30/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
  Preprocessing in 0.00 s
  Iteration 24 (0.36 s, search 0.32 s): objective=1.11718e+07 imbalance=1.174 nsplit=0
Training PQ slice 31/32
Clustering 65536 points in 4D to 256 clusters, redo 1 times, 25 iterations
  Preprocessing in 0.00 s
  Iteration 24 (0.36 s, search 0.32 s): objective=1.12101e+07 imbalance=1.185 nsplit=0
doing polysemous training for PQ
IndexIVFPQ::precompute_table: not precomputing table, it would be too big: 34359738368 bytes (max 2147483648)
Total train time 14384.034 s
============== SPLIT 0/1
Process Process-1:
Traceback (most recent call last):
  File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/main.py", line 45, in run_worker
    run_no_docker(definition, args.dataset, args.count,
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 268, in run_no_docker
    run_from_cmdline(cmd)
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 182, in run_from_cmdline
    run(definition, args.dataset, args.count, args.runs, args.rebuild)
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/runner.py", line 76, in run
    algo.fit(dataset)
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 274, in fit
    index = build_index(buildthreads, by_residual, maxtrain,
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 184, in build_index
    for xblock, assign in stage2:
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 46, in rate_limited_iter
    res = res.get()
  File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 771, in get
    raise self._value
  File "/home/george/anaconda3/envs/bigann/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 39, in next_or_None
    return next(l)
  File "/home/george/Projects/BigANN/harsha/big-ann-benchmarks/benchmark/algorithms/faiss_t3.py", line 176, in produce_batches
    _, assign = quantizer_gpu.search(xblock, 1)
  File "/home/george/anaconda3/envs/bigann/lib/python3.8/site-packages/faiss/init.py", line 287, in replacement_search
    assert d == self.d
AssertionError

opened by sourcesync 4
Implement Track-3 base-line index database

I tried to build the Track 3 baseline index as follows:

python track3_baseline_faiss/gpu_baseline_faiss.py --dataset bigann-1B \
    --indexkey IVF1048576,SQ8 \
    --train_on_gpu \
    --build --quantizer_on_gpu_add --add_splits 30 \
    --search \
    --searchparams nprobe={1,4,16,64,256} \
    --parallel_mode 3 --quantizer_on_gpu_search

but it failed because my machine has 128 GB of DRAM, less than the 768 GB of the baseline machine. Could someone provide a link to the Track 3 baseline? Thanks.

opened by Zachacy 1
Couldn't access the slides of talks from track winners

I couldn't access the slides of the talks from the track winners. The URLs look like https://big-ann-benchmarks.com/templates/slides/*, e.g. https://big-ann-benchmarks.com/templates/slides/invited-talk-anshu.pptx, and they just return a 404 error.

opened by shanPic 0
is this a permanent fork of ann-benchmark?

It feels a bit weird to see a lot of activity on this repo rather than trying to contribute to the original one https://github.com/erikbern/ann-benchmarks

Is the ambition to merge it back into the main repo? Or is this just a short-lived repo anyway?

I'm happy to donate my code to something more neutral (e.g. we can set up a neutral github.com organization rather than have the code under my username). Seems like it would be beneficial to not diverge too far.

(also felt a bit weird that no one told me about this – I found out about it randomly)

@maumueller wdyt?

opened by erikbern 4
Discussion on Future Directions
Dear all,

<tl;dr> Please add your thoughts on the future of this benchmark!

Thank you very much for participating in our NeurIPS'21 competition. The competition will end with an event on Dec 8, and you can find the timeline for this event on https://big-ann-benchmarks.com/. We hope many of you will be able to participate!

The last part of the event will be an open discussion among the participants for future directions of this competition. As organizers we have already identified some points we would like to discuss and potentially include in a future version of the benchmark.

Filtered ANNS: Can you support ANNS queries that allow filters such as a date range, an author, or some combination of attributes? This would look like a simple SQL + ANNS query.

Streaming ANNS: Can algorithms be robust to insertions and deletions? Here we have a strong baseline (fresh-diskann: https://arxiv.org/abs/2105.09613).

Out of distribution queries: this is already a problem with T2I and we can imagine various variations

Better vector compression: Most approaches use some variant of product quantization as vector compression, but can we get more accurate estimation, maybe at the price of more expensive decoding?
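
To illustrate what a "SQL + ANNS" query from the Filtered ANNS item could mean, here is a brute-force sketch that applies an attribute predicate before ranking by distance. It is only one possible reading of the idea: the function filtered_knn, the attribute dictionaries, and the choice of squared Euclidean distance are assumptions for illustration, and a practical system would use an index rather than a linear scan.

```python
import heapq


def filtered_knn(vectors, attributes, query, k, predicate):
    """Return ids of the k nearest vectors whose attributes satisfy predicate."""
    candidates = (
        (sum((a - b) ** 2 for a, b in zip(vec, query)), idx)
        for idx, (vec, attrs) in enumerate(zip(vectors, attributes))
        if predicate(attrs)  # the "WHERE" clause of the query
    )
    # nsmallest orders by (distance, id), so ties break on the smaller id.
    return [idx for _, idx in heapq.nsmallest(k, candidates)]


vectors = [[0, 0], [1, 1], [2, 2], [3, 3]]
attributes = [{"year": 2019}, {"year": 2021}, {"year": 2020}, {"year": 2021}]

# Roughly: SELECT id ... WHERE year >= 2020 ORDER BY distance LIMIT 2
neighbors = filtered_knn(vectors, attributes, [0, 0], 2,
                         lambda attrs: attrs["year"] >= 2020)
print(neighbors)
```

The closest vector overall (id 0) is excluded by the filter, so the result starts from the nearest vector that passes it.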

Please let us know what you think about these topics, and add your own!

Thanks!
opened by maumueller 5
T1/kst ann t1

Code for the Billion-Scale Approximate Nearest Neighbor Search Challenge T1 (re-upload). Our team name registered for T1 was Kuaishou Technology Billion-Scale ANN Challenge Track1; kst_ann_t1 is our algorithm name.

opened by NJU-yasuo 15
