It is a forest of random projection trees

Lyst

Last update: Dec 29, 2022

Related tags

Machine Learning rpforest

Overview

rpforest

rpforest is a Python library for approximate nearest neighbours search: finding points in a high-dimensional space that are close to a given query point in a fast but approximate manner.

rpforest differs from alternative ANN packages such as annoy by not requiring the storage of all the vectors indexed in the model. Used in this way, rpforest serves to produce a list of candidate ANNs for use by a further service where point vectors are stored (for example, a relational database).

How it works

It works by building a forest of N binary random projection trees.

In each tree, the set of training points is recursively partitioned into smaller and smaller subsets until a leaf node of at most M points is reached. Each partition is based on the cosine of the angle the points make with a randomly drawn hyperplane: points whose angle is smaller than the median angle fall in the left partition, and the remaining points fall in the right partition.

The resulting tree has predictable leaf size (no larger than M) and is approximately balanced because of median splits, leading to consistent tree traversal times.

Querying the model is accomplished by traversing each tree to the query point's leaf node to retrieve ANN candidates from that tree, then merging them and sorting by distance to the query point.
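The partitioning scheme described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not rpforest's actual implementation (which lives in a compiled extension): the names build_tree and find_leaf and the dictionary-based node layout are assumptions made for readability, and all vectors are assumed to be non-zero.

```python
import numpy as np


def build_tree(X, indices, leaf_size, rng):
    """Recursively split `indices` at the median cosine with a random direction."""
    if len(indices) <= leaf_size:
        return {"leaf": indices}

    # Draw a random hyperplane normal and normalise it.
    normal = rng.standard_normal(X.shape[1])
    normal /= np.linalg.norm(normal)

    # Cosine of the angle between each point and the normal.
    points = X[indices]
    cosines = points @ normal / np.linalg.norm(points, axis=1)
    median = np.median(cosines)

    # A smaller angle means a larger cosine, so those points go left.
    left = indices[cosines > median]
    right = indices[cosines <= median]
    if len(left) == 0 or len(right) == 0:
        # Degenerate split (e.g. duplicate points): stop splitting here.
        return {"leaf": indices}

    return {
        "normal": normal,
        "median": median,
        "left": build_tree(X, left, leaf_size, rng),
        "right": build_tree(X, right, leaf_size, rng),
    }


def find_leaf(tree, x):
    """Walk from the root to the leaf that the query vector falls into."""
    node = tree
    while "leaf" not in node:
        cosine = x @ node["normal"] / np.linalg.norm(x)
        node = node["left"] if cosine > node["median"] else node["right"]
    return node["leaf"]


rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 16))
tree = build_tree(X, np.arange(len(X)), leaf_size=50, rng=rng)
leaf = find_leaf(tree, X[0])  # indices of the candidate neighbours of X[0]
```

A forest would repeat this with independently drawn hyperplanes for each tree and merge the leaves returned for a query.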

Installation

Install numpy first.
Install rpforest using pip: pip install rpforest

Usage

Fitting

Model fitting is straightforward:

from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X)

The speed-precision tradeoff is governed by the leaf_size and no_trees parameters. Increasing leaf_size leads the model to produce shallower trees with larger leaf nodes; increasing no_trees fits more trees.
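To get a rough feel for how the two parameters interact, the following back-of-the-envelope sketch estimates the tree depth and the candidate budget per query. It assumes perfectly balanced median splits; the helper name forest_stats is hypothetical and not part of rpforest.

```python
import math


def forest_stats(n_points, leaf_size, no_trees):
    """Estimate tree depth and the maximum number of candidates per query."""
    # Each median split halves a node, so splitting stops once
    # n_points / 2**depth <= leaf_size.
    depth = max(0, math.ceil(math.log2(n_points / leaf_size)))
    # One leaf per tree is visited, each holding at most leaf_size points.
    max_candidates = no_trees * leaf_size
    return depth, max_candidates


print(forest_stats(1_000_000, leaf_size=50, no_trees=10))   # (15, 500)
print(forest_stats(1_000_000, leaf_size=500, no_trees=10))  # (11, 5000)
```

Larger leaves shorten the traversal but leave more candidates to sort per query; more trees increase both memory use and the number of candidates.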

In-memory queries

Where the entire set of points can be kept in memory, rpforest supports in-memory ANN queries. After fitting, ANNs can be obtained by calling:

nns = model.query(x_query, 10)

This returns the nearest neighbours of the query vector by first retrieving candidate NNs from the query's leaf node in each tree, then merging them and sorting by cosine similarity with the query. At most no_trees * leaf_size NNs can be returned.
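The merge-and-sort step can be illustrated as follows. This is a sketch under the assumption that each tree contributes an array of row indices; merge_and_rank is a hypothetical helper, not an rpforest function.

```python
import numpy as np


def merge_and_rank(X, x_query, per_tree_candidates, k):
    """Merge candidate indices from several trees and keep the k most similar."""
    # A point may appear in the query's leaf in more than one tree.
    candidates = np.unique(np.concatenate(per_tree_candidates))
    vectors = X[candidates]
    sims = vectors @ x_query / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(x_query)
    )
    order = np.argsort(-sims)[:k]
    return candidates[order]


X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
per_tree = [np.array([0, 2]), np.array([1, 2]), np.array([2, 3])]
print(merge_and_rank(X, np.array([1.0, 0.0]), per_tree, k=2))  # [0 1]
```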

Candidate queries

rpforest can support indexing and candidate ANN queries on datasets larger than would fit in available memory. This is accomplished by first fitting the model on a subset of the data, then indexing a larger set of data into the fitted model:

from rpforest import RPForest

model = RPForest(leaf_size=50, no_trees=10)
model.fit(X_train)

model.clear()  # Deletes X_train vectors

for point_id, x in get_x_vectors():
    model.index(point_id, x)

nns = model.get_candidates(x_query, 10)
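Because the model no longer holds the vectors in this mode, the candidates still need to be ranked by whichever service stores them. The sketch below uses an in-memory SQLite table as a stand-in for that store. It assumes the candidates come back as a list of the point ids passed to index; the table layout, the JSON encoding, and the rerank helper are illustrative choices rather than part of rpforest.

```python
import json
import math
import sqlite3


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(conn, candidate_ids, x_query, k):
    """Fetch candidate vectors from the store and keep the k most similar."""
    if not candidate_ids:
        return []
    placeholders = ",".join("?" for _ in candidate_ids)
    rows = conn.execute(
        f"SELECT id, vector FROM points WHERE id IN ({placeholders})",
        list(candidate_ids),
    ).fetchall()
    scored = [(cosine(json.loads(vec), x_query), pid) for pid, vec in rows]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (id TEXT PRIMARY KEY, vector TEXT)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [("a", "[1.0, 0.0]"), ("b", "[0.6, 0.8]"), ("c", "[0.0, 1.0]")],
)
print(rerank(conn, ["a", "b", "c"], [1.0, 0.1], k=2))  # ['a', 'b']
```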

Model persistence

Model persistence is achieved simply by pickling and unpickling.

import pickle
model = pickle.loads(pickle.dumps(model))
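To keep a model between processes, the same mechanism can write to a file. In this sketch a plain dictionary stands in for a fitted model so that the example runs without rpforest installed; the file name is arbitrary.

```python
import os
import pickle
import tempfile

# Stand-in object; a fitted RPForest instance would be used in its place.
model = {"leaf_size": 50, "no_trees": 10}

path = os.path.join(tempfile.mkdtemp(), "model.pickle")
with open(path, "wb") as f:
    pickle.dump(model, f, protocol=pickle.HIGHEST_PROTOCOL)

with open(path, "rb") as f:
    restored = pickle.load(f)
```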

Performance

Erik Bernhardsson, the author of annoy, maintains an ANN performance shootout repository, comparing a number of Python ANN packages.

On the GloVe cosine distance benchmark, rpforest is not as fast as highly optimised C and C++ packages like FLANN and annoy. However, it far outperforms scikit-learn's LSHForest and panns.

Development

Pull requests are welcome. To install for development:

Clone the rpforest repository: git clone git@github.com:lyst/rpforest.git
Install it for development using pip: cd rpforest && pip install -e .
You can run tests by running python setup.py test.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize to regenerate the extension .cpp files before running the pip install -e . command.

Comments

Does rpforest support custom similarity/distance functions?

Hi @maciejkula, according to your paper titled "Metadata Embeddings for User and Item Cold-start Recommendations", Lyst generates recommendations using lightfm and some kind of ANN algorithm. So I came to rpforest in Lyst's repository, and I think that may be exactly the ANN in question. Now suppose that we have trained a lightfm model, including embeddings and biases. It seems it is still hard to rapidly generate top-k recommendations using rpforest since, as the README says, rpforest is based on cosine similarity, whereas the score for a user-item pair in lightfm is the dot product of two embeddings plus two biases. So my question is:

Does rpforest support a custom similarity/distance function, or is there some other way to achieve top-k recommendation?

thanks jianyi
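One generic way to approach this without a custom distance is to transform the vectors so that cosine similarity gives the same ranking as the dot product plus the item bias. The sketch below is not an rpforest feature, and how well it interacts with rpforest's partitioning has not been verified here; the helper names are hypothetical. The user bias is dropped because it is constant for a given user and does not change the ranking.

```python
import numpy as np


def to_cosine_space(item_embeddings, item_biases):
    """Map items so that cosine similarity ranks them by dot product plus bias."""
    # Fold the item bias into the vector; the matching user entry is 1.
    items = np.hstack([item_embeddings, item_biases[:, None]])
    # Pad every item to the same norm so cosine is monotone in the dot product.
    norms = np.linalg.norm(items, axis=1)
    padding = np.sqrt(np.maximum(norms.max() ** 2 - norms ** 2, 0.0))
    return np.hstack([items, padding[:, None]])


def user_query(user_embedding):
    """Matching query vector: bias slot set to 1, padding slot set to 0."""
    return np.concatenate([user_embedding, [1.0, 0.0]])


rng = np.random.default_rng(0)
item_emb, item_bias = rng.standard_normal((100, 8)), rng.standard_normal(100)
user_emb = rng.standard_normal(8)

items = to_cosine_space(item_emb, item_bias)
query = user_query(user_emb)
cosines = items @ query / (np.linalg.norm(items, axis=1) * np.linalg.norm(query))
scores = item_emb @ user_emb + item_bias  # user bias omitted: constant per user
```

Sorting the items by cosines and by scores gives the same order, so the transformed item vectors could be indexed and the transformed user vector used as the query.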

opened by hiyijian 10

Compile error when installing rpforest

In file included from rpforest/rpforest_fast.cpp:271:
In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/arrayobject.h:4:
In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarrayobject.h:17:
In file included from /usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/ndarraytypes.h:1804:
/usr/local/lib/python2.7/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
#warning "Using deprecated NumPy API, disable it by " \
 ^
rpforest/rpforest_fast.cpp:5727:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
    __pyx_v_node->indices->shrink_to_fit();
    ~~~~~~~~~~~~~~~~~~~~~  ^
rpforest/rpforest_fast.cpp:5940:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
    __pyx_v_node->indices->shrink_to_fit();
    ~~~~~~~~~~~~~~~~~~~~~  ^
1 warning and 2 errors generated.
error: command 'gcc' failed with exit status 1

opened by delip 10

Does windows support the libs

In a Windows 7 environment, I ran into a problem when installing rpforest. I use VS2015. The compile error output:

C:\Users\juine\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe /c logo /Ox /MD /W3 /GS- /DNDEBUG -ID:\Python27\lib\site-packages\numpy\core\include -ID:\Python27\include -ID:\Python27\PC /Tprpforest/rpforest_fast.cpp /Fobuild\temp.win32-2.7\Release\rpforest/rpforest_fast.obj -ffast-math
cl : Command line warning D9002 : ignoring unknown option '-ffast-math'
rpforest_fast.cpp
d:\python27\lib\site-packages\numpy\core\include\numpy\npy_1_7_deprecated_api.h(12) : Warning Msg: Using deprecated NumPy API, disable it by #defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION
rpforest/rpforest_fast.cpp(271) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
error: command 'C:\Users\juine\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\cl.exe' failed with exit status 2

opened by juine 4

C++ error with python 3.5

Hello, I'm trying to fit an rpforest model on a big matrix (3000000 x 300) in Python 3.5 on OS X 10.11 and I get the following error:

Traceback (most recent call last):
  File "rpforest_test.py", line 29, in <module>
    index.fit(model.syn0)
  File "/usr/local/lib/python3.5/site-packages/rpforest/rpforest.py", line 81, in fit
    tree.make_tree(self._X)
  File "rpforest/rpforest_fast.pyx", line 237, in rpforest.rpforest_fast.Tree.make_tree (rpforest/rpforest_fast.cpp:3896)
ValueError: Buffer dtype mismatch, expected 'double' but got 'float'
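
The message indicates that the compiled extension expects a buffer of C doubles (float64) while the array passed in holds C floats (float32). A likely workaround, assuming fit accepts any C-contiguous float64 array, is to convert the matrix first. The small random matrix below is a hypothetical stand-in for the embedding matrix in the report.

```python
import numpy as np

# Hypothetical stand-in for a float32 embedding matrix such as model.syn0.
vectors = np.random.default_rng(0).standard_normal((1000, 300)).astype(np.float32)

# Convert to C-contiguous float64 before fitting.
X = np.ascontiguousarray(vectors, dtype=np.float64)
print(vectors.dtype, X.dtype)  # float32 float64
```

Note that the conversion doubles the memory needed for the matrix: at 3000000 x 300, a float64 copy takes about 7.2 GB.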

opened by w4nderlust 2

Errors installing rpforest in conda environment on Mac OS X

OS/compiler details:

OS X version: 10.11.2 (El Capitan)

$ clang --version
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.2.0
Thread model: posix

gcc is an alias for clang.

Installing rpforest in a fresh virtualenv environment works fine:

$ mkvirtualenv rpfenv
... python 3.4 env built ...
(rpfenv)$ pip install numpy
... numpy 1.10.2 installed ...
(rpfenv)$ pip install rpforest
... rpforest 1.1 installed ...

rpforest_fast was compiled successfully with:

clang -Wno-unused-result -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/dsc/.virtualenvs/rpfenv/lib/python3.4/site-packages/numpy/core/include -I/usr/local/Cellar/python3/3.4.3_2/Frameworks/Python.framework/Versions/3.4/include/python3.4m -c rpforest/rpforest_fast.cpp -o build/temp.macosx-10.10-x86_64-3.4/rpforest/rpforest_fast.o -std=c++11

Trying to do the same in a conda environment results in compilation errors however:

$ conda create -n rpfenv2 python=3.4
... python 3.4 env built ...
$ source activate rpfenv2
(rpfenv2)$ pip install numpy
... numpy 1.10.2 installed ...
(rpfenv2)$ pip install rpforest

rpforest 1.1 is downloaded, but compilation fails. Compilation command:

clang -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/Users/dsc/miniconda3/envs/rpfenv2/include -arch x86_64 -I/Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include -I/Users/dsc/miniconda3/envs/rpfenv2/include/python3.4m -c rpforest/rpforest_fast.cpp -o build/temp.macosx-10.5-x86_64-3.4/rpforest/rpforest_fast.o -std=c++11

Compiler errors:

  In file included from rpforest/rpforest_fast.cpp:271:
  In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/arrayobject.h:4:
  In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/ndarrayobject.h:18:
  In file included from /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/ndarraytypes.h:1781:
  /Users/dsc/miniconda3/envs/rpfenv2/lib/python3.4/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:15:2: warning: "Using deprecated NumPy API, disable it by "          "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
  #warning "Using deprecated NumPy API, disable it by " \
   ^
  rpforest/rpforest_fast.cpp:5727:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
      __pyx_v_node->indices->shrink_to_fit();
      ~~~~~~~~~~~~~~~~~~~~~  ^
  rpforest/rpforest_fast.cpp:5940:28: error: no member named 'shrink_to_fit' in 'std::vector<int, std::allocator<int> >'
      __pyx_v_node->indices->shrink_to_fit();
      ~~~~~~~~~~~~~~~~~~~~~  ^
  1 warning and 2 errors generated.
  error: command 'clang' failed with exit status 1

opened by davechallis 1

label points that are being fit()

I'm not sure if the implementation already supports this, but is it possible to assign a label to every point passed to fit(), so that when there is a query, I can identify the neighbours by their labels?
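
A simple approach that needs no library support is to keep the labels in a sequence aligned with the rows of the fitted matrix. This sketch assumes that a query returns row indices into that matrix; labels_for is a hypothetical helper and the query result is made up for illustration.

```python
# Labels kept in the same order as the rows of the matrix passed to fit().
labels = ["shoe", "boot", "sandal", "hat"]


def labels_for(neighbour_indices, labels):
    """Translate row indices returned by a query into their labels."""
    return [labels[i] for i in neighbour_indices]


# Hypothetical query result: row indices ordered by similarity.
print(labels_for([2, 0], labels))  # ['sandal', 'shoe']
```

For candidate queries, the README's index(point_id, x) call already associates an identifier with each point.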

opened by delip 1

CircleCI 2.0, tox, py35+ tests and support

Decided to use tox, so that you can:

run tests locally in multiple python versions

have a consistent testing platform across CI and local environment

Also:

updated readme

refactored setup.py ... made it not a fatal exception if python setup.py is run without an installed numpy

fixed flake8 issues

black-ified code

fixed tests code to support py35+

updated cpp library with the latest cython 0.29.14 (previously 0.23.4)

opened by iserko 0

Reuse hyperplanes

This PR makes all interior nodes of a tree at a given depth use the same projection hyperplane. This drastically reduces the memory footprint of the tree without affecting the guarantees of the data structure (which relies on the hyperplanes being independently drawn between the trees in the forest).
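
The size of the saving can be estimated with simple arithmetic. The sketch below counts only the floats used for hyperplane normals in one tree, assumes a full binary tree, and ignores everything else a node stores; hyperplane_floats is a hypothetical helper.

```python
def hyperplane_floats(depth, dim, shared):
    """Count the floats needed to store the hyperplanes of one full binary tree."""
    # A full binary tree with `depth` levels of splits has 2**depth - 1 interior nodes.
    interior_nodes = 2 ** depth - 1
    # Shared mode stores one hyperplane per level instead of one per node.
    planes = depth if shared else interior_nodes
    return planes * dim


per_node = hyperplane_floats(depth=15, dim=300, shared=False)
per_depth = hyperplane_floats(depth=15, dim=300, shared=True)
print(per_node, per_depth)  # 9830100 4500
```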

opened by maciejkula 0

Raising error when tree already exists

Referencing https://github.com/lyst/rpforest/blob/master/rpforest/rpforest.py#L59

The tree already exists, so is there a way to handle this gracefully instead of raising an error?

opened by RitwikGupta 0

Releases (1.6)

1.6 (Feb 8, 2020)

Source code (tar.gz)
Source code (zip)
rpforest-1.6-cp27-cp27mu-linux_x86_64.whl (801.29 KB)
rpforest-1.6-cp35-cp35m-linux_x86_64.whl (823.59 KB)
rpforest-1.6-cp36-cp36m-linux_x86_64.whl (843.34 KB)
rpforest-1.6-cp37-cp37m-linux_x86_64.whl (843.88 KB)
rpforest-1.6-cp38-cp38-linux_x86_64.whl (909.86 KB)
rpforest-1.6.tar.gz (4.19 MB)

1.5 (Jan 28, 2020)

Source code (tar.gz)
Source code (zip)

1.4 (Dec 16, 2019)

Source code (tar.gz)
Source code (zip)

Owner

Lyst

Your World of Fashion

GitHub

Machine Learning Algorithms ( Desion Tree, XG Boost, Random Forest )

Implementation of machine learning algorithms such as decision tree, random forest and xgboost on datasets, with a comparison of the results for each. Also implements ant colony and genetic algorithms on a TSP map, plays a blackjack game, and runs a robot in a grid world and evaluates its reward.

1 Jan 19, 2022

Home repository for the Regularized Greedy Forest (RGF) library. It includes original implementation from the paper and multithreaded one written in C++, along with various language-specific wrappers.

Regularized Greedy Forest Regularized Greedy Forest (RGF) is a tree ensemble machine learning method described in this paper. RGF can deliver better r

363 Dec 14, 2022

TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models.

TensorFlow Decision Forests (TF-DF) is a collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models. The library is a collection of Keras models and supports classification, regression and ranking. TF-DF is a TensorFlow wrapper around the Yggdrasil Decision Forests C++ libraries. Models trained with TF-DF are compatible with Yggdrasil Decision Forests' models, and vice versa.

538 Jan 1, 2023

A fast, scalable, high performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression and other machine learning tasks for Python, R, Java, C++. Supports computation on CPU and GPU.

Website | Documentation | Tutorials | Installation | Release Notes CatBoost is a machine learning method based on gradient boosting over decision tree

6.9k Jan 5, 2023

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees.

MooGBT is a library for Multi-objective optimization in Gradient Boosted Trees. MooGBT optimizes for multiple objectives by defining constraints on sub-objective(s) along with a primary objective. The constraints are defined as upper bounds on sub-objective loss function. MooGBT uses a Augmented Lagrangian(AL) based constrained optimization framework with Gradient Boosted Trees, to optimize for multiple objectives.

66 Dec 6, 2022

Bonsai: Gradient Boosted Trees + Bayesian Optimization

Bonsai is a wrapper for the XGBoost and Catboost model training pipelines that leverages Bayesian optimization for computationally efficient hyperparameter tuning.

24 Oct 27, 2022

ThunderGBM: Fast GBDTs and Random Forests on GPUs

Documentations | Installation | Parameters | Python (scikit-learn) interface What's new? ThunderGBM won 2019 Best Paper Award from IEEE Transactions o

648 Dec 16, 2022

My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

kNN-vs-RFR My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data In many areas, rental bikes have been launched to

1 Oct 28, 2021

Graph-total-spanning-trees - A Python script to get total number of Spanning Trees in a Graph

Total number of Spanning Trees in a Graph This is a python script just written f

0 Jul 18, 2022

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.

H2O H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Fl

6.1k Jan 5, 2023

treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions.

TreeInterpreter Package for interpreting scikit-learn's decision tree and random forest predictions. Allows decomposing each prediction into bias and

720 Dec 22, 2022

Forecasting directional movements of stock prices for intraday trading using LSTM and random forest

Forecasting directional movements of stock-prices for intraday trading using LSTM and random-forest https://arxiv.org/abs/2004.10178 Pushpendu Ghosh,

270 Dec 24, 2022

Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

1 Jan 9, 2022

Patient-Survival - Using Python, I developed a Machine Learning model using classification techniques such as Random Forest and SVM classifiers to predict a patient's survival status that have undergone breast cancer surgery.

1 Dec 28, 2021

Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

Call of Duty World League: Search & Destroy Outcome Predictions Growing up as an avid Call of Duty player, I was always curious about what factors led

2 Jan 18, 2022
