A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Nicholas Monath

Last update: Nov 3, 2022

Related tags

Machine Learning clustering nearest-neighbor-search nearest-neighbors hierarchical-clustering online-clustering incremental-clustering

Overview

Install

Linux wheels available (python >=3.6) on pypi:

pip install graphgrove

Building from source:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make

To build your own wheel:

conda create -n gg python=3.8
conda activate gg
pip install numpy
make
pip install build
python -m build --wheel
# which can be used as:
# pip install --force dist/graphgrove-0.0.1-cp37-cp37m-linux_x86_64.whl

Examples

Toy examples of clustering, DAG-structured clustering, and nearest neighbor search are available.

At a high level, incremental clustering can be done as:

import graphgrove as gg
k = 5
num_rounds = 50
thresholds = np.geomspace(1.0, 0.001, num_rounds).astype(np.float32)
scc = gg.vec_scc.Cosine_SCC(k=k, num_rounds=num_rounds, thresholds=thresholds, index_name='cosine_sgtree', cores=cores, verbosity=0)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    scc.partial_fit(batch)

Incremental nearest neighbor search can be done as:

import graphgrove as gg
k=5
cores=4
tree = gg.graph_builder.Cosine_SGTree(k=k, cores=cores)
# data_batches - generator of numpy matrices mini-batch-size by dim
for batch in data_batches:
    tree.insert(batch) # or tree.insert_and_knn(batch)

Algorithms Implemented

Clustering:

Sub-Cluster Component Algorithm (SCC) and its minibatch variant from the paper: Scalable Hierarchical Agglomerative Clustering. Nicholas, Monath, Kumar Avinava Dubey, Guru Guruganesh, Manzil Zaheer, Amr Ahmed, Andrew McCallum, Gokhan Mergen, Marc Najork Mert Terzihan Bryon Tjanaka Yuan Wang Yuchen Wu. KDD. 2021
DAG Structured clustering (LLama) from DAG-Structured Clustering by Nearest Neighbors. Nicholas Monath, Manzil Zaheer, Kumar Avinava Dubey, Amr Ahmed, Andrew McCallum. AISTATS 2021.

Nearest Neighbor Search:

CoverTree: Alina Beygelzimer, Sham Kakade, and John Langford. "Cover trees for nearest neighbor." ICML. 2006.
SGTree: SG-Tree is a new data structure for exact nearest neighbor search inspired from Cover Tree and its improvement, which has been used in the TerraPattern project. At a high level, SG-Tree tries to create a hierarchical tree where each node performs a "coarse" clustering. The centers of these "clusters" become the children and subsequent insertions are recursively performed on these children. When performing the NN query, we prune out solutions based on a subset of the dimensions that are being queried. This is particularly useful when trying to find the nearest neighbor in highly clustered subset of the data, e.g. when the data comes from a recursive mixture of Gaussians or more generally time marginalized coalscent process . The effect of these two optimizations is that our data structure is extremely simple, highly parallelizable and is comparable in performance to existing NN implementations on many data-sets. Manzil Zaheer, Guru Guruganesh, Golan Levin, Alexander Smola. TerraPattern: A Nearest Neighbor Search Service. 2019.

Pyomo is an object-oriented algebraic modeling language in Python for structured optimization problems.

Pyomo is a Python-based open-source software package that supports a diverse set of optimization capabilities for formulating and analyzing optimization models. Pyomo can be used to define symbolic problems, create concrete problem instances, and solve these instances with standard solvers.

1.4k Dec 28, 2022

Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

Call of Duty World League: Search & Destroy Outcome Predictions Growing up as an avid Call of Duty player, I was always curious about what factors led

2 Jan 18, 2022

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

14.5k Jan 7, 2023

Mixing up the Invariant Information clustering architecture, with self supervised concepts from SimCLR and MoCo approaches

Self Supervised clusterer Combined IIC, and Moco architectures, with some SimCLR notions, to get state of the art unsupervised clustering while retain

9 Feb 13, 2022

Self Organising Map (SOM) for clustering of atomistic samples through unsupervised learning.

Self Organising Map for Clustering of Atomistic Samples - V2 Description Self Organising Map (also known as Kohonen Network) implemented in Python for

0 Nov 16, 2021

GroundSeg Clustering Optimized Kdtree

ground seg and clustering based on kitti velodyne data, and a additional optimized kdtree for knn and radius nn search

2 Dec 2, 2021

Predicting Baseball Metric Clusters: Clustering Application in Python Using scikit-learn

Clustering Clustering Application in Python Using scikit-learn This repository contains the prediction of baseball metric clusters using MLB Statcast

2 Apr 18, 2022

Turning images into '9-pan' palettes using KMeans clustering from sklearn.

img2palette Turning images into '9-pan' palettes using KMeans clustering from sklearn. Requirements We require: Pillow, for opening and processing ima

2 Jan 1, 2022

Upgini : data search library for your machine learning pipelines

Automated data search library for your machine learning pipelines → find & deliver relevant external data & features to boost ML accuracy :chart_with_upwards_trend:

175 Jan 8, 2023

Comments

Mini-batching time cost

Firstly, this is a great library for clustering unit-normed feature spaces fast and coherently!

I had a query about mini-batching: for some reason I expected each mini-batch to take roughly the same amount of time (or even less time on subsequent batches) when calling partial_fit. However, each mini-batch seems to take longer than the last, almost in linear fashion. This is on actual structured data with hierarchies and clusters to be found, not on randomly generated matrices.

Pseudo-behaviour First 10,000 data-points: 10 seconds to run Second 10,000 data-points: 20 seconds to run Third 10,000 data-points: 30 seconds to run Total time taken for 30000 data-points: 60 seconds.

Is this expected behaviour? As a side-note, while I found implementation details in the accompanying SCC paper and could follow them, I cannot find any details regarding the mini-batching.

opened by ASharmaML 0
Your pipy graphgrove package is buggy and prevents installation

I tried to pip install graphgrove under windows. It seems installed but some how at version 0.0.1 without any real contents. Then I tried to compile from source, it has bugs and can not pass. The wheel files somehow do not apply to windows. Hope you can seriously rebuild your package for windows and linux. Otherwise just remove from pipy since it wasted others time to try it. This is second time I saw your published work not working well. I hope you can fix them all!

opened by mosfet123 0
Add documentation with sphinx.
This PR adds documentation using jekyll + sphinx.

A preview is available here: https://mrdrozdov.github.io/graphgrove/

Notes:

The homepage is built using jekyll, and is automatically generated from the top-level README.md.

The documentation is built using sphinx, and is automatically generated by class/method docstrings.

Can run docs locally with make docs-serve, and publishes docs with make docs-deploy.

For jekyll, ruby 2.7 is required. I recommend installing this with rvm. May need to run cd homepage && bundle install.

For sphinx, look at installation information at docs/README.md.

FYI, may want to add a git hook that automatically runs make docs-deploy every time when pushing a new commit. Otherwise can manually deploy.
opened by mrdrozdov 0

Owner

Nicholas Monath

GitHub https://nmonath.github.io/graphgrove/

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and the A* Search (using the Manhattan Distance Heuristic)

17 Aug 14, 2022

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.

Ray provides a simple, universal API for building distributed applications. Ray is packaged with the following libraries for accelerating machine lear

23.3k Dec 31, 2022

A framework for building (and incrementally growing) graph-based data structures used in hierarchical or DAG-structured clustering and nearest neighbor search

Related tags

Overview

Install

Examples

Algorithms Implemented

You might also like...

Pyomo is an object-oriented algebraic modeling language in Python for structured optimization problems.

Used Logistic Regression, Random Forest, and XGBoost to predict the outcome of Search & Destroy games from the Call of Duty World League for the 2018 and 2019 seasons.

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Mixing up the Invariant Information clustering architecture, with self supervised concepts from SimCLR and MoCo approaches

Self Organising Map (SOM) for clustering of atomistic samples through unsupervised learning.

GroundSeg Clustering Optimized Kdtree

Predicting Baseball Metric Clusters: Clustering Application in Python Using scikit-learn

Turning images into '9-pan' palettes using KMeans clustering from sklearn.

Upgini : data search library for your machine learning pipelines

Comments

Mini-batching time cost

Your pipy graphgrove package is buggy and prevents installation

Add documentation with sphinx.

Owner

Nicholas Monath

A Python-based application demonstrating various search algorithms, namely Depth-First Search (DFS), Breadth-First Search (BFS), and A* Search (Manhattan Distance Heuristic)

My project contrasts K-Nearest Neighbors and Random Forrest Regressors on Real World data

Model search (MS) is a framework that implements AutoML algorithms for model architecture search at scale.

ANNchor is a python library which constructs approximate k-nearest neighbour graphs for slow metrics.

Implementation of K-Nearest Neighbors Algorithm Using PySpark

Neighbourhood Retrieval (Nearest Neighbours) with Distance Correlation.

Simple structured learning framework for python

Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Mosec is a high-performance and flexible model serving framework for building ML model-enabled backend and microservices

An open source framework that provides a simple, universal API for building distributed applications. Ray is packaged with RLlib, a scalable reinforcement learning library, and Tune, a scalable hyperparameter tuning library.