A high performance implementation of HDBSCAN clustering.

Overview

HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/.

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works, and comparing performance with other Python clustering implementations are also available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly, it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features), or an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
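
If you already have pairwise distances rather than raw feature vectors, a distance matrix can be passed instead. A minimal sketch (the Euclidean distance matrix here is purely illustrative):

import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

data, _ = make_blobs(1000)

# Precompute a pairwise distance matrix and tell HDBSCAN it is precomputed.
distance_matrix = pairwise_distances(data, metric='euclidean')

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='precomputed')
cluster_labels = clusterer.fit_predict(distance_matrix)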

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low dimensional data is better than sklearn's DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.
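
As a rough sketch of the joblib caching mentioned above (assuming the memory parameter, which accepts a joblib Memory object or a cache directory):

import hdbscan
from joblib import Memory
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

# Cache expensive intermediate results on disk so that re-clustering the same
# data with different parameters can reuse them.
memory = Memory('./hdbscan_cache', verbose=0)

labels_small = hdbscan.HDBSCAN(min_cluster_size=10, memory=memory).fit_predict(data)
labels_large = hdbscan.HDBSCAN(min_cluster_size=25, memory=memory).fit_predict(data)

The first fit pays the full cost; subsequent fits on the same data can reuse the cached intermediate computations.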

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting the data, the clusterer object has attributes for:

  • The condensed cluster hierarchy
  • The robust single linkage cluster hierarchy
  • The reachability distance minimal spanning tree

All of these come equipped with methods for plotting and converting to pandas or NetworkX formats for further analysis. See the notebook on how HDBSCAN works for examples and further details.
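
For example, with a fitted clusterer the hierarchies can be plotted or exported along these lines (a sketch; gen_min_span_tree=True is needed so the spanning tree is retained):

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, gen_min_span_tree=True).fit(data)

clusterer.condensed_tree_.plot()         # condensed cluster hierarchy
clusterer.single_linkage_tree_.plot()    # robust single linkage hierarchy
clusterer.minimum_spanning_tree_.plot()  # mutual reachability minimum spanning tree

# Export for further analysis
tree_df = clusterer.condensed_tree_.to_pandas()
tree_graph = clusterer.condensed_tree_.to_networkx()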

The clusterer objects also have an attribute providing cluster membership strengths, resulting in optional soft clustering (and no further compute expense). Finally, each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.
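
A brief sketch of accessing these attributes on a fitted clusterer (the attribute names assumed here are probabilities_ for membership strengths and cluster_persistence_ for the persistence scores):

# Per-point membership strength in the assigned cluster (0 for noise points)
membership_strengths = clusterer.probabilities_

# Per-cluster persistence (stability) scores
persistence = clusterer.cluster_persistence_

# e.g. flag points that are only weakly attached to their cluster
weak_points = data[membership_strengths < 0.05]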

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data, the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier-like objects. Selecting outliers via upper quantiles is often a good approach.
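
A minimal sketch of the quantile-based selection described above, using a fitted clusterer:

import numpy as np

# Continuing with a fitted clusterer from the examples above
scores = clusterer.outlier_scores_

# Treat, say, the top 2% of scores as outliers; the exact quantile is a judgment call.
threshold = np.quantile(scores, 0.98)
outlier_indices = np.where(scores > threshold)[0]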

Based on the paper:
R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation, this is a high performance version of the algorithm, outperforming scipy's standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()
Based on the paper:
K. Chaudhuri and S. Dasgupta. "Rates of convergence for the cluster tree." In Advances in Neural Information Processing Systems, 2010.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have an up to date pip:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <[email protected]>.

If pip is having difficulties pulling the dependencies, we suggest first upgrading pip to at least version 10 and trying again:

pip install --upgrade pip
pip install hdbscan

Otherwise, install the dependencies manually using conda, followed by pulling hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install of the latest code directly from GitHub:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

Alternatively download the package, install requirements, and manually run the installer:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

pip install -r requirements.txt

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However, we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

To reference the high performance algorithm developed in this library, please cite our paper in the ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017
@inproceedings{mcinnes2017accelerated,
  title={Accelerated Hierarchical Density Based Clustering},
  author={McInnes, Leland and Healy, John},
  booktitle={Data Mining Workshops (ICDMW), 2017 IEEE International Conference on},
  pages={33--42},
  year={2017},
  organization={IEEE}
}

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.

Comments
  • Import hdbscan ISSUE

    Hi, I have followed all the steps for installing hdbscan, but still I'm getting the error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-29-3f5a460d7435> in <module>
    ----> 1 import hdbscan
    
    /opt/conda/lib/python3.6/site-packages/hdbscan/__init__.py in <module>
    ----> 1 from .hdbscan_ import HDBSCAN, hdbscan
          2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage
          3 from .validity import validity_index
          4 from .prediction import approximate_predict, membership_vector, all_points_membership_vectors
          5 
    
    /opt/conda/lib/python3.6/site-packages/hdbscan/hdbscan_.py in <module>
         19 from scipy.sparse import csgraph
         20 
    ---> 21 from ._hdbscan_linkage import (single_linkage,
         22                                mst_linkage_core,
         23                                mst_linkage_core_vector,
    
    hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage()
    
    AttributeError: type object 'hdbscan._hdbscan_linkage.UnionFind' has no attribute '__reduce_cython__'
    

Does someone know how to fix it? Best,

    opened by greg2626 30
  • ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly

    python: Python 2.7.15rc1

    OS: Linux 6a039c3530c7 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

    pip: pip 19.1

    tried running: pip install hdbscan

  Building wheel for hdbscan (PEP 517): finished with status 'error'
  ERROR: Complete output from command /opt/conda/bin/python /opt/conda/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpvwnr9hhz:
  ERROR: running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/prediction.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/robust_single_linkage_.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/__init__.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/validity.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/plots.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/hdbscan_.py -> build/lib.linux-x86_64-3.6/hdbscan
  creating build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/test_rsl.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/__init__.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/test_hdbscan.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  running build_ext
  cythoning hdbscan/_hdbscan_tree.pyx to hdbscan/_hdbscan_tree.c
  building 'hdbscan._hdbscan_tree' extension
  creating build/temp.linux-x86_64-3.6
  creating build/temp.linux-x86_64-3.6/hdbscan
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -I/tmp/pip-build-env-s09rgxp0/overlay/lib/python3.6/site-packages/numpy/core/include -c hdbscan/_hdbscan_tree.c -o build/temp.linux-x86_64-3.6/hdbscan/_hdbscan_tree.o
  /tmp/pip-build-env-s09rgxp0/overlay/lib/python3.6/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-rfyrnh0q/hdbscan/hdbscan/_hdbscan_tree.pyx
      tree = Parsing.p_module(s, pxd, full_module_name)
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for hdbscan
  Running setup.py clean for hdbscan
Failed to build hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly
The command '/bin/sh -c pip install hdbscan' returned a non-zero code: 1
    opened by boompig 27
  • "ValueError: zero-size array to reduction operation minimum which has no identity" with no leafs

Hi, I think I might have, by accident, tracked down an error related to #115 and #144; please see below:

    Using the current master branch on Win 10 64 bits and Python 2.7.14

    import numpy as np
    import matplotlib.pyplot as plt
    
    from hdbscan import HDBSCAN
    
    
    # Generate data
    test_data = np.array([
        [0.0, 0.0],
        [1.0, 1.0],
        [0.8, 1.0],
        [1.0, 0.8],
        [0.8, 0.8]])
    
    # HDBSCAN
    np.random.seed(1)
    hdb_unweighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=True)
    hdb_unweighted.fit(test_data)
    
    fig = plt.figure()
    cd = hdb_unweighted.condensed_tree_
    cd.plot()
    fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()
    

    Whole traceback ("anonymised"):

    Traceback (most recent call last):
      File "...\JetBrains\PyCharm 2017.2.4\helpers\pydev\pydev_run_in_console.py", line 37, in run_file
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File ".../.PyCharm2017.3/config/scratches/scratch_2.py", line 22, in <module>
        cd.plot()
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 321, in plot
        max_rectangle_per_icicle=max_rectangles_per_icicle)
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 104, in get_plot_data
        leaves = _get_leaves(self._raw_tree)
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 44, in _get_leaves
        root = cluster_tree['parent'].min()
      File "C:\ProgramData\Anaconda3\envs\venv_temp_hdbscan_dev_py27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
        return umr_minimum(a, axis, None, out, keepdims)
    ValueError: zero-size array to reduction operation minimum which has no identity
    

    I have tracked down the problem to lines 42-45 of plots.py:

    def _get_leaves(condensed_tree):
        cluster_tree = condensed_tree[condensed_tree['child_size'] > 1]
        root = cluster_tree['parent'].min()
        return _recurse_leaf_dfs(cluster_tree, root)
    

    cluster_tree created here is empty, so line 44 throws an error.

    I am not sure if there is any solution to this except maybe plotting the single_linkage_tree_?

    opened by m-dz 22
  • Usage with images

    Tried using the implementation as is for working with images, as was being done in sklearn using KMeans / MiniBatchKMeans / Meanshift clustering. But consistently run into MemoryError (for images as small as 200x200 as well). Here is a sample code -

    import cv2
    import numpy as np
    import hdbscan
    
    image = cv2.imread('/home/ubuntu/x.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = image.reshape((image.shape[0] * image.shape[1], 3))
    
    clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
    cluster_labels = clusterer.fit_predict(image)
    

    Error stack trace :

    MemoryError                               Traceback (most recent call last)
    <ipython-input-12-b887e72bd6d5> in <module>()
    ----> 1 cluster_labels = clusterer.fit_predict(image)
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
        338             cluster labels
        339         """
    --> 340         self.fit(X)
        341         return self.labels_
        342 
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit(self, X, y)
        320          self._condensed_tree,
        321          self._single_linkage_tree,
    --> 322          self._min_spanning_tree) = hdbscan(X, **self.get_params())
        323         return self
        324 
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, metric, p, algorithm)
        235     else:
        236         return _hdbscan_large_kdtree(X, min_cluster_size, 
    --> 237                                      min_samples, metric, p)
        238 
        239 class HDBSCAN(BaseEstimator, ClusterMixin):
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, metric, p)
        107         p = 2
        108 
    --> 109     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples)
        110 
        111     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)
    
    /home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2820)()
    
    /home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2432)()
    
    /usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
       1174 
       1175     m, n = s
    -> 1176     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
       1177 
       1178     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
    
    MemoryError:
    

    Any obvious problem with the code? Or is this to be expected?

    opened by varadgunjal 21
  • Added Cosine Distance as Metric

I believe that using the cosine distance can be important for some applications. Based on my tests, clustering word embeddings or sentence embeddings using Euclidean distance gives bad results; using the cosine distance improves them. I believe that's because of the norm.

    opened by brunoalano 20
  • [WIP] Sample weighting feature implementation

    As discussed, a sample weighting implementation PR. "Tested" with all 6 algorithm options, but lacking any formal tests (on a roadmap...).

Weights are basically used as starting sizes for tree creation in _hdbscan_linkage.pyx; the rest (e.g. core distance calculation) is (hopefully) untouched. As you can see in the commit log, initially weights were passed down to the core distance calculation, but this did not seem to be a good idea (it was causing all zero-weight points to be virtually "excluded" from clustering); we believe this approach is much more appropriate.

    A quick demonstration (please uncomment plots if feeling adventurous):

    '''
    Example of weighted HDBSCAN.
    '''
    
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    from sklearn.datasets import make_blobs
    from sklearn.utils import shuffle
    from sklearn.preprocessing import StandardScaler
    
    from hdbscan import HDBSCAN
    
    plot_kwds = {'linewidths':0}
    palette = sns.color_palette("hls", 20)
    
    
    # Generate data
    np.random.seed(1)
    X, y = make_blobs(n_samples=400, cluster_std=3, shuffle=True, random_state=10)
    X, y = shuffle(X, y, random_state=7)
    X = StandardScaler().fit_transform(X)
    
    sample_weights = np.floor(np.random.standard_gamma(shape=0.5, size=len(X))) * 4
    
    sizes = ((sample_weights+1)*10).astype(int)
    alphas = np.fromiter((0.5 if s < 1 else 0.75 for s in sample_weights), np.float, 400).reshape(-1,1)
    
    algorithm = 'best'
    # algorithm = 'generic'
    # algorithm = 'prims_kdtree'
    # algorithm = 'prims_balltree'
    # algorithm = 'boruvka_kdtree'
    # algorithm = 'boruvka_balltree'
    
    min_cluster_size, min_samples = 8, 1
    
    
    # (Unweighted) HDBSCAN
    np.random.seed(1)
    hdb_unweighted = HDBSCAN(
        min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
        algorithm=algorithm, prediction_data=True)
    hdb_unweighted.fit(X)
    
    cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_unweighted.labels_])
    cluster_colours = np.hstack([cluster_colours, alphas])
    
    fig = plt.figure()
    plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
    fig.suptitle('Unweighted HDBSCAN'); plt.show()
    
    ## Lots of plots, all working!
    fig = plt.figure()
    hdb_unweighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
    fig.suptitle('Unweighted HDBSCAN minimum spanning tree'); plt.show()
    fig = plt.figure()
    hdb_unweighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
    fig.suptitle('Unweighted HDBSCAN single linkage tree'); plt.show()
    fig = plt.figure()
    hdb_unweighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
    fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()
    
    
    # (Weighted) HDBSCAN
    np.random.seed(1)
    hdb_weighted = HDBSCAN(
        min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
        algorithm=algorithm, prediction_data=True)
    hdb_weighted.fit(X, sample_weights=sample_weights)
    
    cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_weighted.labels_])
    cluster_colours = np.hstack([cluster_colours, alphas])
    
    fig = plt.figure()
    plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
    fig.suptitle('Weighted HDBSCAN'); plt.show()
    
    ## Lots of plots, all working (except one or two warnings...)!
    fig = plt.figure()
    hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
    fig.suptitle('Weighted HDBSCAN minimum spanning tree'); plt.show()
    # single_linkage_tree throws "plots.py:563: RuntimeWarning: divide by zero encountered in log2" but works.
    fig = plt.figure()
    hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
    fig.suptitle('Weighted HDBSCAN single linkage tree'); plt.show()
    fig = plt.figure()
    hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
    fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()
    
    ## Check weights by cluster:
    unweighted_weights = np.asarray([[lab, sum(sample_weights[hdb_unweighted.labels_ == lab])] for lab in set(hdb_unweighted.labels_)])
    weighted_weights = np.asarray([[lab, sum(sample_weights[hdb_weighted.labels_ == lab])] for lab in set(hdb_weighted.labels_)])
    print(unweighted_weights)
    print(weighted_weights)
    
    

    We are VERY open to discussion.

    enhancement new feature 
    opened by m-dz 18
  • max_cluster_size parameter

    Hi, this is the PR for issue #408 ! I've done some quick testing and it seems to be working correctly - the code changes are quite minor. I'm not 100% sure I'm correctly computing the 'size' of clusters - you can see I'm counting the size of a cluster as the 'child_size' of the edge that it's a child of, but I don't know if that incorrectly counts some points that have already fallen out of that cluster. Would it be more correct to count the size of non-leaf clusters as the sum of the child_size of the two clusters branching off from it?

    One other notable thing is that I haven't yet edited the flat.py code - I wasn't sure if a max_cluster_size parameter made sense there, and I don't understand that bit of the codebase that well. All the functions where max_cluster_size is used have a default value of 0 (i.e. no max size) set, so all other code should be unaffected by this change, even if they're still expecting the old function signature.

    opened by Rocketknight1 15
  • Crash when allow_single_cluster used with cluster_selection_epsilon

    When I use cluster_selection_epsilon=x where x > 0 and allow_single_cluster=True together, HDBSCAN crashes.

    I am using those two options together to try and get the no_structure toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only use cluster_selection_epsilon, multiple clusters appear in that square. When I only use allow_single_cluster=True, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.

    Code

    import numpy as np
    import hdbscan
    
    if __name__ == '__main__':
        no_structure = np.random.rand(1500, 2)
        clusterer = hdbscan.HDBSCAN(min_cluster_size=15, cluster_selection_epsilon=3, allow_single_cluster=True)
        clusterer.fit(no_structure)
    

    Expected behavior

    HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.

    Actual behavior

    HDBSCAN crashes with the following traceback:

    Traceback (most recent call last):
      File "/home/home/PycharmProjects/sandbox/crash_example.py", line 7, in <module>
        clusterer.fit(no_structure)
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
        self._min_spanning_tree) = hdbscan(X, **kwargs)
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 632, in hdbscan
        return _tree_to_labels(X,
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 59, in _tree_to_labels
        labels, probabilities, stabilities = get_clusters(condensed_tree,
      File "hdbscan/_hdbscan_tree.pyx", line 645, in hdbscan._hdbscan_tree.get_clusters
      File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
      File "hdbscan/_hdbscan_tree.pyx", line 631, in hdbscan._hdbscan_tree.epsilon_search
    IndexError: index 0 is out of bounds for axis 0 with size 0
    
    opened by danielzgtg 14
  • Only a single CPU core is used

Not sure if this is a bug or a feature, but I have observed that on my Ubuntu 14.04 machine HDBSCAN only ever uses one core; some other cores also spike occasionally, but 90% of the time it's just a single core at 100% and all the others at 0%.

    Is this algorithm not parallelizable? Or has it not been done yet?

    opened by ghost 12
  • BrokenProcessPool: A task has failed to un-serialize.

This is related to this issue in BERTopic, and this issue in UMAP. Probably related to this issue and this issue, too.

    I am currently Using UMAP to reduce the dimension and then using HDBSCAN to cluster the embeddings. However, I am running into the following error. Any idea why?

    My data size is 10 million rows and 5 dimensions (reduced with UMAP from 384 dimensions). I have 1TB of RAM and 32 cores.

    I am using Jupyter Notebook.

    ---------------------------------------------------------------------------
    _RemoteTraceback                          Traceback (most recent call last)
    _RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
        call_item = call_queue.get(block=True, timeout=timeout)
      File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
        return _ForkingPickler.loads(res)
      File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
      File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
      File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
      File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
    ValueError: buffer source array is read-only
    """
    
    The above exception was the direct cause of the following exception:
    
    BrokenProcessPool                         Traceback (most recent call last)
    /tmp/ipykernel_778601/1248467627.py in <module>
    ----> 1 test1=hdbscan_model.fit(embedding_test)
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
        917          self._condensed_tree,
        918          self._single_linkage_tree,
    --> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
        920 
        921         if self.prediction_data:
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
        608                                            gen_min_span_tree, **kwargs)
        609             else:
    --> 610                 (single_linkage_tree, result_min_span_tree) = memory.cache(
        611                     _hdbscan_boruvka_kdtree)(X, min_samples, alpha,
        612                                              metric, p, leaf_size,
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
        350 
        351     def __call__(self, *args, **kwargs):
    --> 352         return self.func(*args, **kwargs)
        353 
        354     def call_and_shelve(self, *args, **kwargs):
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
        273 
        274     tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
    --> 275     alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
        276                                  leaf_size=leaf_size // 3,
        277                                  approx_min_span_tree=approx_min_span_tree,
    
    hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
    
    hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
       1052 
       1053             with self._backend.retrieval_context():
    -> 1054                 self.retrieve()
       1055             # Make sure that we get a last message telling us we are done
       1056             elapsed_time = time.time() - self._start_time
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
        931             try:
        932                 if getattr(self._backend, 'supports_timeout', False):
    --> 933                     self._output.extend(job.get(timeout=self.timeout))
        934                 else:
        935                     self._output.extend(job.get())
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
        540         AsyncResults.get from multiprocessing."""
        541         try:
    --> 542             return future.result(timeout=timeout)
        543         except CfTimeoutError as e:
        544             raise TimeoutError from e
    
    /usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
        443                     raise CancelledError()
        444                 elif self._state == FINISHED:
    --> 445                     return self.__get_result()
        446                 else:
        447                     raise TimeoutError()
    
    /usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
        388         if self._exception:
        389             try:
    --> 390                 raise self._exception
        391             finally:
        392                 # Break a reference cycle with the exception in self._exception
    
    BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
    
    
    opened by ginward 11
  • some cluster_persistence_ outputs are greater than 1?

    Hi,

Quick question though: according to the HDBSCAN documentation, if my interpretation is correct, cluster_persistence_ outputs should be in [0, 1].

    However, I got some weird results like 2.81 as shown below.

import numpy as np
np.round(clusters.cluster_persistence_, 2)

Out[34]:
array([ 2.81, 0.  , 0.44, 0.  , 0.  , 0.  , 0.02, 0.  , 0.  , 0.  , 0.  ,
        0.11, 0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.01, 0.  , 0.04, 0.07,
        0.  , 0.03, 0.  , 0.  , 0.  , 0.01, 0.  , 0.  , 0.01, 0.  , 0.  ,
        0.  , 0.01, 0.01, 0.03, 0.  , 0.03, 0.  , 0.01, 0.01, 0.01, 0.01,
        0.  , 0.02, 0.  , 0.18, 0.  , 0.  , 0.  , 1.06, 0.37, 0.61, 0.77,
        0.  , 0.09, 0.  , 0.23, 0.74, 0.  , 0.24, 1.1 , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.01, 0.01, 0.01, 0.01, 0.  , 0.  , 0.  , 0.  ,
        0.06, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.03,
        0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.03, 0.03, 0.01,
        0.04, 0.  , 0.  , 0.  , 0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.  ,
        0.01, 0.  , 0.01, 0.  , 0.  , 0.02, 0.  , 0.  ])

    Am I doing things wrong?

    Cheers,

    Titan

    opened by titaniumrain 11
  • Counter-intuitive noise points

    I have been playing with hdbscan to try and build an intuition for what it is doing. Currently I am running into counter-intuitive behavior when running it on synthetic data. In particular I have been running hdbscan on data sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster similar to what dbscan would do with the proper epsilon setting. However, hdbscan is instead identifying a single cluster and a collection of noise points. If I reduce the minimum number of points needed for a cluster below 4 the noise points vanish. Due to the symmetry of the data I'm not seeing why this parameter should make much of a difference on how the clustering works. I'm curious if my intuition is way off or if there is an issue with how I am invoking hdbscan.

    Code:

    from hdbscan import HDBSCAN
    import matplotlib.pyplot as plt
    import numpy as np
    
    min_cluster_size = 4 # Minimum size to cause an issue
    samples = 100
    
    theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
    data = np.zeros((samples, 2))
    data[:,0] = np.cos(theta)
    data[:,1] = np.sin(theta)
    
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size, allow_single_cluster=True)
    clusterer.fit(data)
    
    labels = set(clusterer.labels_)
    for label in labels:
        cluster = data[clusterer.labels_ == label]
        plt.scatter(cluster[:,0], cluster[:,1])
    
    plt.axis("equal")
    plt.show()
    
    

    Expected output: image

    Actual output: image (note the orange noise points in the lower right)

System information: Python version 3.10.8, hdbscan version 0.8.29

    opened by erooke 0
  • Validity index calculation results in ValueError while calculating min/max

For a clustering use case, I tried different parameters, and while calculating the validity index I ran into the following ValueError:

    <dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:33: RuntimeWarning: invalid value encountered in divide
      result /= distance_matrix.shape[0] - 1
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In [41], line 1
    ----> 1 validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))
    
    File <dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:372, in validity_index(X, labels, metric, d, per_cluster_scores, mst_raw_dist, verbose, **kwd_args)
        358     distances_for_mst, core_distances[
        359         cluster_id] = distances_between_points(
        360         X,
       (...)
        367         **kwd_args
        368     )
        370     mst_nodes[cluster_id], mst_edges[cluster_id] = \
        371         internal_minimum_spanning_tree(distances_for_mst)
    --> 372     density_sparseness[cluster_id] = mst_edges[cluster_id].T[2].max()
        374 for i in range(max_cluster_id):
        376     if np.sum(labels == i) == 0:
    
    File <dir>/envs/venv/lib/python3.9/site-packages/numpy/core/_methods.py:40, in _amax(a, axis, out, keepdims, initial, where)
         38 def _amax(a, axis=None, out=None, keepdims=False,
         39           initial=_NoValue, where=True):
    ---> 40     return umr_maximum(a, axis, None, out, keepdims, initial, where)
    
    ValueError: zero-size array to reduction operation maximum which has no identity
    

Following is a toy example I could reproduce it with:

    from hdbscan import validity_index
    validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))
    
    opened by tacitvenom 2
  • hdbscan installation issue in Microsoft Azure App Service

    pip install hdbscan

    fails inside the Microsoft Azure App Service - Linux environment with the below error:

    ERROR: Could not build wheels for hdbscan, which is required to install pyproject.toml-based projects

    However, conda install -c conda-forge hdbscan

is not possible inside the Azure App Service Linux environment, as we don't use a Conda virtual environment.

Can you please provide a solution for a successful HDBSCAN installation?

    opened by bala1802 0
  • test error

    ............................EEE.EE..........EE..........E

    ERROR: hdbscan.tests.test_hdbscan.test_condensed_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 376, in test_condensed_tree_plot if_matplotlib(clusterer.condensed_tree_.plot)( File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_single_linkage_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 389, in test_single_linkage_tree_plot if_matplotlib(clusterer.single_linkage_tree_.plot)(cmap="Reds") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_min_span_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 397, in test_min_span_tree_plot if_matplotlib(clusterer.minimum_spanning_tree_.plot)(edge_cmap="Reds") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_tree_pandas_output_formats

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 428, in test_tree_pandas_output_formats if_pandas(clusterer.condensed_tree_.to_pandas)() File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 97, in run_test pytest.skip("Pandas not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Pandas not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_tree_networkx_output_formats

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 436, in test_tree_networkx_output_formats if_networkx(clusterer.condensed_tree_.to_networkx)() File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 112, in run_test pytest.skip("NetworkX not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: NetworkX not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_hdbscan_is_sklearn_estimator

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 646, in test_hdbscan_is_sklearn_estimator check_estimator(HDBSCAN) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 610, in check_estimator raise TypeError(msg) TypeError: Passing a class was deprecated in version 0.23 and isn't supported anymore from 0.24.Please pass an instance instead.

    ====================================================================== ERROR: hdbscan.tests.test_prediction_utils.test_safe_always_positive_division

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) TypeError: test_safe_always_positive_division() missing 1 required positional argument: 'denominator'

    ====================================================================== ERROR: hdbscan.tests.test_rsl.test_rsl_is_sklearn_estimator

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_rsl.py", line 197, in test_rsl_is_sklearn_estimator check_estimator(RobustSingleLinkage) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 610, in check_estimator raise TypeError(msg) TypeError: Passing a class was deprecated in version 0.23 and isn't supported anymore from 0.24.Please pass an instance instead.


    Ran 57 tests in 1.593s

    FAILED (errors=8)

    opened by slowRiver 0
  • Prediction Data Generation Fails w/ a Warning

    Hello,

    I have the following distance matrix (dist_matrix.npy.zip) that I calculated using some function.

    from scipy.spatial.distance import pdist
    from scipy.spatial.distance import squareform
    dist_condensed = pdist(X, metric = lambda u, v: calc_distance(u[0], v[0]))
    dist_matrix = squareform(dist_condensed)
    

    Then, I fitted a model using the attached distance matrix as follows:

    import hdbscan
    clusterer = hdbscan.HDBSCAN(min_cluster_size = 2, min_samples = 2, metric = 'precomputed', prediction_data = True)
    clusterer.fit(dist_matrix)
    

After running the above command, I got the following warning (which looks to be important if you want to predict some data at a later time):

    hdbscan/hdbscan_.py:1256: UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data rather than mere distances is required!
    

    Also, I'd like to predict the cluster of some new data points at a later time by passing a distance matrix. Does the approximate_predict method accept a distance matrix (because I used a custom distance matrix originally)? I believe that's not the case, at least based on the documentation.

    I even tried to see whether the provided example (see here) works but I got the same warning (see below).

I would appreciate it if you could help me understand why I get that warning and how I can use the above method to predict the cluster of new data points in the future.

    opened by OMirzaei 0
Releases(0.8.29-1)
  • 0.8.29-1(Oct 31, 2022)

  • 0.8.29(Oct 31, 2022)

    What's Changed

    • Fix malformed list in docs by @dmlls in https://github.com/scikit-learn-contrib/hdbscan/pull/535
    • Incorrect initialization in BallTreeBoruvka dual_tree_traversal by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/508
    • hdbscan: add support to other types of input integers by @jcfaracco in https://github.com/scikit-learn-contrib/hdbscan/pull/540
    • Add parameter to DBCV which toggles the usage of mutual reachability distances by @luis261 in https://github.com/scikit-learn-contrib/hdbscan/pull/552
    • Fixed: validity index no longer returning nan by @tadorfer in https://github.com/scikit-learn-contrib/hdbscan/pull/558
    • Use a positional argument for the cachedir/location by @lmcinnes in https://github.com/scikit-learn-contrib/hdbscan/pull/563

    New Contributors

    • @dmlls made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/535
    • @jcfaracco made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/540
    • @luis261 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/552
    • @tadorfer made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/558
    • @lmcinnes made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/563

    Full Changelog: https://github.com/scikit-learn-contrib/hdbscan/compare/0.8.28...0.8.29

  • 0.8.28.wheels(Feb 8, 2022)

  • 0.8.28(Feb 8, 2022)

  • 0.8.27(Feb 8, 2022)

    What's Changed

    • Added axis control on colourbars by @tomcharnock in https://github.com/scikit-learn-contrib/hdbscan/pull/365
    • Fixing typos by @cassmarcussen in https://github.com/scikit-learn-contrib/hdbscan/pull/366
    • Predict score by @Rhaedonius in https://github.com/scikit-learn-contrib/hdbscan/pull/368
    • fix issue 370 by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/371
    • fix the core distance computation can only use 4 cores. by @GorvinChen in https://github.com/scikit-learn-contrib/hdbscan/pull/377
    • Issue 370 by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/385
    • Fixed line widths for single linkage dendrograms with duplicate lambda_val by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/391
    • Module for flat clustering by @sabarish-akridata in https://github.com/scikit-learn-contrib/hdbscan/pull/398
    • Improved union-find data structure implementation by @kylehofmann in https://github.com/scikit-learn-contrib/hdbscan/pull/412
    • fixes #370 -- epsilon search would crash when root node in leaf list by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/400
    • let root children size==1 to join root node in allow_single based on 1/eps by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/418
    • fix issue-436 unused deprecated joblib option by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/438
    • Fix build errors by @gansanay in https://github.com/scikit-learn-contrib/hdbscan/pull/449
    • update build by @gansanay in https://github.com/scikit-learn-contrib/hdbscan/pull/450
    • Fix mismatching clusters probabilities in softclustering by @narjes23 in https://github.com/scikit-learn-contrib/hdbscan/pull/454
    • Fixing typos in "Don't be wrong!" section by @hovikgas in https://github.com/scikit-learn-contrib/hdbscan/pull/455
    • Update pyproject.toml to use oldest-supported-numpy by @stradigi-mario-bruno in https://github.com/scikit-learn-contrib/hdbscan/pull/458
    • Fix numpy 1.20 deprecation warnings by @rmcgibbo in https://github.com/scikit-learn-contrib/hdbscan/pull/468
    • max_cluster_size parameter by @Rocketknight1 in https://github.com/scikit-learn-contrib/hdbscan/pull/410
    • Fix typos by @julienschuermans in https://github.com/scikit-learn-contrib/hdbscan/pull/489
    • handle non-finite valued vectors by @jc-healy in https://github.com/scikit-learn-contrib/hdbscan/pull/497
    • Fixed the bug that joblib uses memory mapping when data size is too large by @ginward in https://github.com/scikit-learn-contrib/hdbscan/pull/495
    • Fix ZeroDivisionError: float division error when computing per_cluster_scores by @Dicksonchin93 in https://github.com/scikit-learn-contrib/hdbscan/pull/502
    • Update requirements.txt by @Ben-Epstein in https://github.com/scikit-learn-contrib/hdbscan/pull/503
    • Fixed off by one errors in min_samples for multiple algorithms by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/394
    • Modified Prims to generate MST by @pberba in https://github.com/scikit-learn-contrib/hdbscan/pull/315
    • Remove usage of six by @rahulporuri in https://github.com/scikit-learn-contrib/hdbscan/pull/516
    • dbscan cluster extraction by @jc-healy in https://github.com/scikit-learn-contrib/hdbscan/pull/519

    New Contributors

    • @tomcharnock made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/365
    • @cassmarcussen made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/366
    • @Rhaedonius made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/368
    • @neontty made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/371
    • @GorvinChen made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/377
    • @GregDemand made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/391
    • @sabarish-akridata made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/398
    • @kylehofmann made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/412
    • @gansanay made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/449
    • @narjes23 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/454
    • @hovikgas made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/455
    • @stradigi-mario-bruno made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/458
    • @rmcgibbo made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/468
    • @Rocketknight1 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/410
    • @julienschuermans made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/489
    • @jc-healy made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/497
    • @ginward made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/495
    • @Dicksonchin93 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/502
    • @Ben-Epstein made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/503
    • @pberba made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/315
    • @rahulporuri made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/516

    Full Changelog: https://github.com/scikit-learn-contrib/hdbscan/compare/0.8.26...0.8.27

  • 0.8.12(Jan 20, 2018)

  • 0.8.4(Jan 5, 2017)

We introduce the density based clustering validity index from Density Based Clustering Validation by Moulavi, Jaskowiak, Campello, Zimek and Sander, as well as some minor bug fixes and the option to match the reference implementation of HDBSCAN*.

  • 0.8(May 31, 2016)

  • 0.7.3(Apr 29, 2016)
