A high performance implementation of HDBSCAN clustering.

Overview

HDBSCAN

HDBSCAN - Hierarchical Density-Based Spatial Clustering of Applications with Noise. Performs DBSCAN over varying epsilon values and integrates the result to find a clustering that gives the best stability over epsilon. This allows HDBSCAN to find clusters of varying densities (unlike DBSCAN), and be more robust to parameter selection.

In practice this means that HDBSCAN returns a good clustering straight away with little or no parameter tuning -- and the primary parameter, minimum cluster size, is intuitive and easy to select.

HDBSCAN is ideal for exploratory data analysis; it's a fast and robust algorithm that you can trust to return meaningful clusters (if there are any).

Based on the papers:

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017 [pdf]

R. Campello, D. Moulavi, and J. Sander, Density-Based Clustering Based on Hierarchical Density Estimates In: Advances in Knowledge Discovery and Data Mining, Springer, pp 160-172. 2013

Documentation, including tutorials, is available on ReadTheDocs at http://hdbscan.readthedocs.io/en/latest/.

Notebooks comparing HDBSCAN to other clustering algorithms, explaining how HDBSCAN works, and comparing performance with other Python clustering implementations are also available.

How to use HDBSCAN

The hdbscan package inherits from sklearn classes, and thus drops in neatly next to other sklearn clusterers with an identical calling API. Similarly, it supports input in a variety of formats: an array (or pandas dataframe, or sparse matrix) of shape (num_samples x num_features), or an array (or sparse matrix) giving a distance matrix between samples.

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10)
cluster_labels = clusterer.fit_predict(data)
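
If you already have pairwise distances rather than raw feature vectors, a distance matrix can be passed instead. A minimal sketch (the Euclidean distance matrix here is purely illustrative):

import hdbscan
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances

data, _ = make_blobs(1000)

# Precompute a pairwise distance matrix and tell HDBSCAN it is precomputed.
distance_matrix = pairwise_distances(data, metric='euclidean')

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, metric='precomputed')
cluster_labels = clusterer.fit_predict(distance_matrix)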

Performance

Significant effort has been put into making the hdbscan implementation as fast as possible. It is orders of magnitude faster than the reference implementation in Java, and is currently faster than highly optimized single linkage implementations in C and C++. Version 0.7 performance can be seen in this notebook. In particular, performance on low dimensional data is better than sklearn's DBSCAN, and via support for caching with joblib, re-clustering with different parameters can be almost free.
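
As a rough sketch of the joblib caching mentioned above (assuming the memory parameter, which accepts a joblib Memory object or a cache directory):

import hdbscan
from joblib import Memory
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

# Cache expensive intermediate results on disk so that re-clustering the same
# data with different parameters can reuse them.
memory = Memory('./hdbscan_cache', verbose=0)

labels_small = hdbscan.HDBSCAN(min_cluster_size=10, memory=memory).fit_predict(data)
labels_large = hdbscan.HDBSCAN(min_cluster_size=25, memory=memory).fit_predict(data)

The first fit pays the full cost; subsequent fits on the same data can reuse the cached intermediate computations.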

Additional functionality

The hdbscan package comes equipped with visualization tools to help you understand your clustering results. After fitting the data, the clusterer object has attributes for:

  • The condensed cluster hierarchy
  • The robust single linkage cluster hierarchy
  • The reachability distance minimal spanning tree

All of these come equipped with methods for plotting and converting to pandas or NetworkX formats for further analysis. See the notebook on how HDBSCAN works for examples and further details.
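
For example, with a fitted clusterer the hierarchies can be plotted or exported along these lines (a sketch; gen_min_span_tree=True is needed so the spanning tree is retained):

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, gen_min_span_tree=True).fit(data)

clusterer.condensed_tree_.plot()         # condensed cluster hierarchy
clusterer.single_linkage_tree_.plot()    # robust single linkage hierarchy
clusterer.minimum_spanning_tree_.plot()  # mutual reachability minimum spanning tree

# Export for further analysis
tree_df = clusterer.condensed_tree_.to_pandas()
tree_graph = clusterer.condensed_tree_.to_networkx()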

The clusterer objects also have an attribute providing cluster membership strengths, resulting in optional soft clustering (and no further compute expense). Finally, each cluster also receives a persistence score giving the stability of the cluster over the range of distance scales present in the data. This provides a measure of the relative strength of clusters.
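
A brief sketch of accessing these attributes on a fitted clusterer (the attribute names assumed here are probabilities_ for membership strengths and cluster_persistence_ for the persistence scores):

# Per-point membership strength in the assigned cluster (0 for noise points)
membership_strengths = clusterer.probabilities_

# Per-cluster persistence (stability) scores
persistence = clusterer.cluster_persistence_

# e.g. flag points that are only weakly attached to their cluster
weak_points = data[membership_strengths < 0.05]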

Outlier Detection

The HDBSCAN clusterer objects also support the GLOSH outlier detection algorithm. After fitting the clusterer to data, the outlier scores can be accessed via the outlier_scores_ attribute. The result is a vector of score values, one for each data point that was fit. Higher scores represent more outlier-like objects. Selecting outliers via upper quantiles is often a good approach.
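
A minimal sketch of the quantile-based selection described above, using a fitted clusterer:

import numpy as np

# Continuing with a fitted clusterer from the examples above
scores = clusterer.outlier_scores_

# Treat, say, the top 2% of scores as outliers; the exact quantile is a judgment call.
threshold = np.quantile(scores, 0.98)
outlier_indices = np.where(scores > threshold)[0]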

Based on the paper:
R.J.G.B. Campello, D. Moulavi, A. Zimek and J. Sander Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection, ACM Trans. on Knowledge Discovery from Data, Vol 10, 1 (July 2015), 1-51.

Robust single linkage

The hdbscan package also provides support for the robust single linkage clustering algorithm of Chaudhuri and Dasgupta. As with the HDBSCAN implementation, this is a high performance version of the algorithm, outperforming scipy's standard single linkage implementation. The robust single linkage hierarchy is available as an attribute of the robust single linkage clusterer, again with the ability to plot or export the hierarchy, and to extract flat clusterings at a given cut level and gamma value.

Example usage:

import hdbscan
from sklearn.datasets import make_blobs

data, _ = make_blobs(1000)

clusterer = hdbscan.RobustSingleLinkage(cut=0.125, k=7)
cluster_labels = clusterer.fit_predict(data)
hierarchy = clusterer.cluster_hierarchy_
alt_labels = hierarchy.get_clusters(0.100, 5)
hierarchy.plot()
Based on the paper:
K. Chaudhuri and S. Dasgupta. "Rates of convergence for the cluster tree." In Advances in Neural Information Processing Systems, 2010.

Installing

Easiest install, if you have Anaconda (thanks to conda-forge which is awesome!):

conda install -c conda-forge hdbscan

PyPI install, presuming you have an up to date pip:

pip install hdbscan

Binary wheels for a number of platforms are available thanks to the work of Ryan Helinski <[email protected]>.

If pip is having difficulties pulling the dependencies, we suggest first upgrading pip to at least version 10 and trying again:

pip install --upgrade pip
pip install hdbscan

Otherwise, install the dependencies manually using conda, followed by pulling hdbscan from pip:

conda install cython
conda install numpy scipy
conda install scikit-learn
pip install hdbscan

For a manual install of the latest code directly from GitHub:

pip install --upgrade git+https://github.com/scikit-learn-contrib/hdbscan.git#egg=hdbscan

Alternatively download the package, install requirements, and manually run the installer:

wget https://github.com/scikit-learn-contrib/hdbscan/archive/master.zip
unzip master.zip
rm master.zip
cd hdbscan-master

pip install -r requirements.txt

python setup.py install

Running the Tests

The package tests can be run after installation using the command:

nosetests -s hdbscan

or, if nose is installed but nosetests is not in your PATH variable:

python -m nose -s hdbscan

If one or more of the tests fail, please report a bug at https://github.com/scikit-learn-contrib/hdbscan/issues/new

Python Version

The hdbscan library supports both Python 2 and Python 3. However, we recommend Python 3 as the better option if it is available to you.

Help and Support

For simple issues you can consult the FAQ in the documentation. If your issue is not suitably resolved there, please check the issues on GitHub. Finally, if no solution is available there, feel free to open an issue; the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

L. McInnes, J. Healy, S. Astels, hdbscan: Hierarchical density based clustering In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017
@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

To reference the high performance algorithm developed in this library, please cite our paper in the ICDMW 2017 proceedings.

McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017
@inproceedings{mcinnes2017accelerated,
  title={Accelerated Hierarchical Density Based Clustering},
  author={McInnes, Leland and Healy, John},
  booktitle={Data Mining Workshops (ICDMW), 2017 IEEE International Conference on},
  pages={33--42},
  year={2017},
  organization={IEEE}
}

Licensing

The hdbscan package is 3-clause BSD licensed. Enjoy.

Comments
  • Import hdbscan ISSUE

    Hi, I have followed all the steps for installing hdbscan, but still I'm getting the error:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-29-3f5a460d7435> in <module>
    ----> 1 import hdbscan
    
    /opt/conda/lib/python3.6/site-packages/hdbscan/__init__.py in <module>
    ----> 1 from .hdbscan_ import HDBSCAN, hdbscan
          2 from .robust_single_linkage_ import RobustSingleLinkage, robust_single_linkage
          3 from .validity import validity_index
          4 from .prediction import approximate_predict, membership_vector, all_points_membership_vectors
          5 
    
    /opt/conda/lib/python3.6/site-packages/hdbscan/hdbscan_.py in <module>
         19 from scipy.sparse import csgraph
         20 
    ---> 21 from ._hdbscan_linkage import (single_linkage,
         22                                mst_linkage_core,
         23                                mst_linkage_core_vector,
    
    hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage()
    
    AttributeError: type object 'hdbscan._hdbscan_linkage.UnionFind' has no attribute '__reduce_cython__'
    

Does someone know how to fix it? Best,

    opened by greg2626 30
  • ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly

    python: Python 2.7.15rc1

    OS: Linux 6a039c3530c7 4.15.0-48-generic #51-Ubuntu SMP Wed Apr 3 08:28:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

    pip: pip 19.1

    tried running: pip install hdbscan

  Building wheel for hdbscan (PEP 517): finished with status 'error'
  ERROR: Complete output from command /opt/conda/bin/python /opt/conda/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpvwnr9hhz:
  ERROR: running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.linux-x86_64-3.6
  creating build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/prediction.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/robust_single_linkage_.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/__init__.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/validity.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/plots.py -> build/lib.linux-x86_64-3.6/hdbscan
  copying hdbscan/hdbscan_.py -> build/lib.linux-x86_64-3.6/hdbscan
  creating build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/test_rsl.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/__init__.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  copying hdbscan/tests/test_hdbscan.py -> build/lib.linux-x86_64-3.6/hdbscan/tests
  running build_ext
  cythoning hdbscan/_hdbscan_tree.pyx to hdbscan/_hdbscan_tree.c
  building 'hdbscan._hdbscan_tree' extension
  creating build/temp.linux-x86_64-3.6
  creating build/temp.linux-x86_64-3.6/hdbscan
  gcc -pthread -B /opt/conda/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/opt/conda/include/python3.6m -I/tmp/pip-build-env-s09rgxp0/overlay/lib/python3.6/site-packages/numpy/core/include -c hdbscan/_hdbscan_tree.c -o build/temp.linux-x86_64-3.6/hdbscan/_hdbscan_tree.o
  /tmp/pip-build-env-s09rgxp0/overlay/lib/python3.6/site-packages/Cython/Compiler/Main.py:367: FutureWarning: Cython directive 'language_level' not set, using 2 for now (Py2). This will change in a later release! File: /tmp/pip-install-rfyrnh0q/hdbscan/hdbscan/_hdbscan_tree.pyx
      tree = Parsing.p_module(s, pxd, full_module_name)
  error: command 'gcc' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for hdbscan
  Running setup.py clean for hdbscan
Failed to build hdbscan
ERROR: Could not build wheels for hdbscan which use PEP 517 and cannot be installed directly
The command '/bin/sh -c pip install hdbscan' returned a non-zero code: 1
    opened by boompig 27
  • "ValueError: zero-size array to reduction operation minimum which has no identity" with no leafs

Hi, I think I might have, by accident, tracked down an error related to #115 and #144; please see below:

    Using the current master branch on Win 10 64 bits and Python 2.7.14

    import numpy as np
    import matplotlib.pyplot as plt
    
    from hdbscan import HDBSCAN
    
    
    # Generate data
    test_data = np.array([
        [0.0, 0.0],
        [1.0, 1.0],
        [0.8, 1.0],
        [1.0, 0.8],
        [0.8, 0.8]])
    
    # HDBSCAN
    np.random.seed(1)
    hdb_unweighted = HDBSCAN(min_cluster_size=3, gen_min_span_tree=True, allow_single_cluster=True)
    hdb_unweighted.fit(test_data)
    
    fig = plt.figure()
    cd = hdb_unweighted.condensed_tree_
    cd.plot()
    fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()
    

    Whole traceback ("anonymised"):

    Traceback (most recent call last):
      File "...\JetBrains\PyCharm 2017.2.4\helpers\pydev\pydev_run_in_console.py", line 37, in run_file
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File ".../.PyCharm2017.3/config/scratches/scratch_2.py", line 22, in <module>
        cd.plot()
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 321, in plot
        max_rectangle_per_icicle=max_rectangles_per_icicle)
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 104, in get_plot_data
        leaves = _get_leaves(self._raw_tree)
      File "build\bdist.win-amd64\egg\hdbscan\plots.py", line 44, in _get_leaves
        root = cluster_tree['parent'].min()
      File "C:\ProgramData\Anaconda3\envs\venv_temp_hdbscan_dev_py27\lib\site-packages\numpy\core\_methods.py", line 29, in _amin
        return umr_minimum(a, axis, None, out, keepdims)
    ValueError: zero-size array to reduction operation minimum which has no identity
    

    I have tracked down the problem to lines 42-45 of plots.py:

    def _get_leaves(condensed_tree):
        cluster_tree = condensed_tree[condensed_tree['child_size'] > 1]
        root = cluster_tree['parent'].min()
        return _recurse_leaf_dfs(cluster_tree, root)
    

    cluster_tree created here is empty, so line 44 throws an error.

    I am not sure if there is any solution to this except maybe plotting the single_linkage_tree_?

    opened by m-dz 22
  • Usage with images

    Tried using the implementation as is for working with images, as was being done in sklearn using KMeans / MiniBatchKMeans / Meanshift clustering. But consistently run into MemoryError (for images as small as 200x200 as well). Here is a sample code -

    import cv2
    import numpy as np
    import hdbscan
    
    image = cv2.imread('/home/ubuntu/x.jpg')
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = image.reshape((image.shape[0] * image.shape[1], 3))
    
    clusterer = hdbscan.HDBSCAN(min_cluster_size=100)
    cluster_labels = clusterer.fit_predict(image)
    

    Error stack trace :

    MemoryError                               Traceback (most recent call last)
    <ipython-input-12-b887e72bd6d5> in <module>()
    ----> 1 cluster_labels = clusterer.fit_predict(image)
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit_predict(self, X, y)
        338             cluster labels
        339         """
    --> 340         self.fit(X)
        341         return self.labels_
        342 
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in fit(self, X, y)
        320          self._condensed_tree,
        321          self._single_linkage_tree,
    --> 322          self._min_spanning_tree) = hdbscan(X, **self.get_params())
        323         return self
        324 
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in hdbscan(X, min_cluster_size, min_samples, metric, p, algorithm)
        235     else:
        236         return _hdbscan_large_kdtree(X, min_cluster_size, 
    --> 237                                      min_samples, metric, p)
        238 
        239 class HDBSCAN(BaseEstimator, ClusterMixin):
    
    /usr/local/lib/python2.7/dist-packages/hdbscan-0.1-py2.7-linux-x86_64.egg/hdbscan/hdbscan_.pyc in _hdbscan_large_kdtree(X, min_cluster_size, min_samples, metric, p)
        107         p = 2
        108 
    --> 109     mutual_reachability_ = kdtree_pdist_mutual_reachability(X, metric, p, min_samples)
        110 
        111     min_spanning_tree = mst_linkage_core_pdist(mutual_reachability_)
    
    /home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2820)()
    
    /home/vg/.python-eggs/hdbscan-0.1-py2.7-linux-x86_64.egg-tmp/hdbscan/_hdbscan_reachability.so in hdbscan._hdbscan_reachability.kdtree_pdist_mutual_reachability (hdbscan/_hdbscan_reachability.c:2432)()
    
    /usr/lib/python2.7/dist-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
       1174 
       1175     m, n = s
    -> 1176     dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)
       1177 
       1178     wmink_names = ['wminkowski', 'wmi', 'wm', 'wpnorm']
    
    MemoryError:
    

    Any obvious problem with the code? Or is this to be expected?

    opened by varadgunjal 21
  • Added Cosine Distance as Metric

I believe that using the cosine distance can be important for some applications. Based on my tests, clustering word embeddings or sentence embeddings using Euclidean distance gives bad results; using the cosine distance improves them. I believe that's because of the norm.

    opened by brunoalano 20
  • [WIP] Sample weighting feature implementation

    As discussed, a sample weighting implementation PR. "Tested" with all 6 algorithm options, but lacking any formal tests (on a roadmap...).

Weights are basically used as starting sizes for tree creation in _hdbscan_linkage.pyx; the rest (e.g. core distance calculation) is (hopefully) untouched. As you can see in the commit log, initially weights were passed down to the core distance calculation, but this did not seem to be a good idea (it was causing all zero-weight points to be virtually "excluded" from clustering); we believe this approach is much more appropriate.

    A quick demonstration (please uncomment plots if feeling adventurous):

    '''
    Example of weighted HDBSCAN.
    '''
    
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    from sklearn.datasets import make_blobs
    from sklearn.utils import shuffle
    from sklearn.preprocessing import StandardScaler
    
    from hdbscan import HDBSCAN
    
    plot_kwds = {'linewidths':0}
    palette = sns.color_palette("hls", 20)
    
    
    # Generate data
    np.random.seed(1)
    X, y = make_blobs(n_samples=400, cluster_std=3, shuffle=True, random_state=10)
    X, y = shuffle(X, y, random_state=7)
    X = StandardScaler().fit_transform(X)
    
    sample_weights = np.floor(np.random.standard_gamma(shape=0.5, size=len(X))) * 4
    
    sizes = ((sample_weights+1)*10).astype(int)
    alphas = np.fromiter((0.5 if s < 1 else 0.75 for s in sample_weights), np.float, 400).reshape(-1,1)
    
    algorithm = 'best'
    # algorithm = 'generic'
    # algorithm = 'prims_kdtree'
    # algorithm = 'prims_balltree'
    # algorithm = 'boruvka_kdtree'
    # algorithm = 'boruvka_balltree'
    
    min_cluster_size, min_samples = 8, 1
    
    
    # (Unweighted) HDBSCAN
    np.random.seed(1)
    hdb_unweighted = HDBSCAN(
        min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
        algorithm=algorithm, prediction_data=True)
    hdb_unweighted.fit(X)
    
    cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_unweighted.labels_])
    cluster_colours = np.hstack([cluster_colours, alphas])
    
    fig = plt.figure()
    plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
    fig.suptitle('Unweighted HDBSCAN'); plt.show()
    
    ## Lots of plots, all working!
    fig = plt.figure()
    hdb_unweighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
    fig.suptitle('Unweighted HDBSCAN minimum spanning tree'); plt.show()
    fig = plt.figure()
    hdb_unweighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
    fig.suptitle('Unweighted HDBSCAN single linkage tree'); plt.show()
    fig = plt.figure()
    hdb_unweighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
    fig.suptitle('Unweighted HDBSCAN condensed tree plot'); plt.show()
    
    
    # (Weighted) HDBSCAN
    np.random.seed(1)
    hdb_weighted = HDBSCAN(
        min_cluster_size=min_cluster_size, min_samples=min_samples, gen_min_span_tree=True, allow_single_cluster=False,
        algorithm=algorithm, prediction_data=True)
    hdb_weighted.fit(X, sample_weights=sample_weights)
    
    cluster_colours = np.asarray([palette[lab] if lab >= 0 else (0.0, 0.0, 0.0) for lab in hdb_weighted.labels_])
    cluster_colours = np.hstack([cluster_colours, alphas])
    
    fig = plt.figure()
    plt.scatter(X.T[0], X.T[1], s=sizes, c=cluster_colours, **plot_kwds)
    fig.suptitle('Weighted HDBSCAN'); plt.show()
    
    ## Lots of plots, all working (except one or two warnings...)!
    fig = plt.figure()
    hdb_weighted.minimum_spanning_tree_.plot(edge_cmap='viridis', edge_alpha=0.5, node_size=40, edge_linewidth=2)
    fig.suptitle('Weighted HDBSCAN minimum spanning tree'); plt.show()
    # single_linkage_tree throws "plots.py:563: RuntimeWarning: divide by zero encountered in log2" but works.
    fig = plt.figure()
    hdb_weighted.single_linkage_tree_.plot(cmap='viridis', colorbar=True)
    fig.suptitle('Weighted HDBSCAN single linkage tree'); plt.show()
    fig = plt.figure()
    hdb_weighted.condensed_tree_.plot(select_clusters=True, selection_palette=palette, log_size=True)
    fig.suptitle('Weighted HDBSCAN condensed tree plot'); plt.show()
    
    ## Check weights by cluster:
    unweighted_weights = np.asarray([[lab, sum(sample_weights[hdb_unweighted.labels_ == lab])] for lab in set(hdb_unweighted.labels_)])
    weighted_weights = np.asarray([[lab, sum(sample_weights[hdb_weighted.labels_ == lab])] for lab in set(hdb_weighted.labels_)])
    print(unweighted_weights)
    print(weighted_weights)
    
    

    We are VERY open to discussion.

    enhancement new feature 
    opened by m-dz 18
  • max_cluster_size parameter

    Hi, this is the PR for issue #408 ! I've done some quick testing and it seems to be working correctly - the code changes are quite minor. I'm not 100% sure I'm correctly computing the 'size' of clusters - you can see I'm counting the size of a cluster as the 'child_size' of the edge that it's a child of, but I don't know if that incorrectly counts some points that have already fallen out of that cluster. Would it be more correct to count the size of non-leaf clusters as the sum of the child_size of the two clusters branching off from it?

    One other notable thing is that I haven't yet edited the flat.py code - I wasn't sure if a max_cluster_size parameter made sense there, and I don't understand that bit of the codebase that well. All the functions where max_cluster_size is used have a default value of 0 (i.e. no max size) set, so all other code should be unaffected by this change, even if they're still expecting the old function signature.

    opened by Rocketknight1 15
  • Crash when allow_single_cluster used with cluster_selection_epsilon

    When I use cluster_selection_epsilon=x where x > 0 and allow_single_cluster=True together, HDBSCAN crashes.

    I am using those two options together to try and get the no_structure toy dataset (bottom row, square) clustered properly. I want the square to be completely blue like how DBSCAN does it. When I only use cluster_selection_epsilon, multiple clusters appear in that square. When I only use allow_single_cluster=True, part of that square is grey. I think I can only get the desired result using both of those arguments, but HDBSCAN crashes when I do that.

    Code

    import numpy as np
    import hdbscan
    
    if __name__ == '__main__':
        no_structure = np.random.rand(1500, 2)
        clusterer = hdbscan.HDBSCAN(min_cluster_size=15, cluster_selection_epsilon=3, allow_single_cluster=True)
        clusterer.fit(no_structure)
    

    Expected behavior

    HDBSCAN to cluster the data without crashing. Preferably, painting all points in the square blue as described.

    Actual behavior

    HDBSCAN crashes with the following traceback:

    Traceback (most recent call last):
      File "/home/home/PycharmProjects/sandbox/crash_example.py", line 7, in <module>
        clusterer.fit(no_structure)
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 919, in fit
        self._min_spanning_tree) = hdbscan(X, **kwargs)
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 632, in hdbscan
        return _tree_to_labels(X,
      File "/home/home/PycharmProjects/sandbox/venv/lib/python3.8/site-packages/hdbscan/hdbscan_.py", line 59, in _tree_to_labels
        labels, probabilities, stabilities = get_clusters(condensed_tree,
      File "hdbscan/_hdbscan_tree.pyx", line 645, in hdbscan._hdbscan_tree.get_clusters
      File "hdbscan/_hdbscan_tree.pyx", line 733, in hdbscan._hdbscan_tree.get_clusters
      File "hdbscan/_hdbscan_tree.pyx", line 631, in hdbscan._hdbscan_tree.epsilon_search
    IndexError: index 0 is out of bounds for axis 0 with size 0
    
    opened by danielzgtg 14
  • Only a single CPU core is used

Not sure if this is a bug or a feature, but I have observed that on my Ubuntu 14.04 machine HDBSCAN only ever uses one core; some other cores also spike occasionally, but 90% of the time it's just a single core at 100% and all the others at 0%.

    Is this algorithm not parallelizable? Or has it not been done yet?

    opened by ghost 12
  • BrokenProcessPool: A task has failed to un-serialize.

This is related to this issue in BERTopic, and this issue in UMAP. Probably related to this issue and this issue, too.

    I am currently Using UMAP to reduce the dimension and then using HDBSCAN to cluster the embeddings. However, I am running into the following error. Any idea why?

    My data size is 10 million rows and 5 dimensions (reduced with UMAP from 384 dimensions). I have 1TB of RAM and 32 cores.

    I am using Jupyter Notebook.

    ---------------------------------------------------------------------------
    _RemoteTraceback                          Traceback (most recent call last)
    _RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/externals/loky/process_executor.py", line 404, in _process_worker
        call_item = call_queue.get(block=True, timeout=timeout)
      File "/usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/multiprocessing/queues.py", line 122, in get
        return _ForkingPickler.loads(res)
      File "sklearn/neighbors/_binary_tree.pxi", line 1057, in sklearn.neighbors._kd_tree.BinaryTree.__setstate__
      File "sklearn/neighbors/_binary_tree.pxi", line 999, in sklearn.neighbors._kd_tree.BinaryTree._update_memviews
      File "stringsource", line 658, in View.MemoryView.memoryview_cwrapper
      File "stringsource", line 349, in View.MemoryView.memoryview.__cinit__
    ValueError: buffer source array is read-only
    """
    
    The above exception was the direct cause of the following exception:
    
    BrokenProcessPool                         Traceback (most recent call last)
    /tmp/ipykernel_778601/1248467627.py in <module>
    ----> 1 test1=hdbscan_model.fit(embedding_test)
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in fit(self, X, y)
        917          self._condensed_tree,
        918          self._single_linkage_tree,
    --> 919          self._min_spanning_tree) = hdbscan(X, **kwargs)
        920 
        921         if self.prediction_data:
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in hdbscan(X, min_cluster_size, min_samples, alpha, cluster_selection_epsilon, metric, p, leaf_size, algorithm, memory, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, cluster_selection_method, allow_single_cluster, match_reference_implementation, **kwargs)
        608                                            gen_min_span_tree, **kwargs)
        609             else:
    --> 610                 (single_linkage_tree, result_min_span_tree) = memory.cache(
        611                     _hdbscan_boruvka_kdtree)(X, min_samples, alpha,
        612                                              metric, p, leaf_size,
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/memory.py in __call__(self, *args, **kwargs)
        350 
        351     def __call__(self, *args, **kwargs):
    --> 352         return self.func(*args, **kwargs)
        353 
        354     def call_and_shelve(self, *args, **kwargs):
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/hdbscan/hdbscan_.py in _hdbscan_boruvka_kdtree(X, min_samples, alpha, metric, p, leaf_size, approx_min_span_tree, gen_min_span_tree, core_dist_n_jobs, **kwargs)
        273 
        274     tree = KDTree(X, metric=metric, leaf_size=leaf_size, **kwargs)
    --> 275     alg = KDTreeBoruvkaAlgorithm(tree, min_samples, metric=metric,
        276                                  leaf_size=leaf_size // 3,
        277                                  approx_min_span_tree=approx_min_span_tree,
    
    hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm.__init__()
    
    hdbscan/_hdbscan_boruvka.pyx in hdbscan._hdbscan_boruvka.KDTreeBoruvkaAlgorithm._compute_bounds()
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in __call__(self, iterable)
       1052 
       1053             with self._backend.retrieval_context():
    -> 1054                 self.retrieve()
       1055             # Make sure that we get a last message telling us we are done
       1056             elapsed_time = time.time() - self._start_time
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/parallel.py in retrieve(self)
        931             try:
        932                 if getattr(self._backend, 'supports_timeout', False):
    --> 933                     self._output.extend(job.get(timeout=self.timeout))
        934                 else:
        935                     self._output.extend(job.get())
    
    /rds/user/jw983/hpc-work/pytorch-env2/lib/python3.9/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
        540         AsyncResults.get from multiprocessing."""
        541         try:
    --> 542             return future.result(timeout=timeout)
        543         except CfTimeoutError as e:
        544             raise TimeoutError from e
    
    /usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in result(self, timeout)
        443                     raise CancelledError()
        444                 elif self._state == FINISHED:
    --> 445                     return self.__get_result()
        446                 else:
        447                     raise TimeoutError()
    
    /usr/local/software/spack/spack-git/opt/spack/linux-rhel7-broadwell/gcc-5.4.0/python-3.9.6-sbr552hsx3zanhgi3ekdjp4rsn6o6ejq/lib/python3.9/concurrent/futures/_base.py in __get_result(self)
        388         if self._exception:
        389             try:
    --> 390                 raise self._exception
        391             finally:
        392                 # Break a reference cycle with the exception in self._exception
    
    BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.
    
    
    opened by ginward 11
  • some cluster_persistence_ outputs are greater than 1?

    Hi,

Quick question though: according to the HDBSCAN documentation, if my interpretation is correct, cluster_persistence_ outputs should be in [0, 1].

    However, I got some weird results like 2.81 as shown below.

import numpy as np
np.round(clusters.cluster_persistence_, 2)

Out[34]:
array([ 2.81, 0.  , 0.44, 0.  , 0.  , 0.  , 0.02, 0.  , 0.  , 0.  , 0.  ,
        0.11, 0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.01, 0.  , 0.04, 0.07,
        0.  , 0.03, 0.  , 0.  , 0.  , 0.01, 0.  , 0.  , 0.01, 0.  , 0.  ,
        0.  , 0.01, 0.01, 0.03, 0.  , 0.03, 0.  , 0.01, 0.01, 0.01, 0.01,
        0.  , 0.02, 0.  , 0.18, 0.  , 0.  , 0.  , 1.06, 0.37, 0.61, 0.77,
        0.  , 0.09, 0.  , 0.23, 0.74, 0.  , 0.24, 1.1 , 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.  , 0.01, 0.01, 0.01, 0.01, 0.  , 0.  , 0.  , 0.  ,
        0.06, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.03,
        0.01, 0.  , 0.  , 0.  , 0.  , 0.  , 0.01, 0.  , 0.03, 0.03, 0.01,
        0.04, 0.  , 0.  , 0.  , 0.01, 0.  , 0.01, 0.  , 0.  , 0.  , 0.  ,
        0.01, 0.  , 0.01, 0.  , 0.  , 0.02, 0.  , 0.  ])

    Am I doing things wrong?

    Cheers,

    Titan

    opened by titaniumrain 11
  • Counter-intuitive noise points

    I have been playing with hdbscan to try and build an intuition for what it is doing. Currently I am running into counter-intuitive behavior when running it on synthetic data. In particular I have been running hdbscan on data sampled evenly from a circle. My understanding of the algorithm suggests it should return a single cluster similar to what dbscan would do with the proper epsilon setting. However, hdbscan is instead identifying a single cluster and a collection of noise points. If I reduce the minimum number of points needed for a cluster below 4 the noise points vanish. Due to the symmetry of the data I'm not seeing why this parameter should make much of a difference on how the clustering works. I'm curious if my intuition is way off or if there is an issue with how I am invoking hdbscan.

    Code:

    from hdbscan import HDBSCAN
    import matplotlib.pyplot as plt
    import numpy as np
    
    min_cluster_size = 4 # Minimum size to cause an issue
    samples = 100
    
    theta = np.linspace(-np.pi, np.pi, samples, endpoint=False)
    data = np.zeros((samples, 2))
    data[:,0] = np.cos(theta)
    data[:,1] = np.sin(theta)
    
    clusterer = HDBSCAN(min_cluster_size=min_cluster_size, allow_single_cluster=True)
    clusterer.fit(data)
    
    labels = set(clusterer.labels_)
    for label in labels:
        cluster = data[clusterer.labels_ == label]
        plt.scatter(cluster[:,0], cluster[:,1])
    
    plt.axis("equal")
    plt.show()
    
    

    Expected output: image

    Actual output: image (note the orange noise points in the lower right)

System information: Python version 3.10.8, hdbscan version 0.8.29

    opened by erooke 0
  • Validity index calculation results in ValueError while calculating min/max

For a clustering use case, I tried different parameters, and while calculating the validity index I ran into the following ValueError:

    <dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:33: RuntimeWarning: invalid value encountered in divide
      result /= distance_matrix.shape[0] - 1
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    Cell In [41], line 1
    ----> 1 validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))
    
    File <dir>/envs/venv/lib/python3.9/site-packages/hdbscan/validity.py:372, in validity_index(X, labels, metric, d, per_cluster_scores, mst_raw_dist, verbose, **kwd_args)
        358     distances_for_mst, core_distances[
        359         cluster_id] = distances_between_points(
        360         X,
       (...)
        367         **kwd_args
        368     )
        370     mst_nodes[cluster_id], mst_edges[cluster_id] = \
        371         internal_minimum_spanning_tree(distances_for_mst)
    --> 372     density_sparseness[cluster_id] = mst_edges[cluster_id].T[2].max()
        374 for i in range(max_cluster_id):
        376     if np.sum(labels == i) == 0:
    
    File <dir>/envs/venv/lib/python3.9/site-packages/numpy/core/_methods.py:40, in _amax(a, axis, out, keepdims, initial, where)
         38 def _amax(a, axis=None, out=None, keepdims=False,
         39           initial=_NoValue, where=True):
    ---> 40     return umr_maximum(a, axis, None, out, keepdims, initial, where)
    
    ValueError: zero-size array to reduction operation maximum which has no identity
    

Following is a toy example I could reproduce it with:

    from hdbscan import validity_index
    validity_index(np.random.rand(5, 2), np.array([-1, 1, 1, 1, 0]))
    
    opened by tacitvenom 2
  • hdbscan installation issue in Microsoft Azure App Service

    pip install hdbscan

    fails inside the Microsoft Azure App Service - Linux environment with the below error:

    ERROR: Could not build wheels for hdbscan, which is required to install pyproject.toml-based projects

    However, conda install -c conda-forge hdbscan

is not possible inside the Azure App Service Linux environment, as we don't use a Conda virtual environment.

Can you please provide a solution for a successful HDBSCAN installation?

    opened by bala1802 0
  • test error

    ............................EEE.EE..........EE..........E

    ERROR: hdbscan.tests.test_hdbscan.test_condensed_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 376, in test_condensed_tree_plot if_matplotlib(clusterer.condensed_tree_.plot)( File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_single_linkage_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 389, in test_single_linkage_tree_plot if_matplotlib(clusterer.single_linkage_tree_.plot)(cmap="Reds") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_min_span_tree_plot

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 397, in test_min_span_tree_plot if_matplotlib(clusterer.minimum_spanning_tree_.plot)(edge_cmap="Reds") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 82, in run_test pytest.skip("Matplotlib not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Matplotlib not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_tree_pandas_output_formats

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 428, in test_tree_pandas_output_formats if_pandas(clusterer.condensed_tree_.to_pandas)() File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 97, in run_test pytest.skip("Pandas not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: Pandas not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_tree_networkx_output_formats

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 436, in test_tree_networkx_output_formats if_networkx(clusterer.condensed_tree_.to_networkx)() File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 112, in run_test pytest.skip("NetworkX not available.") File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/_pytest/outcomes.py", line 175, in skip raise Skipped(msg=reason, allow_module_level=allow_module_level) Skipped: NetworkX not available.

    ====================================================================== ERROR: hdbscan.tests.test_hdbscan.test_hdbscan_is_sklearn_estimator

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_hdbscan.py", line 646, in test_hdbscan_is_sklearn_estimator check_estimator(HDBSCAN) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 610, in check_estimator raise TypeError(msg) TypeError: Passing a class was deprecated in version 0.23 and isn't supported anymore from 0.24.Please pass an instance instead.

    ====================================================================== ERROR: hdbscan.tests.test_prediction_utils.test_safe_always_positive_division

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) TypeError: test_safe_always_positive_division() missing 1 required positional argument: 'denominator'

    ====================================================================== ERROR: hdbscan.tests.test_rsl.test_rsl_is_sklearn_estimator

    Traceback (most recent call last): File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/nose/case.py", line 197, in runTest self.test(*self.arg) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/hdbscan/tests/test_rsl.py", line 197, in test_rsl_is_sklearn_estimator check_estimator(RobustSingleLinkage) File "/home/dh/anaconda3/envs/pointcloud/lib/python3.8/site-packages/sklearn/utils/estimator_checks.py", line 610, in check_estimator raise TypeError(msg) TypeError: Passing a class was deprecated in version 0.23 and isn't supported anymore from 0.24.Please pass an instance instead.


    Ran 57 tests in 1.593s

    FAILED (errors=8)

    opened by slowRiver 0
  • Prediction Data Generation Fails w/ a Warning

    Hello,

    I have the following distance matrix (dist_matrix.npy.zip) that I calculated using some function.

    from scipy.spatial.distance import pdist
    from scipy.spatial.distance import squareform
    dist_condensed = pdist(X, metric = lambda u, v: calc_distance(u[0], v[0]))
    dist_matrix = squareform(dist_condensed)
    

    Then, I fitted a model using the attached distance matrix as follows:

    import hdbscan
    clusterer = hdbscan.HDBSCAN(min_cluster_size = 2, min_samples = 2, metric = 'precomputed', prediction_data = True)
    clusterer.fit(dist_matrix)
    

After running the above command, I got the following warning (which looks to be important if you want to predict some data at a later time):

    hdbscan/hdbscan_.py:1256: UserWarning: Cannot generate prediction data for non-vectorspace inputs -- access to the source data rather than mere distances is required!
    

    Also, I'd like to predict the cluster of some new data points at a later time by passing a distance matrix. Does the approximate_predict method accept a distance matrix (because I used a custom distance matrix originally)? I believe that's not the case, at least based on the documentation.

    I even tried to see whether the provided example (see here) works but I got the same warning (see below).

I would appreciate it if you could help me understand why I get that warning and how I can use the above method to predict the cluster of new data points in the future.

    opened by OMirzaei 0
Releases(0.8.29-1)
  • 0.8.29-1(Oct 31, 2022)

  • 0.8.29(Oct 31, 2022)

    What's Changed

    • Fix malformed list in docs by @dmlls in https://github.com/scikit-learn-contrib/hdbscan/pull/535
    • Incorrect initialization in BallTreeBoruvka dual_tree_traversal by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/508
    • hdbscan: add support to other types of input integers by @jcfaracco in https://github.com/scikit-learn-contrib/hdbscan/pull/540
    • Add parameter to DBCV which toggles the usage of mutual reachability distances by @luis261 in https://github.com/scikit-learn-contrib/hdbscan/pull/552
    • Fixed: validity index no longer returning nan by @tadorfer in https://github.com/scikit-learn-contrib/hdbscan/pull/558
    • Use a positional argument for the cachedir/location by @lmcinnes in https://github.com/scikit-learn-contrib/hdbscan/pull/563

    New Contributors

    • @dmlls made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/535
    • @jcfaracco made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/540
    • @luis261 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/552
    • @tadorfer made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/558
    • @lmcinnes made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/563

    Full Changelog: https://github.com/scikit-learn-contrib/hdbscan/compare/0.8.28...0.8.29

  • 0.8.28.wheels(Feb 8, 2022)

  • 0.8.28(Feb 8, 2022)

  • 0.8.27(Feb 8, 2022)

    What's Changed

    • Added axis control on colourbars by @tomcharnock in https://github.com/scikit-learn-contrib/hdbscan/pull/365
    • Fixing typos by @cassmarcussen in https://github.com/scikit-learn-contrib/hdbscan/pull/366
    • Predict score by @Rhaedonius in https://github.com/scikit-learn-contrib/hdbscan/pull/368
    • fix issue 370 by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/371
    • fix the core distance computation can only use 4 cores. by @GorvinChen in https://github.com/scikit-learn-contrib/hdbscan/pull/377
    • Issue 370 by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/385
    • Fixed line widths for single linkage dendrograms with duplicate lambda_val by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/391
    • Module for flat clustering by @sabarish-akridata in https://github.com/scikit-learn-contrib/hdbscan/pull/398
    • Improved union-find data structure implementation by @kylehofmann in https://github.com/scikit-learn-contrib/hdbscan/pull/412
    • fixes #370 -- epsilon search would crash when root node in leaf list by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/400
    • let root children size==1 to join root node in allow_single based on 1/eps by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/418
    • fix issue-436 unused deprecated joblib option by @neontty in https://github.com/scikit-learn-contrib/hdbscan/pull/438
    • Fix build errors by @gansanay in https://github.com/scikit-learn-contrib/hdbscan/pull/449
    • update build by @gansanay in https://github.com/scikit-learn-contrib/hdbscan/pull/450
    • Fix mismatching clusters probabilities in softclustering by @narjes23 in https://github.com/scikit-learn-contrib/hdbscan/pull/454
    • Fixing typos in "Don't be wrong!" section by @hovikgas in https://github.com/scikit-learn-contrib/hdbscan/pull/455
    • Update pyproject.toml to use oldest-supported-numpy by @stradigi-mario-bruno in https://github.com/scikit-learn-contrib/hdbscan/pull/458
    • Fix numpy 1.20 deprecation warnings by @rmcgibbo in https://github.com/scikit-learn-contrib/hdbscan/pull/468
    • max_cluster_size parameter by @Rocketknight1 in https://github.com/scikit-learn-contrib/hdbscan/pull/410
    • Fix typos by @julienschuermans in https://github.com/scikit-learn-contrib/hdbscan/pull/489
    • handle non-finite valued vectors by @jc-healy in https://github.com/scikit-learn-contrib/hdbscan/pull/497
    • Fixed the bug that joblib uses memory mapping when data size is too large by @ginward in https://github.com/scikit-learn-contrib/hdbscan/pull/495
    • Fix ZeroDivisionError: float division error when computing per_cluster_scores by @Dicksonchin93 in https://github.com/scikit-learn-contrib/hdbscan/pull/502
    • Update requirements.txt by @Ben-Epstein in https://github.com/scikit-learn-contrib/hdbscan/pull/503
    • Fixed off by one errors in min_samples for multiple algorithms by @GregDemand in https://github.com/scikit-learn-contrib/hdbscan/pull/394
    • Modified Prims to generate MST by @pberba in https://github.com/scikit-learn-contrib/hdbscan/pull/315
    • Remove usage of six by @rahulporuri in https://github.com/scikit-learn-contrib/hdbscan/pull/516
    • dbscan cluster extraction by @jc-healy in https://github.com/scikit-learn-contrib/hdbscan/pull/519

    New Contributors

    • @tomcharnock made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/365
    • @cassmarcussen made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/366
    • @Rhaedonius made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/368
    • @neontty made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/371
    • @GorvinChen made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/377
    • @GregDemand made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/391
    • @sabarish-akridata made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/398
    • @kylehofmann made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/412
    • @gansanay made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/449
    • @narjes23 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/454
    • @hovikgas made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/455
    • @stradigi-mario-bruno made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/458
    • @rmcgibbo made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/468
    • @Rocketknight1 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/410
    • @julienschuermans made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/489
    • @jc-healy made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/497
    • @ginward made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/495
    • @Dicksonchin93 made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/502
    • @Ben-Epstein made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/503
    • @pberba made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/315
    • @rahulporuri made their first contribution in https://github.com/scikit-learn-contrib/hdbscan/pull/516

    Full Changelog: https://github.com/scikit-learn-contrib/hdbscan/compare/0.8.26...0.8.27

  • 0.8.12(Jan 20, 2018)

  • 0.8.4(Jan 5, 2017)

We introduce the density based clustering validity index from Density Based Clustering Validation by Moulavi, Jaskowiak, Campello, Zimek and Sander, as well as some minor bug fixes and the option to match the reference implementation of HDBSCAN*.

  • 0.8(May 31, 2016)

  • 0.7.3(Apr 29, 2016)
