DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN

Amazon Web Services - Labs

Last update: Dec 8, 2022

Overview

Amazon DenseClus

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in clustering.

Installation

python3 -m pip install Amazon-DenseClus

Usage

DenseClus requires a pandas DataFrame as input, with both numerical and categorical columns. All preprocessing and extraction are done under the hood: just call fit and then retrieve the clusters!

from denseclus import DenseClus

clf = DenseClus(
    umap_combine_method="intersection_union_mapper",
)
# df is a pandas DataFrame with both numerical and categorical columns
clf.fit(df)

print(clf.score())
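
Since the input has to contain both numerical and categorical columns, it can help to check how pandas sees each column before calling fit. The following is a minimal sketch and not part of DenseClus: the toy data and the helper name split_column_types are hypothetical, and it only assumes that pandas is installed.

```python
from typing import List, Tuple

import pandas as pd

# Hypothetical toy data: two numerical and two categorical columns.
df = pd.DataFrame(
    {
        "age": [23, 45, 31, 52],
        "income": [38000.0, 72000.0, 51000.0, 64000.0],
        "segment": ["a", "b", "a", "c"],
        "region": ["north", "south", "south", "east"],
    }
)


def split_column_types(frame: pd.DataFrame) -> Tuple[List[str], List[str]]:
    """Return (numerical, categorical) column names based on pandas dtypes."""
    numerical = frame.select_dtypes(include="number").columns.tolist()
    categorical = frame.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical


numerical_cols, categorical_cols = split_column_types(df)
print("numerical:", numerical_cols)
print("categorical:", categorical_cols)
```

If either list comes back empty, the frame is not mixed-type in the sense described above, and a column may need a dtype conversion before it is passed to clf.fit(df).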

Examples

A hands-on example with an overview of how to use DenseClus is available as a Jupyter Notebook.

References

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}

@article{mcinnes2017hdbscan,
  title={hdbscan: Hierarchical density based clustering},
  author={McInnes, Leland and Healy, John and Astels, Steve},
  journal={The Journal of Open Source Software},
  volume={2},
  number={11},
  pages={205},
  year={2017}
}

Comments

Getting error on version 0.8.27?

yi10yosa56-algo-1-l1c8r | WARNING: Discarding https://files.pythonhosted.org/packages/32/bb/59a75bc5ac66a9b4f9b8f979e4545af0e98bb1ca4e6ae96b3b956b554223/hdbscan-0.8.27.tar.gz#sha256=e3a418d0d36874f7b6a1bf0b7461f3857fc13a525fd48ba34caed2fe8973aa26 (from https://pypi.org/simple/hdbscan/). Command errored out with exit status 1: /miniconda3/bin/python3 /miniconda3/lib/python3.7/site-packages/pip install --ignore-installed --no-user --prefix /home/model-server/tmp/pip-build-env-kjaud8c5/overlay --no-warn-script-location --no-binary :none: --only-binary :none: -i https://pypi.org/simple -- setuptools wheel cython numpy Check the logs for full command output.
yi10yosa56-algo-1-l1c8r | ERROR: Could not find a version that satisfies the requirement hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | ERROR: No matching distribution found for hdbscan==0.8.27
yi10yosa56-algo-1-l1c8r | Traceback (most recent call last):
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 19, in <module>
yi10yosa56-algo-1-l1c8r |     install("hdbscan==0.8.27")
yi10yosa56-algo-1-l1c8r |   File "/opt/ml/processing/input/code/preprocessing.py", line 13, in install
yi10yosa56-algo-1-l1c8r |     stdout=open(os.devnull, "wb"),
yi10yosa56-algo-1-l1c8r |   File "/miniconda3/lib/python3.7/subprocess.py", line 363, in check_call
yi10yosa56-algo-1-l1c8r |     raise CalledProcessError(retcode, cmd)

Is this right? It looks like the latest HDBSCAN version is 0.8.26: https://github.com/scikit-learn-contrib/hdbscan/releases

question

opened by momonga-ml 2
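
The traceback above shows an install helper that sends pip's output to os.devnull, so the underlying error is only visible through the surrounding job log. The log also suggests that pip found the 0.8.27 source archive and discarded it after a build step failed, which makes pip's full output the useful thing to read. The following is a minimal sketch of a helper that keeps that output. The function names are hypothetical and this is not the code from preprocessing.py; it uses only the standard library.

```python
import subprocess
import sys


def build_pip_command(requirement):
    """Build a pip install command that uses the current interpreter."""
    return [sys.executable, "-m", "pip", "install", requirement]


def run_and_capture(command):
    """Run a command and return (exit code, combined stdout/stderr text)."""
    result = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so build errors are kept
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


def install(requirement):
    """Install a requirement, printing pip's output if the install fails."""
    code, output = run_and_capture(build_pip_command(requirement))
    if code != 0:
        print(output)
        raise RuntimeError(f"pip install failed for {requirement!r}")
```

Calling install("hdbscan==0.8.27") with this helper would print the complete pip log on failure instead of only raising CalledProcessError.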

DenseClus and missing values

Does DenseClus tolerate missing values in the dataset? As far as I can see from the source code, missing values have to be imputed before the data is fed into DenseClus, but I would like that to be confirmed.

opened by jbdatascience 1
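
If imputation before fitting does turn out to be needed, one simple option is to fill numerical gaps with the column median and categorical gaps with the most frequent value. The following is a minimal sketch under that assumption. The helper impute_missing and the toy data are hypothetical, this is not something DenseClus is confirmed to do or require, and other imputation strategies may suit a given dataset better.

```python
import numpy as np
import pandas as pd


def impute_missing(frame: pd.DataFrame) -> pd.DataFrame:
    """Fill numerical gaps with the median and other gaps with the mode."""
    filled = frame.copy()
    for column in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[column]):
            filled[column] = filled[column].fillna(filled[column].median())
        else:
            mode = filled[column].mode(dropna=True)
            # A column with no observed values has no mode; leave it as-is.
            if not mode.empty:
                filled[column] = filled[column].fillna(mode.iloc[0])
    return filled


# Hypothetical toy data with one gap in each column.
raw = pd.DataFrame(
    {
        "income": [38000.0, np.nan, 51000.0, 64000.0],
        "segment": ["a", "b", None, "a"],
    }
)
clean = impute_missing(raw)
print(clean)
```

The function works on a copy, so the original frame stays available for comparing clusters with and without imputed rows.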
What preprocessing is required for a mixed-type dataset (continuous and categorical)?

Hello DenseClus developers: I have one question regarding preprocessing.

If my dataset is a combination of continuous (numerical) and categorical values, what are the steps to use this repo? Do I need to preprocess my data before feeding it to the DenseClus function?

Thank you!

opened by monk1337 1
Will there be a transform function like UMAP has?

For the full embedding/clustering job, it would be really useful to have a transform function so that new data can be transformed and clustered without repeating the embedding, similar to UMAP's transform function.
duplicate

opened by yusuftalhatamer 1
:sparkles: black reformatting

Issue #, if available: The formatting doesn't comply with Black and causes the CI/CD pipeline to fail

Description of changes: Reformat Code with Black

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by sunbc0120 0
Bump joblib from 1.0.1 to 1.2.0
Bumps joblib from 1.0.1 to 1.2.0.

Changelog

Sourced from joblib's changelog.

Release 1.2.0

Fix a security issue where eval(pre_dispatch) could potentially run arbitrary code. Now only basic numerics are supported. joblib/joblib#1327

Make sure that joblib works even when multiprocessing is not available, for instance with Pyodide joblib/joblib#1256

Avoid unnecessary warnings when workers and main process delete the temporary memmap folder contents concurrently. joblib/joblib#1263

Fix memory alignment bug for pickles containing numpy arrays. This is especially important when loading the pickle with mmap_mode != None as the resulting numpy.memmap object would not be able to correct the misalignment without performing a memory copy. This bug would cause invalid computation and segmentation faults with native code that would directly access the underlying data buffer of a numpy array, for instance C/C++/Cython code compiled with older GCC versions or some old OpenBLAS written in platform specific assembly. joblib/joblib#1254

Vendor cloudpickle 2.2.0 which adds support for PyPy 3.8+.

Vendor loky 3.3.0 which fixes several bugs including:

robustly forcibly terminating worker processes in case of a crash (joblib/joblib#1269);

avoiding leaking worker processes in case of nested loky parallel calls;

reliably spawning the correct number of reusable workers.

Release 1.1.0

Fix byte order inconsistency issue during deserialization using joblib.load in cross-endian environment: the numpy arrays are now always loaded to use the system byte order, independently of the byte order of the system that serialized the pickle. joblib/joblib#1181

Fix joblib.Memory bug with the ignore parameter when the cached function is a decorated function.

... (truncated)

Commits

5991350 Release 1.2.0

3fa2188 MAINT cleanup numpy warnings related to np.matrix in tests (#1340)

cea26ff CI test the future loky-3.3.0 branch (#1338)

8aca6f4 MAINT: remove pytest.warns(None) warnings in pytest 7 (#1264)

067ed4f XFAIL test_child_raises_parent_exits_cleanly with multiprocessing (#1339)

ac4ebd5 MAINT add back pytest warnings plugin (#1337)

a23427d Test child raises parent exits cleanly more reliable on macos (#1335)

ac09691 [MAINT] various test updates (#1334)

4a314b1 Vendor loky 3.2.0 (#1333)

bdf47e9 Make test_parallel_with_interactively_defined_functions_default_backend timeo...

Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR

@dependabot recreate will recreate this PR, overwriting any edits that have been made to it

@dependabot merge will merge this PR after your CI passes on it

@dependabot squash and merge will squash and merge this PR after your CI passes on it

@dependabot cancel merge will cancel a previously requested merge and block automerging

@dependabot reopen will reopen this PR if it is closed

@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually

@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)

@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

@dependabot use these labels will set the current labels as the default for future PRs for this repo and language

@dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language

@dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language

@dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

You can disable automated security fix PRs for this repo from the Security Alerts page.

dependencies
opened by dependabot[bot] 0
Fix label misalignment when evaluating UMAP trustworthiness

Issue #18

Hi and thanks a lot for the code and insights. I was reading about the trustworthiness of UMAP and I came across this minor issue, so I thought I should push a quick fix.

Best, Alex

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by AlexMetsai 0
Bump numpy from 1.21.0 to 1.22.0
Bumps numpy from 1.21.0 to 1.22.0.

Release notes

Sourced from numpy's releases.

v1.22.0

NumPy 1.22.0 Release Notes

NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.

A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across applications such as CuPy and JAX.

NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.

New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.

A new configurable allocator for use by downstream projects.

These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

Expired deprecations

Deprecated numeric style dtype strings have been removed

Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

(gh-19539)

Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

(gh-19615)

... (truncated)

Commits

4adc87d Merge pull request #20685 from charris/prepare-for-1.22.0-release

fd66547 REL: Prepare for the NumPy 1.22.0 release.

125304b wip

c283859 Merge pull request #20682 from charris/backport-20416

5399c03 Merge pull request #20681 from charris/backport-20954

f9c45f8 Merge pull request #20680 from charris/backport-20663

794b36f Update armccompiler.py

d93b14e Update test_public_api.py

7662c07 Update init.py

311ab52 Update armccompiler.py

Additional commits viewable in compare view


dependencies
opened by dependabot[bot] 0
enable prediction data flag in the hdbscan model

Issue #, if available: Currently, there is no support for soft clustering.

Description of changes: To use soft clustering, the hdbscan model needs to be initialized with prediction_data=True (the default is False). This PR gives the user the option to pass this variable when initializing the DenseClus class.

for more info: https://hdbscan.readthedocs.io/en/latest/soft_clustering.html

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by itaiara 0
Nb

Issue #, if available: #11

Description of changes: How to tune hyperparameters for HDBSCAN.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
enhancement

opened by momonga-ml 0
example nb

Issue #, if available:

Description of changes: Adding an example NB of UMAP Validation.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

opened by momonga-ml 0

Owner

Amazon Web Services - Labs

AWS Labs

GitHub https://aws.amazon.com/blogs/opensource/introducing-denseclus-an-open-source-clustering-package-for-mixed-type-data/

