Dimensionality Reduction + Clustering + Unsupervised Score Metrics
- Introduction
- Installation
- Usage
- Hyperparameters matter
- BayesSearch example
1. Introduction
DimReductionClustering is a sklearn-style estimator that reduces the dimension of your data and then applies an unsupervised clustering algorithm. The quality of the clustering can be assessed with different metrics. The steps of the pipeline are the following:
- Perform a dimension reduction of the data using UMAP
- Numerically find the best epsilon parameter for DBSCAN
- Perform a density-based clustering method: DBSCAN
- Estimate cluster quality using silhouette score or DBCV
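For orientation, here is a minimal hand-rolled sketch of the same steps using umap-learn and scikit-learn directly, assuming `X` is your data matrix. This illustrates the flow, not the package's internals; the estimator also picks epsilon automatically (see section 4.2), which is done by hand here:

```python
import umap
from sklearn.cluster import DBSCAN

# Step 1: reduce dimension with UMAP
embedding = umap.UMAP(n_components=2, min_dist=0.0).fit_transform(X)

# Steps 2-3: density-based clustering on the embedding
# (eps is hard-coded here; the package estimates it from a k-distance graph)
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(embedding)
```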
2. Installation
Use the package manager pip to install DimReductionClustering as below. Rerun these commands to check for and install updates.
!pip install umap-learn
!pip install git+https://github.com/christopherjenness/DBCV.git
!pip install git+https://github.com/MathieuCayssol/DimReductionClustering.git
3. Usage
Example on the MNIST dataset.
- Import the data
import numpy as np
from sklearn.model_selection import train_test_split
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Flatten each 28x28 image into a 784-dimensional vector
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1] * x_train.shape[2]))
# Keep a stratified 10% subsample to speed things up
X, X_test, Y, Y_test = train_test_split(x_train, y_train, stratify=y_train, test_size=0.9)
- Instantiate and fit the model (same interface as sklearn estimators)
model = DimReductionClustering(n_components=2, min_dist=0.000001, score_metric='silhouette', knn_topk=8, min_pts=4).fit(X)
Fitting also returns the epsilon found using the elbow method (see section 4.2).
- Show the 2D plot :
model.display_plotly()
- Get the score (Silhouette coefficient here)
model.score()
4. Hyperparameters matter
4.1 UMAP (dim reduction)
- `n_neighbors` (global/local tradeoff; default: 15, typical range from 2 to 1/4 of the data size): a low value glues small chains together (more local view), a high value glues big chains together (more global view).
- `min_dist` (0 to 0.99): the minimum distance apart that points are allowed to be in the low-dimensional representation. Low values of `min_dist` result in clumpier embeddings, which is useful if you are interested in clustering or in finer topological structure. Larger values of `min_dist` prevent UMAP from packing points together and focus on preserving the broad topological structure instead.
- `n_components`: dimension of the low-dimensional space, usually 2 or 3.
- `metric` (`'euclidean'` by default): for NLP, `'cosine'` is a good choice since infrequent and frequent words have vectors of different magnitudes.
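As a quick illustration of these knobs, here is a short umap-learn snippet (the values are examples, not recommendations, and `X` is assumed to be your data matrix):

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,   # global/local tradeoff: lower = more local structure
    min_dist=0.0,     # near zero packs points tightly, handy before clustering
    n_components=2,   # embed into 2D for plotting
    metric='cosine',  # e.g. for NLP; the default is 'euclidean'
)
embedding = reducer.fit_transform(X)
```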
4.2 DBSCAN (clustering)
- `min_pts`: MinPts ≥ 3. Basic rule: 2 × dimension (4 for 2D, 6 for 3D). Use higher values for noisy data.
- Epsilon: the maximum distance between two samples for one to be considered as in the neighborhood of the other. It is found numerically from a k-distance graph: compute the distance to the k-th nearest neighbor for each point, sort the results in descending order, and locate the elbow by orthogonal projection onto the line between the first and last points of the graph; the y-coordinate of the point maximizing d((x,y), Proj(x,y)) is the optimal epsilon (a sketch follows this list). Note that there is no epsilon hyperparameter in the implementation, only the k-th neighbor for KNN.
- `knn_topk`: the k-th nearest neighbor used for the k-distance graph. Between 3 and 20.
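The following is a minimal sketch of the elbow search described above. The `estimate_epsilon` helper is written here for illustration (it is not the package's internal function), and `embedding` is assumed to be the UMAP output:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_epsilon(embedding, k=8):
    # k-distance graph: distance from each point to its k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
    dists, _ = nn.kneighbors(embedding)   # column 0 is the point itself
    kth = np.sort(dists[:, -1])[::-1]     # sort descending

    # Line between the first and last points of the graph
    n = len(kth)
    pts = np.column_stack([np.arange(n), kth])
    start, end = pts[0], pts[-1]
    direction = (end - start) / np.linalg.norm(end - start)

    # Orthogonal distance of every point to that line; the farthest is the elbow
    rel = pts - start
    proj = np.outer(rel @ direction, direction)
    ortho = np.linalg.norm(rel - proj, axis=1)
    return kth[np.argmax(ortho)]          # y-coordinate at the elbow
```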
4.3 Score metric
- Silhouette coefficient
- Calinski-Harabasz
- DBCV: very slow for a high number of features (see https://github.com/christopherjenness/DBCV for details).
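For reference, these metrics can also be computed directly, assuming `embedding` and `labels` from the sketches above (this shows the underlying functions, not necessarily how the package calls them):

```python
from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from DBCV import DBCV  # from the repository installed above

mask = labels != -1  # drop DBSCAN noise points before scoring
print(silhouette_score(embedding[mask], labels[mask]))
print(calinski_harabasz_score(embedding[mask], labels[mask]))
print(DBCV(embedding[mask], labels[mask], dist_function=euclidean))
```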
5. BayesSearch example
!pip install scikit-optimize
from skopt.space import Integer, Real
from skopt import BayesSearchCV
search_space = list()
#UMAP Hyperparameters
search_space.append(Integer(5, 200, name='n_neighbors', prior='uniform'))
search_space.append(Real(0.0000001, 0.2, name='min_dist', prior='uniform'))
#Search epsilon with KNN Hyperparameters
search_space.append(Integer(3, 20, name='knn_topk', prior='uniform'))
#DBSCAN Hyperparameters
search_space.append(Integer(4, 15, name='min_pts', prior='uniform'))
params = {space.name: space for space in search_space}
# Clustering is unsupervised: use a single "split" where the model is
# fit and scored on the same full dataset
train_indices = list(range(X.shape[0]))
test_indices = list(range(X.shape[0]))
cv = [(train_indices, test_indices)]
clf = BayesSearchCV(estimator=DimReductionClustering(), search_spaces=params, n_jobs=-1, cv=cv)
clf.fit(X)
clf.best_params_  # best hyperparameters found
clf.best_score_   # corresponding score