Dimensionality Reduction + Clustering + Unsupervised Score Metrics
- Introduction
- Installation
- Usage
- Hyperparameters matter
- BayesSearch example
1. Introduction
DimReductionClustering is a sklearn-style estimator that reduces the dimension of your data and then applies an unsupervised clustering algorithm. The quality of the clustering can be assessed with different metrics. The steps of the pipeline are the following:
- Perform a dimension reduction of the data using UMAP
- Numerically find the best epsilon parameter for DBSCAN
- Perform a density-based clustering method: DBSCAN
- Estimate cluster quality using silhouette score or DBCV
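For orientation, here is a minimal hand-rolled sketch of the same steps using umap-learn and scikit-learn directly, assuming `X` is your data matrix. This illustrates the flow, not the package's internals; the estimator also picks epsilon automatically (see section 4.2), which is done by hand here:

```python
import umap
from sklearn.cluster import DBSCAN

# Step 1: reduce dimension with UMAP
embedding = umap.UMAP(n_components=2, min_dist=0.0).fit_transform(X)

# Steps 2-3: density-based clustering on the embedding
# (eps is hard-coded here; the package estimates it from a k-distance graph)
labels = DBSCAN(eps=0.3, min_samples=4).fit_predict(embedding)
```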
2. Installation
Use the package manager pip to install DimReductionClustering as below. Rerun these commands to check for and install updates.
!pip install umap-learn
!pip install git+https://github.com/christopherjenness/DBCV.git
!pip install git+https://github.com/MathieuCayssol/DimReductionClustering.git
3. Usage
Example on the MNIST dataset.
- Import the data
import numpy as np
from sklearn.model_selection import train_test_split
from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Flatten each 28x28 image into a 784-dimensional vector
x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1] * x_train.shape[2]))
# Keep a stratified 10% subsample to speed things up
X, X_test, Y, Y_test = train_test_split(x_train, y_train, stratify=y_train, test_size=0.9)
- Instantiate and fit the model (same interface as sklearn estimators)
model = DimReductionClustering(n_components=2, min_dist=0.000001, score_metric='silhouette', knn_topk=8, min_pts=4).fit(X)
Fitting also returns the epsilon found using the elbow method (see section 4.2).
- Show the 2D plot :
model.display_plotly()
- Get the score (Silhouette coefficient here)
model.score()
4. Hyperparameters matter
4.1 UMAP (dim reduction)
- `n_neighbors` (global/local tradeoff; default: 15, typical range from 2 to 1/4 of the data size): a low value glues small chains together (more local view), a high value glues big chains together (more global view).
- `min_dist` (0 to 0.99): the minimum distance apart that points are allowed to be in the low-dimensional representation. Low values of `min_dist` result in clumpier embeddings, which is useful if you are interested in clustering or in finer topological structure. Larger values of `min_dist` prevent UMAP from packing points together and focus on preserving the broad topological structure instead.
- `n_components`: dimension of the low-dimensional space, usually 2 or 3.
- `metric` (`'euclidean'` by default): for NLP, `'cosine'` is a good choice since infrequent and frequent words have vectors of different magnitudes.
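As a quick illustration of these knobs, here is a short umap-learn snippet (the values are examples, not recommendations, and `X` is assumed to be your data matrix):

```python
import umap

reducer = umap.UMAP(
    n_neighbors=15,   # global/local tradeoff: lower = more local structure
    min_dist=0.0,     # near zero packs points tightly, handy before clustering
    n_components=2,   # embed into 2D for plotting
    metric='cosine',  # e.g. for NLP; the default is 'euclidean'
)
embedding = reducer.fit_transform(X)
```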
4.2 DBSCAN (clustering)
- `min_pts`: MinPts ≥ 3. Basic rule: 2 × dimension (4 for 2D, 6 for 3D). Use higher values for noisy data.
- Epsilon: the maximum distance between two samples for one to be considered as in the neighborhood of the other. It is found numerically from a k-distance graph: compute the distance to the k-th nearest neighbor for each point, sort the results in descending order, and locate the elbow by orthogonal projection onto the line between the first and last points of the graph; the y-coordinate of the point maximizing d((x,y), Proj(x,y)) is the optimal epsilon (a sketch follows this list). Note that there is no epsilon hyperparameter in the implementation, only the k-th neighbor for KNN.
- `knn_topk`: the k-th nearest neighbor used for the k-distance graph. Between 3 and 20.
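The following is a minimal sketch of the elbow search described above. The `estimate_epsilon` helper is written here for illustration (it is not the package's internal function), and `embedding` is assumed to be the UMAP output:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def estimate_epsilon(embedding, k=8):
    # k-distance graph: distance from each point to its k-th nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
    dists, _ = nn.kneighbors(embedding)   # column 0 is the point itself
    kth = np.sort(dists[:, -1])[::-1]     # sort descending

    # Line between the first and last points of the graph
    n = len(kth)
    pts = np.column_stack([np.arange(n), kth])
    start, end = pts[0], pts[-1]
    direction = (end - start) / np.linalg.norm(end - start)

    # Orthogonal distance of every point to that line; the farthest is the elbow
    rel = pts - start
    proj = np.outer(rel @ direction, direction)
    ortho = np.linalg.norm(rel - proj, axis=1)
    return kth[np.argmax(ortho)]          # y-coordinate at the elbow
```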
4.3 Score metric
- Silhouette coefficient
- Calinski-Harabasz
- DBCV: very slow for a high number of features (see https://github.com/christopherjenness/DBCV for details).
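For reference, these metrics can also be computed directly, assuming `embedding` and `labels` from the sketches above (this shows the underlying functions, not necessarily how the package calls them):

```python
from scipy.spatial.distance import euclidean
from sklearn.metrics import silhouette_score, calinski_harabasz_score
from DBCV import DBCV  # from the repository installed above

mask = labels != -1  # drop DBSCAN noise points before scoring
print(silhouette_score(embedding[mask], labels[mask]))
print(calinski_harabasz_score(embedding[mask], labels[mask]))
print(DBCV(embedding[mask], labels[mask], dist_function=euclidean))
```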
5. BayesSearch example
!pip install scikit-optimize
from skopt.space import Integer, Real
from skopt import BayesSearchCV
search_space = list()
#UMAP Hyperparameters
search_space.append(Integer(5, 200, name='n_neighbors', prior='uniform'))
search_space.append(Real(0.0000001, 0.2, name='min_dist', prior='uniform'))
#Search epsilon with KNN Hyperparameters
search_space.append(Integer(3, 20, name='knn_topk', prior='uniform'))
#DBSCAN Hyperparameters
search_space.append(Integer(4, 15, name='min_pts', prior='uniform'))
params = {space.name: space for space in search_space}
# Clustering is unsupervised: use a single "split" where the model is
# fit and scored on the same full dataset
train_indices = list(range(X.shape[0]))
test_indices = list(range(X.shape[0]))
cv = [(train_indices, test_indices)]
clf = BayesSearchCV(estimator=DimReductionClustering(), search_spaces=params, n_jobs=-1, cv=cv)
clf.fit(X)
clf.best_params_  # best hyperparameters found
clf.best_score_   # corresponding score