
## Dunn index and DB index — Cluster Validity Indices | Set 1

## Why do we need cluster validity indices?

Different performance metrics are used to evaluate different machine learning algorithms. For a classification problem, we have a variety of performance measures to assess how good our model is. For cluster analysis, the analogous question is: how do we assess the "goodness" of the resulting clusters? Cluster validity is measured for several purposes:

- To compare clustering algorithms.
- To compare two sets of clusters.
- To compare two clusters, i.e., which one is better in terms of compactness and connectedness.

In general, cluster validity measures are classified into three classes:

- Internal cluster validation: the clustering result is evaluated based on the clustered data itself (internal information), without reference to external information.
- External cluster validation: the clustering result is evaluated against some externally known result, such as externally provided class labels.
- Relative cluster validation: the clustering results are evaluated by varying different parameters for the same algorithm (e.g., changing the number of clusters).

In addition to the term cluster validity index, we need to know the inter-cluster distance d(a, b) between two clusters a and b, and the intra-cluster distance D(a) of a cluster a.

The inter-cluster distance d(a, b) between two clusters a and b can be:

**Single linkage distance**: the closest distance between two objects belonging to a and b respectively.

**Complete linkage distance**: the distance between the two most remote objects belonging to a and b respectively.

**Average linkage distance**: the average distance between all the objects belonging to a and b respectively.

**Centroid linkage distance**: the distance between the centroids of the two clusters a and b.

The intra-cluster distance D(a) of a cluster a can be:

**Complete diameter distance**: the distance between the two most remote objects belonging to cluster a.

**Average diameter distance**: the average distance between all the objects belonging to cluster a.

**Centroid diameter distance**: twice the average distance between all the objects and the centroid of cluster a.
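In code, these linkage and diameter definitions can be sketched with `numpy`/`scipy` on two made-up clusters (the points here are purely illustrative):

```
import numpy as np
from scipy.spatial.distance import cdist, pdist

# two small example clusters (hypothetical data)
a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
b = np.array([[4.0, 4.0], [5.0, 4.0]])

pairwise = cdist(a, b)                # all distances between members of a and b

single_link = pairwise.min()          # closest pair across the two clusters
complete_link = pairwise.max()        # most remote pair across the two clusters
average_link = pairwise.mean()        # mean over all cross-cluster pairs
centroid_link = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

# intra-cluster distances of cluster a
complete_diameter = pdist(a).max()    # two most remote members of a
average_diameter = pdist(a).mean()    # mean over all pairs within a
centroid_diameter = 2 * np.linalg.norm(a - a.mean(axis=0), axis=1).mean()
```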

Now let's talk about two internal cluster validity indices, namely the Dunn index and the DB index.

## Dunn's index

The Dunn Index (DI), introduced by J. C. Dunn in 1974, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, in which the result is based on the clustered data itself. Like other indices of this type, the aim of the Dunn index is to identify sets of clusters that are compact, with low variance between cluster members, and well separated, where the means of different clusters are sufficiently far apart compared to the within-cluster variance.

The higher the value of the Dunn index, the better the clustering. The number of clusters that maximizes the Dunn index is taken as the optimal number of clusters k. It also has some drawbacks: as the number of clusters and the dimensionality of the data increase, so does the computational cost.
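Formally, for `n_c` clusters, with `d(c_i, c_j)` the inter-cluster distance and `diam(c_k)` the intra-cluster diameter defined above, the Dunn index is:

```
D = \min_{1 \le i < j \le n_c} \left\{ \frac{d(c_i, c_j)}{\max_{1 \le k \le n_c} \operatorname{diam}(c_k)} \right\}
```

The numerator rewards well-separated clusters; the denominator penalizes large, loose clusters.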

The Dunn index for c clusters can be computed (Python 3) as follows, using the `jqmcvi` package:

```
import pandas as pd
from sklearn import datasets
from jqmcvi import base

# loading the dataset
X = datasets.load_iris()
df = pd.DataFrame(X.data)

# K-Means
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(df)  # K-means training
y_pred = k_means.predict(df)

# We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Species']

# we merge this dataframe with df
prediction = pd.concat([df, pred], axis=1)

# We store the clusters
clus0 = prediction.loc[prediction.Species == 0]
clus1 = prediction.loc[prediction.Species == 1]
clus2 = prediction.loc[prediction.Species == 2]
cluster_list = [clus0.values, clus1.values, clus2.values]

print(base.dunn(cluster_list))
```

## Dunn index GitHub

```
import numpy as np
from sklearn.preprocessing import LabelEncoder

DIAMETER_METHODS = ['mean_cluster', 'farthest']
CLUSTER_DISTANCE_METHODS = ['nearest', 'farthest']


def inter_cluster_distances(labels, distances, method='nearest'):
    """Calculates the distances between the two nearest points of each cluster.

    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: `nearest` for the distances between the two nearest points in each cluster, or `farthest`
    """
    if method not in CLUSTER_DISTANCE_METHODS:
        raise ValueError(
            'method must be one of {}'.format(CLUSTER_DISTANCE_METHODS))

    if method == 'nearest':
        return __cluster_distances_by_points(labels, distances)
    elif method == 'farthest':
        return __cluster_distances_by_points(labels, distances, farthest=True)


def __cluster_distances_by_points(labels, distances, farthest=False):
    n_unique_labels = len(np.unique(labels))
    cluster_distances = np.full((n_unique_labels, n_unique_labels),
                                float('inf') if not farthest else 0)

    np.fill_diagonal(cluster_distances, 0)

    for i in np.arange(0, len(labels) - 1):
        for ii in np.arange(i, len(labels)):
            if labels[i] != labels[ii] and (
                    (not farthest and
                     distances[i, ii] < cluster_distances[labels[i], labels[ii]])
                    or (farthest and
                        distances[i, ii] > cluster_distances[labels[i], labels[ii]])):
                cluster_distances[labels[i], labels[ii]] = cluster_distances[
                    labels[ii], labels[i]] = distances[i, ii]
    return cluster_distances


def diameter(labels, distances, method='farthest'):
    """Calculates cluster diameters

    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param method: either `mean_cluster` for the mean distance between all elements in each cluster, or `farthest` for the distance between the two points furthest from each other
    """
    if method not in DIAMETER_METHODS:
        raise ValueError('method must be one of {}'.format(DIAMETER_METHODS))

    n_clusters = len(np.unique(labels))
    diameters = np.zeros(n_clusters)

    if method == 'mean_cluster':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii]:
                    diameters[labels[i]] += distances[i, ii]

        for i in range(len(diameters)):
            diameters[i] /= sum(labels == i)

    elif method == 'farthest':
        for i in range(0, len(labels) - 1):
            for ii in range(i + 1, len(labels)):
                if labels[i] == labels[ii] and distances[i, ii] > diameters[labels[i]]:
                    diameters[labels[i]] = distances[i, ii]
    return diameters


def dunn(labels, distances, diameter_method='farthest',
         cdist_method='nearest'):
    """
    Dunn index for cluster validation (larger is better).

    .. math:: D = \\min_{i = 1 \\ldots n_c; j = i + 1 \\ldots n_c}
        \\left\\lbrace \\frac{d \\left( c_i,c_j \\right)}
        {\\max_{k = 1 \\ldots n_c} \\left(diam \\left(c_k \\right) \\right)}
        \\right\\rbrace

    where :math:`d(c_i,c_j)` represents the distance between clusters
    :math:`c_i` and :math:`c_j`, and :math:`diam(c_k)` is the diameter of
    cluster :math:`c_k`.

    Inter-cluster distance can be defined in many ways, such as the distance
    between cluster centroids or between their closest elements. Cluster
    diameter can be defined as the mean distance between all elements in the
    cluster, between all elements to the cluster centroid, or as the distance
    between the two furthest elements.

    The higher the value of the resulting Dunn index, the better the
    clustering result is considered, since higher values indicate that
    clusters are compact (small :math:`diam(c_k)`) and far apart (large
    :math:`d \\left( c_i,c_j \\right)`).

    :param labels: a list containing cluster labels for each of the n elements
    :param distances: an n x n numpy.array containing the pairwise distances between elements
    :param diameter_method: see :py:function:`diameter` `method` parameter
    :param cdist_method: see :py:function:`diameter` `method` parameter
    """
    labels = LabelEncoder().fit(labels).transform(labels)

    ic_distances = inter_cluster_distances(labels, distances, cdist_method)
    min_distance = min(ic_distances[ic_distances.nonzero()])
    max_diameter = max(diameter(labels, distances, diameter_method))

    return min_distance / max_diameter


if __name__ == '__main__':
    from sklearn.metrics.pairwise import euclidean_distances
    from sklearn.datasets import load_iris
    from sklearn.cluster import KMeans

    data = load_iris()
    kmeans = KMeans(n_clusters=3)
    c = data['target']
    x = data['data']
    k = kmeans.fit_predict(x)
    d = euclidean_distances(x)

    for diameter_method in DIAMETER_METHODS:
        for cdist_method in CLUSTER_DISTANCE_METHODS:
            dund = dunn(c, d, diameter_method, cdist_method)
            dunk = dunn(k, d, diameter_method, cdist_method)
            print(diameter_method, cdist_method, dund, dunk)
```

This is a Python implementation of the Dunn index, used to evaluate clustering results. While nominally I followed the reference below, Wikipedia was much more helpful in this endeavor. I also want to thank the people who used the code and commented on the GitHub gist where the code above is hosted; I have received many bug reports and helpful suggestions.

Kovács, F.; Legány, C. & Babos, A. Cluster validity measurement techniques. 6th International Symposium of Hungarian Researchers on Computational Intelligence, 2005.

`nearest` measures the distance between two clusters as the distance between the two closest points of each cluster. `farthest` does something similar, but uses the two points in each cluster that are farthest from each other.

`DIAMETER_METHODS = ['mean_cluster', 'farthest']` are the two methods for calculating a cluster's diameter that I implemented in the code. `mean_cluster` uses the average distance between all the elements in each cluster as the diameter, and `farthest` uses the distance between the two points inside the same cluster that are farthest from each other.
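The difference between the two diameter methods is easy to see on a toy cluster with one outlier (an illustrative sketch; the points, and the use of a plain pairwise mean, are mine rather than the gist's exact arithmetic):

```
import numpy as np
from scipy.spatial.distance import pdist

# one cluster of three points, the last one an outlier
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])

pairwise = pdist(pts)            # pairwise distances: 1, 10, 9

farthest = pairwise.max()        # 'farthest' diameter: dominated by the outlier
mean_cluster = pairwise.mean()   # 'mean_cluster' diameter: smoothed by averaging
```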

## Optimizing Dunn Index calculation

### StackOverflow question

The Dunn Index is a method of evaluating clustering: a higher value is better. It is calculated as the lowest intercluster distance (i.e., the smallest distance between any two cluster centroids) divided by the highest intracluster distance (i.e., the largest distance between any two points in any cluster).

I have a code snippet for calculating the Dunn Index:

```
from math import inf

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    # `distance` and `point.get_cluster` are helpers defined elsewhere in my code
    numerator = inf
    for c in cf:  # for each cluster
        for t in cf:  # for each cluster
            if t is c:
                continue  # if same cluster, ignore
            # find distance between centroids
            numerator = min(numerator, distance(t, c))
    denominator = 0
    for c in cf:  # for each cluster
        for p in pf:  # for each point
            if p.get_cluster() is not c:
                continue  # if point not in cluster, ignore
            for t in pf:  # for each point
                if t.get_cluster() is not c:
                    continue  # if point not in cluster, ignore
                if t is p:
                    continue  # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator / denominator
```

The problem is that this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even the best case, where the data is already ordered by cluster, is around 25 million distance calculations. I want to shave time off it; I've already tried rectilinear distance instead of Euclidean, and it doesn't help.
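The worst-case figure can be sanity-checked with a little arithmetic: the denominator's triple loop visits every (cluster, point, point) combination, so:

```
n_points = 5000
n_clusters = 15

# numerator: every ordered pair of distinct centroids
numerator_iters = n_clusters * (n_clusters - 1)       # 210

# denominator worst case: every (cluster, point, point) triple is visited
denominator_iters = n_clusters * n_points * n_points  # 375,000,000

total = numerator_iters + denominator_iters           # "just over 375 million"
```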

How can I improve this algorithm?

### Answer:

**TL;DR**: Importantly, the problem is set up in **two dimensions**. For high dimensions, these techniques can be ineffective.

In 2D, we can compute the diameter (intracluster distance) of each cluster in `O(n log n)` time, where `n` is the cluster size, using convex hulls. Vectorization is used to speed up the remaining operations. There are two possible asymptotic improvements mentioned at the end of the post, contributions welcome ;)

Setup and fake data:

```
import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt
# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
```


Next, we define a `diameter` function for computing the largest intracluster distance, based on this approach using the convex hull.

```
# compute the diameter based on convex hull
def diameter(pts):
    if pts.shape[0] <= 1:
        return 0
    if pts.shape[0] == 2:
        return (((pts[0] - pts[1]) ** 2).sum()) ** 0.5
    # need at least 3 points to construct the convex hull;
    # the two points which are furthest apart occur as vertices of the hull
    hull = spatial.ConvexHull(pts)
    candidates = pts[hull.vertices]
    return spatial.distance_matrix(candidates, candidates).max()
```
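As a quick sanity check (my own sketch, not part of the original answer), the hull-based diameter agrees with the brute-force `O(n^2)` maximum pairwise distance on random 2D data:

```
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
pts = rng.random((500, 2))

# hull-based diameter, as in the function above
hull = spatial.ConvexHull(pts)
candidates = pts[hull.vertices]
d_hull = spatial.distance_matrix(candidates, candidates).max()

# brute-force O(n^2) diameter for comparison
d_brute = spatial.distance_matrix(pts, pts).max()
```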

For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.

If the number of clusters is large, the following solution based on Pandas may perform well:

```
import pandas as pd

def dunn_index_pandas(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    # `diameter_pandas` is a per-group wrapper around `diameter` (not shown in the excerpt)
    max_intracluster_dist = pd.DataFrame(pts).groupby(labels).agg(diameter_pandas)[0].max()
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist
```

Otherwise, we can continue with a pure `numpy` solution.

```
def dunn_index(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    max_intracluster_dist = max(diameter(pts[labels == i]) for i in np.unique(labels))
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist

%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms
```

For `1000` clusters with i.i.d. `~U[1, 1000]` cluster sizes, this takes 2.2 seconds on my machine. The number drops to 0.8 seconds with the Pandas approach for this example (many small clusters).

There are two further optimization opportunities that are relevant when the number of clusters is large:

First, I am computing the minimal intercluster distance with a brute-force `O(k^2)` approach, where `k` is the number of clusters. This can be reduced to `O(k log(k))`, as discussed here.

Second, `max(diameter(pts[labels==i]) for i in np.unique(labels))` requires `k` passes over an array of size `n`. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect it can be optimized a lot further. For the current parameters, roughly one third of compute time is spent outside of computing intercluster or intracluster distances.
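For the first point, one way to avoid the full `k x k` centroid matrix is a k-d tree nearest-neighbor query; this sketch (mine, not from the original answer) checks it against the brute-force result:

```
import numpy as np
from scipy import spatial

rng = np.random.default_rng(0)
centroids = rng.random((1000, 2))

# brute force O(k^2): full pairwise matrix with the diagonal masked out
dmat = spatial.distance_matrix(centroids, centroids)
np.fill_diagonal(dmat, np.inf)
min_brute = dmat.min()

# ~O(k log k): build a k-d tree, then query each centroid's nearest neighbor
tree = spatial.cKDTree(centroids)
dists, _ = tree.query(centroids, k=2)  # k=2: the closest hit is the point itself
min_tree = dists[:, 1].min()           # minimum over all nearest-neighbor distances
```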
