Buckshot++ is a new algorithm that finds highly stable clusters efficiently.

Buckshot++: An Outlier-Resistant and Scalable Clustering Algorithm. (Inspired by the Buckshot Algorithm.)

Here, we introduce a new algorithm, which we name Buckshot++. Buckshot++ improves upon k-means by addressing its main shortcoming, namely the need to predetermine the number of clusters, K. Typically, K is found in the following manner:

settle on some metric,
evaluate that metric at multiple values of K,
use a greedy stopping rule to determine when to stop (typically the bend in an elbow curve).
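The three steps above can be sketched as follows. This is only an illustration of the conventional procedure, not part of Buckshot++: it assumes scikit-learn's KMeans, synthetic data with four well-separated blobs, inertia as the metric, and a hypothetical helper `find_elbow` whose threshold `min_gain` is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def inertia_by_k(X, k_values, seed=0):
    """Fit k-means once per candidate K and record the inertia (within-cluster SSE)."""
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
            for k in k_values]


def find_elbow(k_values, inertias, min_gain=0.2):
    """Greedy stopping rule (hypothetical): return the first K for which adding
    one more cluster reduces inertia by less than `min_gain` as a fraction."""
    for i in range(len(k_values) - 1):
        gain = (inertias[i] - inertias[i + 1]) / inertias[i]
        if gain < min_gain:
            return k_values[i]
    return k_values[-1]


# Synthetic data: four well-separated blobs (an assumption for illustration).
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=400, centers=centers, cluster_std=1.0, random_state=0)

k_values = list(range(1, 9))
inertias = inertia_by_k(X, k_values)   # step 2: evaluate the metric at each K
chosen_k = find_elbow(k_values, inertias)  # step 3: greedy stopping rule
print(chosen_k)
```

The result depends on the threshold, which is one reason this kind of rule can be hit or miss on less clean data.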

There must be a better way. We detail the following 3 improvements that the Buckshot++ algorithm makes to k-means.

Not all metrics are created equal. K-means doesn't prescribe which metric to use for finding K, and we found that some of the commonly implemented metrics are too inconsistent from one iteration to the next. Buckshot++ prescribes the silhouette score for finding K.
In k-means, every single point is clustered -- even the noise and outliers. But what we really care about is the pattern and not the noise. We show here an elegant way to overcome this problem -- even simpler than k-medoids or k-medians.
Finally, the computational complexity of running k-means multiple times on the whole dataset to find the best K can be prohibitive. We show below a surprisingly simple alternative with better asymptotics.

Details of the Buckshot++ algorithm

ALGORITHM: Buckshot++
INPUTS: population of N vectors
B := number of bootstrap samples
F := max number of clusters to try
M := cluster quality metric
OUTPUT: the optimal K for kmeans

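Take B bootstrap samples where each sample is of size 1/B (a fraction 1/B of the population, i.e. N/B vectors).
for each bootstrap sample do
    for each counter k from 2 to F do
        Compute kmeans with k centers.
        Compute the metric M on the clusters.
    Collect the metric values for k = 2..F into one metrics vector.
Compute the centroid of all metrics vectors.
Get argmax of the centroid vector; the corresponding k is the output K.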
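The pseudocode above might be sketched in Python roughly as below. This is a minimal sketch under stated assumptions, not the implementation in the buckshotpp module: `buckshot_pp` is a hypothetical function name, each bootstrap sample is taken to hold N // B rows drawn with replacement, the metric defaults to scikit-learn's `silhouette_score`, and the data is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def buckshot_pp(X, B=5, F=8, metric=silhouette_score, seed=0):
    """Hypothetical sketch: return (best_k, mean_scores) for k = 2..F."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sample_size = n // B                      # assumed reading of "size 1/B"
    k_values = np.arange(2, F + 1)
    scores = np.empty((B, len(k_values)))     # one metrics vector per sample
    for b in range(B):
        idx = rng.integers(0, n, size=sample_size)   # draw with replacement
        sample = X[idx]
        for j, k in enumerate(k_values):
            km = KMeans(n_clusters=int(k), n_init=10, random_state=seed)
            labels = km.fit_predict(sample)
            scores[b, j] = metric(sample, labels)
    mean_scores = scores.mean(axis=0)         # centroid of the metrics vectors
    best_k = int(k_values[np.argmax(mean_scores)])
    return best_k, mean_scores


# Synthetic data: four well-separated blobs (an assumption for illustration).
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)
best_k, mean_scores = buckshot_pp(X, B=5, F=8)
print(best_k)
```

Each k-means fit here sees only N // B rows, which is where the saving over fitting the whole dataset for every candidate k would come from.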

Explanation of Buckshot++

The Buckshot++ algorithm was motivated by the Buckshot algorithm, which finds cluster centers by performing hierarchical clustering on a sample and then runs k-means with those centers as the initial centers. Hierarchical clustering has relatively high time complexity, which is why Buckshot performs it only on a sample. The key difference between hierarchical clustering and k-means is that the former is more deterministic/stable but less scalable than the latter, as the next table shows.

%matplotlib inline
import pandas as pd
pd.set_option('display.max_rows', 500)
tbl = pd.DataFrame({'k-means': ['O(N * k * d * i)', 'random initial means; local minimum; outlier'],
                    'hierarchical': ['O(N^2 * logN)', 'outlier']}
                   , index=['Computational Complexity', 'Sources of Instability'])
tbl

	k-means	hierarchical
Computational Complexity	O(N * k * d * i)	O(N^2 * logN)
Sources of Instability	random initial means; local minimum; outlier	outlier

Hierarchical's higher time complexity means that, for large inputs, running k-means multiple times is still faster than running hierarchical just once. The Buckshot algorithm runs hierarchical just once on a small sample in order to initialize cluster centers for k-means. Because O(N^2 * logN) grows quickly with N, the sample must be small for this to be computationally feasible. A key critique of Buckshot, however, is that a small sample can fail to reveal the right structure.
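To get a feel for the gap, the two expressions from the table can be plugged into a rough operation count. The values of N, k, d and i below are illustrative assumptions, and big-O notation hides constant factors, so this is only an order-of-magnitude comparison.

```python
import math


def kmeans_cost(n, k, d, i):
    """Rough operation count for one k-means run: N * k * d * i."""
    return n * k * d * i


def hierarchical_cost(n):
    """Rough operation count for one hierarchical run: N^2 * log2(N)."""
    return n ** 2 * math.log2(n)


# Illustrative values, not taken from the source.
n, k, d, i = 1_000_000, 20, 100, 50

one_kmeans = kmeans_cost(n, k, d, i)        # 1e11
one_hier = hierarchical_cost(n)             # roughly 2e13
runs_affordable = one_hier / one_kmeans     # k-means runs per hierarchical run
print(f"{runs_affordable:.0f} k-means runs cost about one hierarchical run")
```

Under these assumed values, a couple of hundred k-means runs on the full data cost about as much as a single hierarchical run.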

Buckshot++'s key innovation lies in the step "Take B bootstrap samples where each sample is of size 1/B." While Buckshot is doing hierarchical on a sample, Buckshot++ is doing multiple kmeans on bootstrap samples. Doing kmeans many times can still finish sooner than doing hierarchical just once, as the time complexities above show. An added bonus is that bootstrapping is a great way to smooth out noise and improve stability. In fact, that is exactly why Bagging (a.k.a. Bootstrap Aggregating) and Random Forests work so well.

Python implementation of Buckshot++

The core algorithm implementation is in the buckshotpp module. We use it below to cluster a news headlines dataset.

from buckshotpp import Clusterings, plot_mult_samples
from numpy.random import choice
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
import nltk; nltk.download('punkt', quiet=True)
import matplotlib.pyplot as plt; plt.rcParams['figure.dpi'] = 120
import warnings; warnings.filterwarnings('ignore')

vecSpaceMod = Clusterings({'file_loc': 'data/news_headlines.csv',
                           'tf_dampen': True,
                           'common_word_pct': 1,
                           'rare_word_pct': 1,
                           'dim_redu': False}
                         )  # Instantiate a Clusterings object using parameters.
news_df = vecSpaceMod.get_file() # Read news_headlines.csv into a df.
metrics_byK = vecSpaceMod.buckshot(news_df)
plot_mult_samples(metrics_byK, 'silhouette')

An insight from this chart

Each green curve is generated from a bootstrap sample, and the red curve is their average. Remember the sources of instability for k-means listed in the table above? Outliers are one. The concept of an outlier has a somewhat different meaning in the context of clustering. In supervised learning, an outlier is a rare observation that is far from other observations distance-wise. In clustering, a faraway observation is its own well-separated cluster. Our interpretation is that "rare" is the operative word and that outliers are singleton clusters that exert undue influence on the formation of other clusters. Look at how bagging led to a more stable estimate of the optimal number of clusters in the graph above.

Not all metrics are created equal

Two internal clustering metrics implemented in scikit-learn are the Silhouette Coefficient and the Calinski-Harabasz criterion. Comparing the Silhouette plotted above with the Calinski plotted below, it's clear that Calinski is far more extreme, perhaps implausibly extreme.

plot_mult_samples(metrics_byK, 'calinski')
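The difference in scale between the two metrics can also be seen without the plotting helper by scoring the same clusterings with both scikit-learn functions. The sketch below uses synthetic blobs rather than the news headlines, so the numbers are illustrative only; the Silhouette Coefficient is bounded in [-1, 1], while Calinski-Harabasz has no upper bound.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic data: four well-separated blobs (an assumption for illustration).
centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X, _ = make_blobs(n_samples=300, centers=centers, cluster_std=1.0, random_state=0)

rows = []
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    rows.append((k,
                 silhouette_score(X, labels),          # bounded in [-1, 1]
                 calinski_harabasz_score(X, labels)))  # unbounded above

for k, sil, ch in rows:
    print(f"k={k}  silhouette={sil:.3f}  calinski_harabasz={ch:.1f}")

best_k_by_silhouette = max(rows, key=lambda r: r[1])[0]
```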

Internal or External Clustering Metrics?

This data contains a field named "STORY" that indicates which story a headline belongs to. With this field as the ground truth, we compute Adjusted Mutual Information (a widely used external metric) using the code below. The adjusted score is 1 for identical partitions and close to 0 for random labelings (it can dip slightly below 0). Using the K resulting from Buckshot++, we obtained a score of about 0.64, an indicator that the model performance is reasonable.

X = vecSpaceMod.term_weight_matr(news_df.TITLE)
kmeans_fit = KMeans(n_clusters=20).fit(X)  # K comes from the inflection point of the silhouette plot
mutual_info = adjusted_mutual_info_score(labels_true=news_df.STORY, labels_pred=kmeans_fit.labels_) 
mutual_info

0.6435601965984835
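To help interpret a score like this, the toy labelings below (made up for illustration) show two reference points for scikit-learn's `adjusted_mutual_info_score`: it ignores how clusters are numbered, and it gives roughly zero to a clustering that carries no information about the ground truth.

```python
from sklearn.metrics import adjusted_mutual_info_score

# Toy ground truth with three stories of three headlines each.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same grouping as the truth, but with different cluster ids.
same_partition_renamed = [2, 2, 2, 0, 0, 0, 1, 1, 1]

# Everything lumped into a single cluster: no information about the stories.
one_big_cluster = [0] * 9

ami_renamed = adjusted_mutual_info_score(truth, same_partition_renamed)
ami_trivial = adjusted_mutual_info_score(truth, one_big_cluster)
print(ami_renamed, ami_trivial)
```

A value of about 0.64 therefore sits between an uninformative clustering and a perfect recovery of the stories.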

Practically, does Buckshot++ produce well-separated clusters?

Taking a look at the documents and their corresponding "predictedCluster", the results certainly do seem reasonable.

cluster_results = pd.DataFrame({'predictedCluster': kmeans_fit.labels_,
                                'document': news_df.TITLE})
cluster_results.sort_values(by='predictedCluster', inplace=True)

cluster_results

	predictedCluster	document
25	0	SAC Capital Starts Anew as Point72
50	0	Zebra Technologies to Acquire Enterprise Busin...
23	0	Fine Tuning: Good Wife just gets better
21	0	Boulder's Wealth May Be A Factor For Lowest Ob...
6	0	Power restored to nuclear plant in Waterford, ...
73	0	Electricity out as Millstone shifts to diesel
59	1	Twitter's head of media Chloe Sladden steps do...
28	1	Twitter's revolving door: media head Chloe Sla...
12	1	Twitter Exec Exodus Continues with Media Chief...
67	2	Sony Xperia C3 arrives with 5MP selfie camera,...
30	2	Leaked: Images Of Sony's Xperia C3 'Selfie Phone'
45	2	Sony Xperia Z2 Encased In A Block Of Ice, Cont...
90	2	Sony Xperia Z4 Concept Emerges as Fan Imagines...
78	2	If you hate the word 'selfie' look away now, t...
71	3	Twitter Executive Quits Amid Stalling Growth
47	3	Twitter COO quits, signalling management shake-up
52	3	Twitter Loses a Powerful Executive
31	3	Second Twitter executive quits hours after Row...
20	3	Twitter COO resigns as growth lags
61	3	Twitter COO Rowghani resigns amid lacklustre g...
57	4	'Goodbye Twitter' COO Ali Rowghani, says bye t...
69	4	Twitter chief operating officer resigns as use...
66	4	UPDATE 3-Twitter chief operating officer resig...
86	4	Twitter chief operating officer Ali Rowghani h...
76	4	Ali Rowghani, Twitter's COO, resigns after mon...
49	4	Twitter COO Ali Rowghani Just Announced Via Tw...
13	4	Twitter COO Ali Rowghani Exits
35	4	Second Twitter exec resigns with goodbye tweet...
39	5	Why almost everything you've been told about u...
77	5	Why Fargo Works So Well as a TV Show
0	6	'Mad Men' Preview: Buckle Up For 7 'Dense' Epi...
4	6	'Mad Men' end in sight for Weiner
36	6	Weiner reflects on the beginning of the end of...
42	7	Giant mystery crater in Siberia has scientists...
85	7	Mysterious giant crater in the earth discovere...
60	7	Massive Crater Discovered in Siberia
92	7	Massive mystery crater at 'end of the world'
16	7	Mysterious crater in Siberia spawns wild Inter...
43	8	Inflation rise stalls wage hopes in the UK
82	8	The Least Obese City in the Country
19	8	Real wages could resume fall as "Easter effect...
55	8	UK Inflation Rise To 1.8% Delays Real Wage Ris...
26	8	Virginia's Governor Challenges Abortion Clinic...
51	8	BREAKING NEWS: Transport costs lead to hike in...
8	8	Cable prices climb 4 times faster than inflati...
79	9	Despite Safety Issues, GM's Sales Still Increa...
17	9	Chrysler Group LLC reports June 2014 US sales ...
40	9	GM June Sales Up 9 Percent, Best June Since 2007
87	9	Ford sales fall, GM barely even; Jeep powers C...
18	10	Gov. McAuliffe Makes Health Announcements
48	10	Microsoft wants Windows XP dead and has announ...
74	10	McAuliffe puts focus on women's health
7	11	Sony makes duckfacing official with Xperia C3,...
54	11	Sony to announce 'Selfie' phone on July 8th wi...
27	11	Sony prepares to launch a smartphone that has ...
91	11	Sony Xperia C3 launches as "world's best selfi...
88	11	Sony unveils Xperia C3 smartphone with LED fla...
11	11	Sony Xperia C3 Boasts 5MP "PROselfie" Front-fa...
44	12	UK CPI rises to 1.8% in April, core CPI hits 2%
75	12	Rising CO2 Levels Will Lower Nutritional Value...
1	12	Here's How Climate Change Will Make Food Less ...
81	12	Rising CO2 levels also make our food less nutr...
80	13	Nutrition in Crops Are Cut down Drastically by...
2	13	Rising carbon dioxide levels reduce nutrients ...
68	13	With carbon dioxide levels up, nutrients in cr...
64	14	Inflation back up: Modest rise to 1.8% in Apri...
83	14	US plants prepare for long-term nuclear waste ...
22	14	Nuclear Plant Operators Deal With Radioactive ...
32	14	US plants prepare long-term nuclear waste stor...
84	15	'Mad Men' takes off on its final flight
3	15	'Mad Men' mixology
5	15	'Mad Men': 7 things to know for Season 7
9	15	Mad Men - the (Blaxploitation) Movie
37	15	TV Review: Mad Men Season 7
46	15	'Mad Men': Season 7 Premiere Guide (Video)
70	15	10 Things You Never Knew About 'Mad Men'!
53	15	'Mad Men' Season 7 Spoilers: Everything We Kno...
72	15	Rich Sommer from AMC's 'Mad Men' Season Premiere
63	16	Fargo (FX) Season Finale 2014 "Morton's Fork"
56	16	Before 'Fargo's' season finale, a sequel (or p...
65	16	'Fargo' Season 1 Spoilers: Episode 10 Synopsis...
62	17	Google Glass headsets get new designs in colla...
41	17	Google's first fashionable Glass frames are de...
89	17	Google Glass Still Trying To Look Cool
34	17	Net-a-Porter Embraces Google Glass
15	18	Routine pelvic exams not recommended under new...
14	18	Doctors group nixes routine pelvic exams
38	18	Metro Detroit doctors wary of recommendation a...
10	18	Doctors against having frequent pelvic exams
58	19	Technology stocks falling for 2nd day in a row
24	19	UPDATE 5-JPMorgan profit weaker than expected ...
29	19	JPMorgan profit weaker than expected
33	19	Marks and Spencer's profits fall for third year

Summary of the key advantages of Buckshot++

Accurate method of estimating the number of clusters (a clearly best Silhouette emerged every time, while typical elbow heuristic searches can hit or miss).
Scalable (faster search for K achieved by using k-means rather than hierarchical; running k-means on subsample rather than everything).
Noise resistant when used in conjunction with k-means++ (sampling with replacement lessens the chance of selecting an outlier in the bootstrap sample).
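The noise-resistance point can be made concrete with a short calculation. Assuming each bootstrap sample holds N/B draws with replacement (the reading of "size 1/B" used here) and illustrative values for N and B, the chance that one particular outlier lands in a given sample is small compared with a classic full-size bootstrap.

```python
def inclusion_probability(n, m):
    """Chance that one given point appears at least once in m draws
    with replacement from a population of n points."""
    return 1.0 - (1.0 - 1.0 / n) ** m


n = 10_000   # population size (illustrative)
B = 10       # number of bootstrap samples (illustrative)

p_small = inclusion_probability(n, n // B)   # sample of size N/B
p_full = inclusion_probability(n, n)         # classic full-size bootstrap
print(f"{p_small:.3f} vs {p_full:.3f}")
```

With these assumed values a given outlier shows up in under a tenth of the small samples, so averaging across samples dilutes its influence on the estimated K.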

Owner

John Jung
