Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Last update: Jul 9, 2022

Related tags

Data Analysis python machine-learning multithreading dimensionality-reduction feature-engineering embedding denoising laplacian-eigenmaps

Overview

Fast Laplacian Eigenmaps in python

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python. Comes with an wrapper for NMSlib to compute approximate-nearest-neighbors. Performs several times faster than the default scikit-learn implementation.

Installation

You'll need NMSlib for using this package properly. Installing it with no binaries is recommended if your CPU supports advanced instructions (it problably does):

pip3 install --no-binary :all: nmslib
# Along with requirements:
pip3 install numpy pandas scipy scikit-learn

Then you can install this package with pip:

pip3 install fastlapmap

Usage

See the following example with the handwritten digits data. Here, I visually compare results from the scikit-learn Laplacian Eigenmaps implementation to those from my implementation.

Note that this implementation contains two similarity-learning algorithms: anisotropic diffusion maps and fuzzy simplicial sets.

# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import SpectralEmbedding
from fastlapmap import LapEigenmap

# Load some data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS=2
N_NEIGHBORS=10
N_JOBS=10

sk_se = SpectralEmbedding(n_components=N_EIGS, n_neighbors=N_NEIGHBORS, n_jobs=N_JOBS).fit_transform(data)

flapmap_diff = LapEigenmap(data, n_eigs=2, similarity='diffusion', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)
flapmap_fuzzy = LapEigenmap(data, n_eigs=2, similarity='fuzzy', norm_laplacian=True, k=N_NEIGHBORS, n_jobs=N_JOBS)

fig, (ax1, ax2, ax3) = plt.subplots(1, 3)
fig.suptitle('Handwritten digits data:', fontsize=24)
ax1.scatter(sk_se[:, 0], sk_se[:, 1], c=digits.target, cmap='Spectral', s=5)
ax1.set_title('Sklearn\'s Laplacian Eigenmaps', fontsize=20)
ax2.scatter(flapmap_diff[:, 0], flapmap_diff[:, 1], c=digits.target, cmap='Spectral', s=5)
ax2.set_title('Fast Laplacian Eigenmaps with diffusion harmonics', fontsize=20)
ax3.scatter(flapmap_fuzzy[:, 0], flapmap_fuzzy[:, 1], c=digits.target, cmap='Spectral', s=5)
ax3.set_title('Fast Laplacian Eigenmaps with fuzzy simplicial sets', fontsize=20)
plt.show()

As we can see, results are nearly identical.

Benchmark

See the runtime comparison between this implementation and scikit-learn:

## Load benchmark function:
from fastlapmap.benchmark import runtime_benchmark

# Load data
from sklearn.datasets import load_digits
digits = load_digits()
data = digits.data

# Define hyperparameters
N_EIGS = 2
N_NEIGHBORS = 10
N_JOBS = 10
SIZES = [1000, 5000, 10000, 25000, 50000, 100000]
N_RUNS = 3

runtime_benchmark(data,
                  n_eigs=N_EIGS,
                  n_neighbors=N_NEIGHBORS,
                  n_jobs=N_JOBS,
                  sizes=SIZES,
                  n_runs=N_RUNS)

As you can see, the diffusion harmoics model is the fastest, followed closely by fuzzy simplicial sets. Both outperform scikit-learn default implementation and escalate linearly with sample size.

TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistics, readability metrics, and metrics related to dependency distance.

150 Dec 30, 2022

Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

759 Jan 7, 2023

Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

Meltano is open source, self-hosted, CLI-first, debuggable, and extensible. Pipelines are code, ready to be version c

625 Jan 2, 2023

OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

Overview OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase

3 Feb 12, 2022

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

1 Nov 17, 2021

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

3.7k Jan 3, 2023

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

🧪📈 🐍. The purpose of the panel-chemistry project is to make it really easy for you to do DATA ANALYSIS and build powerful DATA AND VIZ APPLICATIONS within the domain of Chemistry using using Python and HoloViz Panel.

97 Dec 8, 2022

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code. Tuplex has similar Python APIs to Apache Spark or Dask, but rather than invoking the Python interpreter, Tuplex generates optimized LLVM bytecode for the given pipeline and input data set.

791 Jan 4, 2023

Python data processing, analysis, visualization, and data operations

Python This is a Python data processing, analysis, visualization and data operations of the source code warehouse, book ISBN: 9787115527592 Descriptio

1 Jan 16, 2022

Open-source Laplacian Eigenmaps for dimensionality reduction of large data in python.

Related tags

Overview

Fast Laplacian Eigenmaps in python

Installation

Usage

Benchmark

You might also like...

TextDescriptives - A Python library for calculating a large variety of statistics from text

Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Meltano: ELT for the DataOps era. Meltano is open source, self-hosted, CLI-first, debuggable, and extensible.

OpenARB is an open source program aiming to emulate a free market while encouraging players to participate in arbitrage in order to increase working capital.

This is an example of how to automate Ridit Analysis for a dataset with large amount of questions and many item attributes

Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.

🧪 Panel-Chemistry - exploratory data analysis and build powerful data and viz tools within the domain of Chemistry using Python and HoloViz Panel.

Tuplex is a parallel big data processing framework that runs data science pipelines written in Python at the speed of compiled code

Python data processing, analysis, visualization, and data operations

Owner

CleanX is an open source python library for exploring, cleaning and augmenting large datasets of X-rays, or certain other types of radiological images.

Elementary is an open-source data reliability framework for modern data teams. The first module of the framework is data lineage.

VevestaX is an open source Python package for ML Engineers and Data Scientists.

yt is an open-source, permissively-licensed Python library for analyzing and visualizing volumetric data.

Spaghetti: an open-source Python library for the analysis of network-based spatial data

Open source platform for Data Science Management automation

An ETL Pipeline of a large data set from a fictitious music streaming service named Sparkify.

Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

HyperSpy is an open source Python library for the interactive analysis of multidimensional datasets

PCAfold is an open-source Python library for generating, analyzing and improving low-dimensional manifolds obtained via Principal Component Analysis (PCA).