Uniform Manifold Approximation and Projection

Overview


Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

  1. The data is uniformly distributed on a Riemannian manifold;
  2. The Riemannian metric is locally constant (or can be approximated as such);
  3. The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

The important thing is that you don't need to worry about the details: you can use UMAP right now for dimension reduction and visualisation, as easily as a drop-in replacement for scikit-learn's t-SNE.

Documentation is available via Read the Docs.

New: this package now also provides support for densMAP. The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure of the data. Details of this method are described in the following paper:

Narayan, A, Berger, B, Cho, H, Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability, bioRxiv, 2020

Installing

UMAP depends upon scikit-learn, and thus scikit-learn's dependencies such as numpy and scipy. UMAP adds a requirement for numba for performance reasons. The original version used Cython, but the improved code clarity, simplicity and performance of Numba made the transition worthwhile.

Requirements:

  • Python 3.6 or greater
  • numpy
  • scipy
  • scikit-learn
  • numba

Recommended packages:

  • pynndescent
  • For plotting
    • matplotlib
    • datashader
    • holoviews
  • for Parametric UMAP
    • tensorflow > 2.0.0

Installing pynndescent can significantly increase performance, and in later versions it will become a hard dependency.

Install Options

Conda install, via the excellent work of the conda-forge team:

conda install -c conda-forge umap-learn

The conda-forge packages are available for Linux, OS X, and Windows 64 bit.

PyPI install, presuming you have numba and scikit-learn and their requirements (numpy and scipy) installed:

pip install umap-learn

If you wish to use the plotting functionality you can use

pip install umap-learn[plot]

to install all the plotting dependencies.

If you wish to use Parametric UMAP, you need to install Tensorflow, which can be installed either using the instructions at https://www.tensorflow.org/install (recommended) or using

pip install umap-learn[parametric_umap]

for a CPU-only version of Tensorflow.

If pip is having difficulties pulling the dependencies then we'd suggest installing the dependencies manually using anaconda followed by pulling umap from pip:

conda install numpy scipy
conda install scikit-learn
conda install numba
pip install umap-learn

For a manual install get this package:

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master

Install the requirements

sudo pip install -r requirements.txt

or

conda install scikit-learn numba

Install the package

python setup.py install

How to use UMAP

The umap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

There are a number of parameters that can be set for the UMAP class; the major ones are as follows:

  • n_neighbors: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.
  • min_dist: This controls how tightly the embedding is allowed to compress points together. Larger values ensure embedded points are more evenly distributed, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.
  • metric: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already coded, and a user-defined function can be passed as long as it has been JIT-compiled by numba (see the sketch after the example below).

An example of making use of these options:

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(n_neighbors=5,
                      min_dist=0.3,
                      metric='correlation').fit_transform(digits.data)
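
A user-defined metric can be supplied the same way, provided it has been JIT-compiled with numba. A minimal sketch (the manhattan function here is illustrative, not part of the library):

import numba

@numba.njit()
def manhattan(x, y):
    # L1 distance: sum of absolute coordinate differences
    result = 0.0
    for i in range(x.shape[0]):
        result += abs(x[i] - y[i])
    return result

# reusing the digits data loaded in the example above
embedding = umap.UMAP(metric=manhattan).fit_transform(digits.data)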

UMAP also supports fitting to sparse matrix data, as in the sketch below (reusing the digits data from above). For more details please see the UMAP documentation.
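
from scipy.sparse import csr_matrix

# the digits data is dense, so converting it here is purely illustrative;
# any scipy sparse matrix can be passed directly to fit/fit_transform
sparse_digits = csr_matrix(digits.data)
embedding = umap.UMAP().fit_transform(sparse_digits)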

Benefits of UMAP

UMAP has a few significant wins in its current incarnation.

First of all UMAP is fast. It can handle large datasets and high dimensional data without too much difficulty, scaling beyond what most t-SNE packages can manage. This includes very high dimensional sparse datasets. UMAP has successfully been used directly on data with over a million dimensions.

Second, UMAP scales well in embedding dimension—it isn't just for visualisation! You can use UMAP as a general purpose dimension reduction technique as a preliminary step to other machine learning tasks. With a little care it partners well with the hdbscan clustering library (for more details please see Using UMAP for Clustering).
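
A sketch of that pattern, assuming data is an (n_samples, n_features) NumPy array; the parameter values shown are illustrative rather than tuned recommendations:

import umap
import hdbscan

# reduce to a modest number of dimensions first; min_dist=0.0 packs points
# tightly, which tends to suit density-based clustering
clusterable = umap.UMAP(n_neighbors=30, min_dist=0.0,
                        n_components=10).fit_transform(data)
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(clusterable)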

Third, UMAP often performs better at preserving some aspects of global structure of the data than most implementations of t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.

Fourth, UMAP supports a wide variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally embed word vectors properly using cosine distance!

Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.
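
For instance, a minimal sketch with the digits data:

import umap
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test = train_test_split(digits.data, random_state=42)

# fit on the training set, then project previously unseen points
mapper = umap.UMAP().fit(X_train)
test_embedding = mapper.transform(X_test)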

Sixth, UMAP supports supervised and semi-supervised dimension reduction. This means that if you have label information that you wish to use as extra information for dimension reduction (even if it is just partial labelling) you can do that—as simply as providing it as the y parameter in the fit method.
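
A sketch of both modes, reusing the digits data from above; for the semi-supervised case the convention is to mark unlabelled points with -1 (as in scikit-learn's semi-supervised estimators), and the 90% masking here is just for illustration:

import numpy as np

# fully supervised: pass the labels as y
supervised = umap.UMAP().fit_transform(digits.data, y=digits.target)

# semi-supervised: hide most labels by setting them to -1
partial = np.array(digits.target, copy=True)
partial[np.random.rand(len(partial)) < 0.9] = -1
semi_supervised = umap.UMAP().fit_transform(digits.data, y=partial)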

Seventh, UMAP supports a variety of additional experimental features, including: an "inverse transform" that can approximate a high dimensional sample that would map to a given position in the embedding space; the ability to embed into non-Euclidean spaces, including hyperbolic embeddings and embeddings with uncertainty; and very preliminary support for embedding dataframes.
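
A minimal sketch of the inverse transform, reusing the digits data; the embedding-space coordinates below are arbitrary illustrative points, not meaningful locations:

import numpy as np

mapper = umap.UMAP().fit(digits.data)

# map a couple of 2D embedding coordinates back toward data space;
# the results approximate 8x8 digit images
points_2d = np.array([[5.0, 5.0], [10.0, 2.0]])
reconstructed = mapper.inverse_transform(points_2d)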

Finally, UMAP has solid theoretical foundations in manifold learning (see our paper on ArXiv). This both justifies the approach and allows for further extensions that will soon be added to the library.

Performance and Examples

UMAP is very efficient at embedding large high dimensional datasets. In particular it scales well with both input dimension and embedding dimension. For the best possible performance we recommend installing the nearest neighbor computation library pynndescent. UMAP will work without it, but if installed it will run faster, particularly on multicore machines.

For a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, UMAP can complete the embedding in under a minute (as compared with around 45 minutes for scikit-learn's t-SNE implementation). Despite this runtime efficiency, UMAP still produces high quality embeddings.

The obligatory MNIST digits dataset, embedded in 42 seconds (with pynndescent installed and after numba jit warmup) using a 3.1 GHz Intel Core i7 processor (n_neighbors=10, min_dist=0.001):

UMAP embedding of MNIST digits

The MNIST digits dataset is fairly straightforward, however. A better test is the more recent "Fashion MNIST" dataset of images of fashion items (again 70000 data samples in 784 dimensions). UMAP produced this embedding in 49 seconds (n_neighbors=5, min_dist=0.1):

UMAP embedding of "Fashion MNIST"

The UCI shuttle dataset (43500 samples in 8 dimensions) embeds well under correlation distance in 44 seconds (note the longer time required for correlation distance computations):

UMAP embedding the UCI Shuttle dataset

The following is a densMAP visualization of the MNIST digits dataset with 784 features based on the same parameters as above (n_neighbors=10, min_dist=0.001). densMAP reveals that the cluster corresponding to digit 1 is noticeably denser, suggesting that there are fewer degrees of freedom in the images of 1 compared to other digits.

densMAP embedding of the MNIST dataset

Plotting

UMAP includes a subpackage umap.plot for plotting the results of UMAP embeddings. This package needs to be imported separately since it has extra requirements (matplotlib, datashader and holoviews). It allows for fast and simple plotting and attempts to make sensible decisions to avoid overplotting and other pitfalls. An example of use:

import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

mapper = umap.UMAP().fit(digits.data)
umap.plot.points(mapper, labels=digits.target)

The plotting package offers basic plots, as well as interactive plots with hover tools and various diagnostic plotting options. See the documentation for more details.
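
A sketch of an interactive plot, reusing the mapper fitted above; hover_data is an optional pandas DataFrame aligned row-for-row with the input:

import pandas as pd

hover_data = pd.DataFrame({'index': range(len(digits.target)),
                           'label': digits.target})
p = umap.plot.interactive(mapper, labels=digits.target,
                          hover_data=hover_data, point_size=2)
umap.plot.show(p)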

Parametric UMAP

Parametric UMAP provides support for training a neural network to learn a UMAP based transformation of data. This can be used to support faster inference of new unseen data, more robust inverse transforms, autoencoder versions of UMAP and semi-supervised classification (particularly for data well separated by UMAP and very limited amounts of labelled data). See the documentation of Parametric UMAP or the example notebooks for more.
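
A minimal sketch, assuming TensorFlow is installed and that data and new_data are NumPy arrays:

from umap.parametric_umap import ParametricUMAP

# trains a neural network encoder that produces the embedding; the trained
# network makes transforming new data fast
embedder = ParametricUMAP()
embedding = embedder.fit_transform(data)

# new samples pass through the trained encoder
new_embedding = embedder.transform(new_data)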

densMAP

The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure captured by UMAP. One can easily run densMAP using the umap package by setting the densmap input flag:

embedding = umap.UMAP(densmap=True).fit_transform(data)

This functionality is built upon the implementation provided by the densMAP developers, who also contributed to integrating densMAP into the umap package.

densMAP inherits all of the parameters of UMAP. The following is a list of additional parameters that can be set for densMAP:

  • dens_frac: This determines the fraction of epochs (a value between 0 and 1) that will include the density-preservation term in the optimization objective. This parameter is set to 0.3 by default. Note that densMAP switches density optimization on after an initial phase of optimizing the embedding using UMAP.
  • dens_lambda: This determines the weight of the density-preservation objective. Higher values prioritize density preservation, and lower values (closer to zero) prioritize the UMAP objective. Setting this parameter to zero reduces the algorithm to UMAP. Default value is 2.0.
  • dens_var_shift: Regularization term added to the variance of local densities in the embedding for numerical stability. We recommend setting this parameter to 0.1, which consistently works well in many settings.
  • output_dens: When this flag is True, the call to fit_transform returns, in addition to the embedding, the local radii (inverse measure of local density defined in the densMAP paper) for the original dataset and for the embedding. The output is a tuple (embedding, radii_original, radii_embedding). Note that the radii are log-transformed. If False, only the embedding is returned. This flag can also be used with UMAP to explore the local densities of UMAP embeddings. By default this flag is False.

For densMAP we recommend larger values of n_neighbors (e.g. 30) for reliable estimation of local density.

An example of making use of these options (based on a subsample of the mnist_784 dataset):

import umap
from sklearn.datasets import fetch_openml
from sklearn.utils import resample

digits = fetch_openml(name='mnist_784')
subsample, subsample_labels = resample(digits.data, digits.target, n_samples=7000,
                                       stratify=digits.target, random_state=1)

embedding, r_orig, r_emb = umap.UMAP(densmap=True, dens_lambda=2.0, n_neighbors=30,
                                     output_dens=True).fit_transform(subsample)

See the documentation for more details.

Help and Support

Documentation is at Read the Docs. The documentation includes a FAQ that may answer your questions. If you still have questions then please open an issue and I will try to provide any help and guidance that I can.

Citation

If you make use of this software for your work we would appreciate it if you would cite the paper from the Journal of Open Source Software:

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}

If you would like to cite this algorithm in your work the ArXiv paper is the current reference:

@article{2018arXivUMAP,
     author = {{McInnes}, L. and {Healy}, J. and {Melville}, J.},
     title = "{UMAP: Uniform Manifold Approximation
     and Projection for Dimension Reduction}",
     journal = {ArXiv e-prints},
     archivePrefix = "arXiv",
     eprint = {1802.03426},
     primaryClass = "stat.ML",
     keywords = {Statistics - Machine Learning,
                 Computer Science - Computational Geometry,
                 Computer Science - Learning},
     year = 2018,
     month = feb,
}

Additionally, if you use the densMAP algorithm in your work please cite the following reference:

@article {NBC2020,
    author = {Narayan, Ashwin and Berger, Bonnie and Cho, Hyunghoon},
    title = {Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability},
    journal = {bioRxiv},
    year = {2020},
    doi = {10.1101/2020.05.12.077776},
    publisher = {Cold Spring Harbor Laboratory},
    URL = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776},
    eprint = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776.full.pdf},
}

If you use the Parametric UMAP algorithm in your work please cite the following reference:

@article {sainburg2020parametric,
    author = {Sainburg, Tim and McInnes, Leland and Gentner, Timothy Q.},
    title = {Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning},
    journal = {ArXiv e-prints},
    archivePrefix = "arXiv",
    eprint = {},
    primaryClass = "stat.ML",
    keywords = {Statistics - Machine Learning,
                Computer Science - Computational Geometry,
                Computer Science - Learning},
    year = 2020,
    }

License

The umap package is 3-clause BSD licensed.

We would like to note that the umap package makes heavy use of NumFOCUS sponsored projects, and would not be possible without their support of those projects, so please consider contributing to NumFOCUS.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation is equally valuable, so please don't feel you can't contribute. To contribute, please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Issues
  • Weird results on dataset

    Tried it on some Swedish parliament voting data. Did a notebook comparing it to t-SNE, which works fine, but umap just produces one big blob. Tried some different parameters without any luck, but honestly I have no clue what either of the parameters does :).

    See the notebook for more info (if you run it you will have nice interactive plots, but I also added a static plot since the interactive ones are stripped from the gist).

    https://gist.github.com/maxberggren/56efa53776f42755b83261c54081496e

    opened by maxberggren 25
  • RecursionError

    Hi, thanks for the great package.

    I have a dataset which has 200000 rows and 15 columns. I tried to apply UMAP as follows:

    embedding = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation').fit_transform(data)

    After 10 seconds, I got the following exceptions:

    • RecursionError: maximum recursion depth exceeded while calling a Python object
    • return make_angular_tree(data, indices, rng_state, leaf_size) SystemError: CPUDispatcher(<function angular_random_projection_split at 0x000001C8260D6378>) returned a result with an error set

    I set the system recursion limit to 10000 as below and tried again, but then Python exited with an error code like -143537645.

    sys.setrecursionlimit(10000)

    Is there any solution, workaround or anything I can do for this problem?

    Thanks.

    opened by sametdumankaya 23
  • Attribute Error

    Hi there, these are my system specs: macOS Sierra 10.12.3 (16D32)

    I have installed umap through pip. When I try to run it, this is the error message that comes up. I'm unsure what the problem is, any ideas?

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-10-68ef34dfa695> in <module>()
         16         umap_mfccs = get_scaled_umap_embeddings(mfcc_features,
         17                                                 neighbours,
    ---> 18                                                 distances)
         19         umap_embeddings_mfccs.append(umap_mfccs)
         20 
    
    <ipython-input-10-68ef34dfa695> in get_scaled_umap_embeddings(features, neighbour, distance)
          1 def get_scaled_umap_embeddings(features, neighbour, distance):
          2 
    ----> 3     embedding = umap.UMAP(n_neighbors=neighbour,
          4                           min_dist = distance,
          5                           metric = 'correlation').fit_transform(features)
    
    AttributeError: module 'umap' has no attribute 'UMAP'
    
    
    opened by Jhard01 23
  • Numba warnings

    First I want to thank you for sharing this great work. It is very useful for me. The first time I execute UMAP fit and transform I get some numba warnings. They do not seem to be critical, but they are somewhat disturbing.

    numba.__version__ '0.44.0'

    sys.version '3.7.3 (default, Mar 27 2019, 22:11:17) \n[GCC 7.3.0]'

    reducer = umap.UMAP(random_state=42)
    reducer.fit(X)
    Y = reducer.transform(X)
    
    /home/mbr085/anaconda3/envs/divisivegater/lib/python3.7/site-packages/umap/umap_.py:349: NumbaWarning: 
    Compilation is falling back to object mode WITH looplifting enabled because Function "fuzzy_simplicial_set" failed type inference due to: Untyped global name 'nearest_neighbors': cannot determine Numba type of <class 'function'>
    
    File "../../../../../anaconda3/envs/divisivegater/lib/python3.7/site-packages/umap/umap_.py", line 467:
    def fuzzy_simplicial_set(
        <source elided>
        if knn_indices is None or knn_dists is None:
            knn_indices, knn_dists, _ = nearest_neighbors(
            ^
    
      @numba.jit()
    /home/mbr085/anaconda3/envs/divisivegater/lib/python3.7/site-packages/numba/compiler.py:725: NumbaWarning: Function "fuzzy_simplicial_set" was compiled in object mode without forceobj=True.
    
    File "../../../../../anaconda3/envs/divisivegater/lib/python3.7/site-packages/umap/umap_.py", line 350:
    @numba.jit()
    def fuzzy_simplicial_set(
    ^
    
      self.func_ir.loc))
    /home/mbr085/anaconda3/envs/divisivegater/lib/python3.7/site-packages/numba/compiler.py:734: NumbaDeprecationWarning: 
    Fall-back from the nopython compilation path to the object mode compilation path has been detected, this is deprecated behaviour.
    
    For more information visit http://numba.pydata.org/numba-doc/latest/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit
    
    File "../../../../../anaconda3/envs/divisivegater/lib/python3.7/site-packages/umap/umap_.py", line 350:
    @numba.jit()
    def fuzzy_simplicial_set(
    ^
    
      warnings.warn(errors.NumbaDeprecationWarning(msg, self.func_ir.loc))
    
    Good Reads 
    opened by mbr085 22
  • Break out inner loop of optimize_layout function

    ...so numba can parallelize it effectively across all cores of a machine.

    During benchmarking, I found that numba was not parallelizing optimize_layout, which is the most compute intensive part of UMAP. By extracting the code that runs for each epoch into its own function and decorating with numba's parallel=True I found that the computation can take advantage of all cores in a machine.

    For example, running locally on a 4-core Mac I found one benchmark (of optimize_layout) went from ~110s to ~45s. And on a 16-core Linux box, the UMAP step of Scanpy went from ~206s to ~73s (for 130K rows).

    opened by tomwhite 21
  • Implement inverse transform

    I implemented an inverse transform allowing a user to find vectors in data space corresponding to points in embedding space. The basic algorithm is the same as forward UMAP except that the simplicial set is based on a Delaunay complex rather than a Vietoris-Rips complex to provide for smooth behavior in the gaps between clusters. However, this means that inverse_transform(transform(points)) != points in general.

    It was necessary to modify optimize_layout to allow for non-Euclidean distance metrics in the "embedding" space so that it could be used backwards. It is also now possible to choose whether the membership function used is for the input set or the output set. As a side effect, it is now possible to use UMAP to embed data onto a torus or a sphere (or any manifold for which you define a distance and gradient).

    Example scripts for inverse transformation and embedding onto a torus or a sphere are in the examples directory.

    This implementation is still a little rough but is a proof of concept. Currently, only the euclidean and haversine metrics are supported. Functions returning the gradient of other metrics are still needed.

    Embedding of the sklearn digits dataset onto a torus:

    Embedding of the sklearn digits dataset onto a sphere:

    Inverse transform of a grid of points (shown as black ×s) back into data space. The dataset is MNIST. Some points look like simple pixel-wise averages of neighboring points. The sharpness of generated vectors increases with higher values of n_epochs.

    opened by josephcourtney 20
  • Stuck at constructing embedding?

    I currently have a dataset with more than 10 million rows of data and 384 dimensions. I use PCA to reduce the 384 dimensions to 10, and then apply UMAP via the BERTopic library.

    To avoid running into memory issues, I am using a machine with 1TB of RAM and 128 cores. However, it seems that the process hangs at "Construct embedding", and only about 500GB of RAM is being used (so it is not a memory issue).

    Here are the code and verbose:

    
    embeddings = np.load('embeddings.npy')
    
    pca = PCA(n_components=10)
    embeddings_pca = pca.fit_transform(embeddings)
    
    vectorizer_model = CountVectorizer(ngram_range=(1, 2), stop_words="english")
    
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine', low_memory = True, verbose=True)
    
    # Setting HDBSCAN model
    hdbscan_model = HDBSCAN(min_cluster_size=10, metric='euclidean', cluster_selection_method='eom', prediction_data=True)
    
    topic_model = BERTopic(umap_model = umap_model, hdbscan_model=hdbscan_model,  verbose=True, seed_topic_list=seed_topic_list, low_memory=True, calculate_probabilities=True, vectorizer_model=vectorizer_model)
    
    #topics, probs = topic_model.fit_transform(docs)
    
    topic_model = topic_model.fit(docs, embeddings_pca)
    
    UMAP(angular_rp_forest=True, dens_frac=0.0, dens_lambda=0.0, metric='cosine',
         min_dist=0.0, n_components=5, verbose=True)
    Construct fuzzy simplicial set
    Tue Sep 28 11:33:15 2021 Finding Nearest Neighbors
    Tue Sep 28 11:33:15 2021 Building RP forest with 64 trees
    Tue Sep 28 11:34:42 2021 NN descent for 23 iterations
    	 1  /  23
    	 2  /  23
    	Stopping threshold met -- exiting after 2 iterations
    Tue Sep 28 11:49:29 2021 Finished Nearest Neighbor Search
    Tue Sep 28 11:50:33 2021 Construct embedding
    

    If I understand correctly, the most memory-consuming step should be the nearest neighbour search (which completed with no issue)? Why does it get stuck at constructing the embedding?

    opened by ginward 19
  • Converging to a single point

    I'm using UMAP to embed a bunch of 128 dimensional face embeddings generated by a neural net.

    As I increase the number of embeddings (I have 3M total), the output from UMAP converges to a single point in the center surrounded by a sparse cloud. How can I fix this? Here are some examples, from fewer samples to more: n = 73728, 114688, 172032, 196608, 245760.

    opened by kylemcdonald 17
  • Add a fast way to compute knn_indices

    Tested on a [50k, 50k] distance matrix computed on 50k np.random.normal points, where this runs in 6.8s-7.0s on a 120-core machine, whereas the old code takes 245s.

    opened by sleighsoft 16
  • Are Euclidean distances interpretable in the embedding?

    Hi,

    I have a couple of questions regarding the interpretability of the results obtained with UMAP. I apologize if the questions are too trivial and should be posted elsewhere.

    Third, UMAP often performs better at preserving aspects of global structure of the data than t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.

    I was wondering whether Euclidean distances in the embedding are interpretable. Taking the MNIST example:

    MNIST example

    Does it make sense to say that the digit "3" is as distant from "0" as it is from "1"? Or that because the area of the cluster of "1"s is larger than the area of the cluster of "8"s, there is more variability in "1"s than there is in "8"s?

    Is there any way of interpreting what the axes mean? That is, if we were to take a random point in the 2D space, could we apply an inverse transformation to figure out what the N-D object would have looked like?

    Thank you for your awesome project!

    Good Reads 
    opened by dangom 16
  • Intermittent ZeroDivisionError: division by zero

    This is a fantastic library, thanks very much for your great work. Periodically though, I'm getting a ZeroDivisionError: division by zero while building a UMAP projection. My data doesn't change, nor does the way I call the UMAP constructor:

    model = umap.UMAP(n_neighbors=25, min_dist=0.00001, metric='correlation')
    fit_model = model.fit_transform( np.array(image_vectors) )
    

    Once in a while (maybe 5% of runs) this throws the following trace (umap version 0.1.5):

    File "imageplot.py", line 278, in <module>
        Imageplot(image_dir=sys.argv[1], output_dir='output')
      File "imageplot.py", line 30, in __init__
        self.create_2d_projection()
      File "imageplot.py", line 148, in create_2d_projection
        model = self.build_model(image_vectors)
      File "imageplot.py", line 175, in build_model
        return model.fit_transform( np.array(image_vectors) )
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1402, in fit_transform
        self.fit(X)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 1361, in fit
        self.verbose
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 385, in rptree_leaf_array
        angular=angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 315, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 310, in make_tree
        angular)
      File "/Users/yaledhlab/anaconda3/lib/python3.6/site-packages/umap/umap_.py", line 301, in make_tree
        rng_state)
    ZeroDivisionError: division by zero
    

    I took a quick look at the make_tree function but that didn't show much--the real problem seems to be swallowed in the stacktrace by the recursion. Do you have an idea what might cause this? I'll upgrade to the latest master and see if the problem continues.

    opened by duhaime 16
  • [BUG] Very Serious Bug When using "metric=precomputed"

    First of all, I want to run UMAP on a very large dataset for dimension reduction. The documentation suggests that a precomputed distance matrix could speed up the computation.

    So I used the "metric=precomputed" switch to get better speed. However, I found the result is completely different from the original method! I've already located the problem; here is an introduction, with example code:

    import umap
    import scipy 
    import numpy as np
    from sklearn.metrics import pairwise_distances
    
    #version for  original code 
    rd = np.random.RandomState(888) 
    aa = rd.normal(5, 1, [40,50])
    model1 = umap.UMAP(random_state=10)
    embedding = model1.fit_transform(aa)
    
    print(embedding)
    
    #using precomputed distance
    bb = pairwise_distances(aa)
    from scipy.sparse import csr_matrix
    cc = csr_matrix(bb)
    
    model3  = umap.UMAP(metric='precomputed',random_state=10)
    embedding2 = model3.fit_transform(cc)
    print(embedding2)
    
    

    Here is the first problem: if a precomputed matrix is to be used, it should be a sparse matrix! Otherwise the code will compute the distance matrix again.

    umap.py, line 2364:

    if self.metric == "precomputed" and self._sparse_data:
        # For sparse precomputed distance matrices, we just argsort the rows to find
        # nearest neighbors. To make this easier, we expect matrices that are
        # symmetrical (so we can find neighbors by looking at rows in isolation,
        # rather than also having to consider that sample's column too).
        # print("Computing KNNs for sparse precomputed distances...")
        if sparse_tril(X).getnnz() != sparse_triu(X).getnnz():
    

    The results from the code above are:

    #embedding 1
    [[-6.777242  -2.9653606]
     [-5.312394  -3.720777 ]
     [-7.0402703 -4.865579 ]
     [-7.494709  -4.1032014]
     [-6.739107  -4.2637396]
     [-6.359506  -4.032858 ]
        .......
    
    #embedding 2
      [[-8.083555  15.876725 ]
     [-8.736375  14.2785635]
     [-6.570646  15.637847 ]
     [-6.1307974 14.759017 ]
     [-6.2153373 15.248924 ]
     [-6.284779  16.172754 ]
     ......
    
    
    

    So I tracked through the original code to find what causes the difference, and I found it here:

    umap.py, line 562, in the function "fuzzy_simplicial_set"

    This is a core function of umap:

    
    knn_indices, knn_dists, _ = nearest_neighbors(
        X,
        n_neighbors,
        metric,
        metric_kwds,
        angular,
        random_state,
        verbose=verbose,
    )
    
    

    The bug: if X is a sparse matrix, the computed result is different from when X is a dense matrix!

    Results:

    # when x is sparse matrix
    # x
    <40x40 sparse matrix of type '<class 'numpy.float32'>'
    	with 1560 stored elements in Compressed Sparse Row format>
    X.shape
    Out[3]: (40, 40)
    
    knn_dists
    [[ 7.33820343,  7.59337425,  8.05820179,  8.16062737,  8.27206707,
             8.3769598 ,  8.52959538,  8.54904938,  8.54981995,  8.65421963,
             8.75841427,  8.81221962,  8.81944847,  8.87848377,  8.88055706],
           [ 9.24660969,  9.44158745,  9.50287247,  9.5380373 ,  9.75831985,
             9.76912975,  9.88988495,  9.95298195, 10.00327873, 10.03878117,
            10.13149166, 10.21953869, 10.27251053, 10.27710056, 10.33901501],
    .......
    
    # we convert X to a dense matrix
    # using the same function again; the rest of the parameters are all the same
    
    aa = X.todense()
    
    a, b, _ = nearest_neighbors(
        aa,
        n_neighbors,
        metric,
        metric_kwds,
        angular,
        random_state,
        verbose=verbose,
    )
    
    # show b
    b
    matrix([[ 0.       ,  7.3382034,  7.5933743,  8.058202 ,  8.160627 ,
              8.272067 ,  8.37696  ,  8.529595 ,  8.549049 ,  8.54982  ,
              8.65422  ,  8.758414 ,  8.81222  ,  8.819448 ,  8.878484 ],
            [ 0.       ,  9.24661  ,  9.441587 ,  9.502872 ,  9.538037 ,
              9.75832  ,  9.76913  ,  9.889885 ,  9.952982 , 10.003279 ,
             10.038781 , 10.131492 , 10.219539 , 10.272511 , 10.277101 ],
            [ 0.       ,  8.134261 ,  8.294831 ,  8.376041 ,  8.377901 ,
              8.489849 ,  8.676778 ,  8.7918825,  8.965539 ,  9.050419 ,
              9.153272 ,  9.177004 ,  9.195522 ,  9.284557 ,  9.451907 ],
        .......
    
    

    This means the same nearest_neighbors function will generate different output when X is a dense matrix versus a sparse matrix!

    umap.py, line 254:

    def nearest_neighbors(
        X,
        n_neighbors,
        metric,
        metric_kwds,
        angular,
        random_state,
        low_memory=True,
        use_pynndescent=True,
        n_jobs=-1,
        verbose=False,
    ):
    
    

    This is a very serious problem; please fix it.

    My package version:

    umap.__version__ == '0.5.3'

    opened by Wall-ee 0
  • fix relations_dictionary problems which prevent aligned_umap from updating correctly

    Fixes AlignedUMAP.update, which does not work correctly.

    The relations dictionary is updated with the inverse of the provided relations dictionary. However, the inverse dictionary should be used only in init_from_existing while generating new_embedding.

    The problem is not revealed when the number of entities is equal across updates and the inverse relation is equal to the relation itself. However, when the number of entities increases over updates and the inverse relation is not equal to the relation itself, the updates do not create a correct map.

    Here is an example notebook showing how AlignedUMAP works over 10 years of data where the number of entities increases over time. If all 10 years' data is provided to AlignedUMAP.fit at once, it creates correct maps. However, if AlignedUMAP.fit is used with partial data and the rest is fed in with AlignedUMAP.update, the updates give incorrect mappings.

    I have some other suggestions, but I would like to ask for your comments before proceeding:

    • The fit function signature has **fit_params, but it only uses relations and window_size among these parameters. This gives the wrong impression that arbitrary fit parameters can be set. I propose keeping fit as simple as UMAP.fit: receive only X, y and relations, and take window_size from self.alignment_window_size.
    • The update function does not initialize the new mapper with the parameters used at initialization in fit. For example, the initial mappings might use a metric different from the default, but that metric will not be used by the updating mappers. I can change update to initialize the mapper using the same parameters as fit.
    • Are all values in PARAM_NAMES relevant for update? For example, can we change metric for an update?
    • The default n_epochs values used in the mapper and in fit/update are not equal. Does that make any difference? Why is self.n_epochs not updated with the default value? If there is no specific reason, the value could be set in the mapper and aligned_umap could use the value from the mapper.

    If you agree with the proposed changes, I can add them to this PR.

    opened by hndgzkn 2
  • LoweringError: Failed in nopython mode pipeline (step: native lowering)

    When trying to do the following, I get several errors: import umap.umap_ as UMAP

    I asked this same question on Stack Overflow but didn't receive any solution there (I already tried a couple of solutions, which I posted there). I'm not sure how to fix this issue at all and really need umap to work right now; I have a deadline for this project tomorrow (not sure why this suddenly stopped working). Should I just delete anaconda3 and all Python packages and start from scratch?

    https://stackoverflow.com/questions/72245808/loweringerror-failed-in-nopython-mode-pipeline-step-native-lowering?noredirect=1#comment127641512_72245808

    numba is 0.53.0 and umap-learn is 0.5.3. Does it matter that umap has "pypi" as its channel and "pypi_0" as its build while numba doesn't? Both are in anaconda3.

    opened by sofiavaldez77 3
  • Maximum recursion depth error

    UMAP version: 0.5.3; source: conda-forge; Python version: 3.10.4; OS: Windows 11

    I am trying to perform metric learning with UMAP. I train with supervised training data and then use the result on new test data. However, I keep getting the following error:

    RecursionError: Failed in nopython mode pipeline (step: convert to parfors)
    maximum recursion depth exceeded while calling a Python object
    

    There were similar issues reported here, for instance #99, but the post indicates that it should have been fixed. Adding small amounts of noise doesn't seem to help.

    The training data is approx. 6k samples long and has 24 features, and the test data is approximately 50k samples long. Note that the error only happens with the test data. The training data can be transformed without any issues.

    opened by stallam-unb 0
  • AWS installation error

    /opt/conda/lib/python3.7/site-packages/secretstorage/dhcrypto.py:16: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
      from cryptography.utils import int_from_bytes
    /opt/conda/lib/python3.7/site-packages/secretstorage/util.py:25: CryptographyDeprecationWarning: int_from_bytes is deprecated, use int.from_bytes instead
      from cryptography.utils import int_from_bytes
    Collecting umap-learn
      Using cached umap-learn-0.5.3.tar.gz (88 kB)
      Preparing metadata (setup.py) ... error
      error: subprocess-exited-with-error

    × python setup.py egg_info did not run successfully.
    │ exit code: 1
    ╰─> [20 lines of output]
        Traceback (most recent call last):
          File "", line 36, in
          File "", line 34, in
          File "/tmp/pip-install-yu54aukv/umap-learn_65b51e1e82e94363811f955f8daa853b/setup.py", line 75, in
            setup(**configuration)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/init.py", line 87, in setup
            return distutils.core.setup(**attrs)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/core.py", line 109, in setup
            _setup_distribution = dist = klass(attrs)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 466, in init
            for k, v in attrs.items()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_distutils/dist.py", line 293, in init
            self.finalize_options()
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 885, in finalize_options
            for ep in sorted(loaded, key=by_order):
          File "/opt/conda/lib/python3.7/site-packages/setuptools/dist.py", line 884, in
            loaded = map(lambda e: e.load(), filtered)
          File "/opt/conda/lib/python3.7/site-packages/setuptools/_vendor/importlib_metadata/init.py", line 196, in load
            return functools.reduce(getattr, attrs, module)
        AttributeError: type object 'Distribution' has no attribute '_finalize_feature_opts'
        [end of output]

    opened by alexsomoza 0
  • feature: add ability to specify tools to interactive charts so richer exploration with images and other content is possible

    I ran into a scenario where I needed the interactive plots to show me the images as thumbnails in the tooltip so I can debug the offending samples. Bokeh allows this with its HoverTool concept, which takes in an HTML template and can render the data as desired.

    In my use case, I have HoverTool defined as the following:

    hover_tool = HoverTool(tooltips="""
    <div>
        <div>
            <img src='@image' style='float: left; margin: 5px 5px 5px 5px'/>
        </div>
        <div>
            <span style='font-size: 16px; color: #224499'>Class:</span>
            <span style='font-size: 18px'>@class</span>
        </div>
        <div>
            <span style='font-size: 16px; color: #224499'>Index:</span>
            <span style='font-size: 18px'>$index</span>
        </div>
        <div>
            <span style='font-size: 16px; color: #224499'>Values:</span>
            <span style='font-size: 18px'>($x, $y)</span>
        </div>
    </div>
    """
    

    Wherein my hover_data includes base64 encoded images as one of the columns with header image, generated using a method similar to the following:

        @staticmethod
        def embeddable_image(data):
            img_data = (data[0, 0, ...] * 255).astype(np.uint8)
            image = Image.fromarray(img_data, mode="L").resize((64, 64), Image.BICUBIC)
            buffer = BytesIO()
            image.save(buffer, format="jpeg")
            for_encoding = buffer.getvalue()
            image_blurb = f"data:image/jpg;base64,{base64.b64encode(for_encoding).decode()}"
            return image_blurb
    

    Then I apply this method to samples building hover_data as follows:

    df["image"] = list(map(self.embeddable_image, np.vsplit(img_set, img_set.shape[0])))
    

    where df is my data frame representing hover_data.

    So, when I invoke the interactive umap API:

            p = umap.plot.interactive(
                mapper,
                labels=r_labels_set,
                hover_data=df,
                point_size=2,
                tools=["pan", "wheel_zoom", "box_zoom", "save", "reset", "help", hover_tool],
                color_key_cmap="Paired",
                background="black",
            )
    

    Some example results:

    [screenshots: interactive plots showing image thumbnails in the hover tooltip]

    opened by suneeta-mall 1