An R implementation of the largeVis algorithm for visualizing large, high-dimensional datasets

Overview

largeVis


This is an implementation of the largeVis algorithm described in Tang et al. (2016) (https://arxiv.org/abs/1602.00370). It also incorporates:

  • A very fast algorithm for estimating k-nearest neighbors, implemented in C++ with Rcpp and OpenMP. See the Benchmarks file for performance details.
  • Efficient implementations of the clustering algorithms:
    • HDBSCAN
    • OPTICS
    • DBSCAN
  • Functions for visualizing manifolds.
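
A minimal sketch of how the pieces fit together (hedged: `dat` and the parameter values here are illustrative only; the function names follow the package's exported API as it is used elsewhere on this page):

```r
library(largeVis)

# largeVis expects one observation per column, hence the transpose
dat <- t(as.matrix(iris[, 1:4]))

# One-call interface: neighbor search, weight estimation, and embedding
vis <- largeVis(dat, K = 10, n_trees = 50)
plot(t(vis$coords))

# Or run the stages individually:
neighbors <- randomProjectionTreeSearch(dat, K = 10, n_trees = 50)
edges     <- buildEdgeMatrix(data = dat, neighbors = neighbors)
wij       <- buildWijMatrix(edges)
coords    <- projectKNNs(wij)
```

The step-by-step form is useful for debugging, since several of the issues below involve a failure in one specific stage.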

News Highlights

  • Version 0.1.10 re-adds clustering, and also adds momentum training to largeVis, as well as a host of other features and improvements.
  • Version 0.1.9.1 has been accepted by CRAN. Much gratitude to Uwe Ligges and Kurt Hornik for their assistance, advice, and patience.

Some Examples

MNIST

Wiki Words

Clustering With HDBSCAN

Visualize Embeddings


Building Notes

  • Note on R 3.4: Before R 3.4, the CRAN binaries were likely to have been compiled without OpenMP, and getting OpenMP to work on Mac OS X was somewhat tricky. This should all have changed for the better with R 3.4, which uses clang 4.0 natively by default. Since R 3.4 is new, I'm not yet able to provide advice, but I am interested in hearing of any issues and any workarounds you may discover.
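
For reference, a common way to enable OpenMP on older macOS toolchains was a `~/.R/Makevars` pointing R at an OpenMP-capable compiler. This is a hypothetical sketch assuming llvm installed via Homebrew (`brew install llvm`); the paths and flags are illustrative and will differ per machine:

```make
# ~/.R/Makevars (hypothetical; adjust paths for your setup)
CC      = /usr/local/opt/llvm/bin/clang -fopenmp
CXX     = /usr/local/opt/llvm/bin/clang++ -fopenmp
LDFLAGS = -L/usr/local/opt/llvm/lib
```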
Comments
  • caught segfault : 'memory not mapped'


    I got the following segfault error with largeVis. Since the 'bench' branch was not available, I recompiled from github/master without OpenMP, as suggested here.

    To compile without OpenMP, I wrote the Makevars file as follows:

    PKG_LIBS = $(FLIBS) $(LAPACK_LIBS) $(BLAS_LIBS)
    PKG_CXXFLAGS = -DARMA_64BIT_WORD -DNDEBUG
    CXX_STD=CXX11
    LDFLAGS = $(LDFLAGS)
    

    and compiled as

    R-3.3.1 CMD INSTALL largeVis-master
    

    The error message :

    > library(largeVis)
    Loading required package: Rcpp
    Loading required package: Matrix
    
    Attaching package: ‘Matrix’
    
    The following object is masked from ‘package:tidyr’:
    
        expand
    
    largeVis was compiled without OpenMP support.
    > neig<-randomProjectionTreeSearch(t(dat.small.matrix), K=10, tree_threshold = 100, max_iter = 15, n_trees = 10)
    
     *** caught segfault ***
    address 0x75a8, cause 'memory not mapped'
    
    Traceback:
     1: .Call("largeVis_searchTrees", PACKAGE = "largeVis", threshold,     n_trees, K, maxIter, data, distMethod, seed, threads, verbose)
     2: searchTrees(threshold = as.integer(tree_threshold), n_trees = as.integer(n_trees),     K = as.integer(K), maxIter = as.integer(max_iter), data = x,     distMethod = as.character(distance_method), seed = seed,     threads = threads, verbose = as.logical(verbose))
     3: randomProjectionTreeSearch.matrix(t(dat.small.matrix), K = 10,     tree_threshold = 100, max_iter = 15, n_trees = 10)
     4: randomProjectionTreeSearch(t(dat.small.matrix), K = 10, tree_threshold = 100,     max_iter = 15, n_trees = 10)
    
    

    Maybe I didn't compile it properly since the error still occurs in the 'multiprocessing step'.

    opened by NagaComBio 30
  • Installation error


    I'm using Rtools to install the source files (through R in Windows 7). The R version is 3.3.0 and gcc version is 4.9.3. The command I use to install in R is install.packages("largeVis", repos = NULL, type="source", verbose = T, quiet = F)

    The error information is as follows. (For convenience, I've picked out the key line: largeVis.cpp:58:6: error: cannot convert 'int*' to 'vertexidxtype* {aka long long int*}' in assignment)

    d:/R/Rtools/mingw_32/bin/g++ -std=c++0x -I"D:/R/R-33~1.0/include" -DNDEBUG -I"D:/R/R-3.3.0/library/Rcpp/include" -I"D:/R/R-3.3.0/library/RcppProgress/include" -I"D:/R/R-3.3.0/library/RcppArmadillo/include" -I"D:/R/R-3.3.0/library/testthat/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -fopenmp -DARMA_64BIT_WORD -O2 -Wall -mtune=core2 -c gradients.cpp -o gradients.o
    d:/R/Rtools/mingw_32/bin/g++ -std=c++0x -I"D:/R/R-33~1.0/include" -DNDEBUG -I"D:/R/R-3.3.0/library/Rcpp/include" -I"D:/R/R-3.3.0/library/RcppProgress/include" -I"D:/R/R-3.3.0/library/RcppArmadillo/include" -I"D:/R/R-3.3.0/library/testthat/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -fopenmp -DARMA_64BIT_WORD -O2 -Wall -mtune=core2 -c largeVis.cpp -o largeVis.o
    largeVis.cpp: In member function 'void Visualizer::initAlias(arma::ivec&, const vec&, const ivec&, Rcpp::Nullable<Rcpp::Vector<14, Rcpp::PreserveStorage> >)':
    largeVis.cpp:58:6: error: cannot convert 'int*' to 'vertexidxtype* {aka long long int*}' in assignment
       ps = newps.memptr();
          ^
    largeVis.cpp: In function 'arma::mat sgd(arma::mat, arma::ivec&, arma::ivec&, arma::ivec&, arma::vec&, double, double, long long int, int, double, Rcpp::Nullable<Rcpp::Vector<14, Rcpp::PreserveStorage> >, bool)':
    largeVis.cpp:173:41: error: no matching function for call to 'Visualizer::Visualizer(int*, int*, const uword&, coordinatetype*, const int&, double, long long int)'
         (iterationtype) n_samples);
                                 ^
    largeVis.cpp:173:41: note: candidate is:
    largeVis.cpp:34:3: note: Visualizer::Visualizer(vertexidxtype*, vertexidxtype*, dimidxtype, coordinatetype*, int, distancetype, iterationtype)
       Visualizer(vertexidxtype * sourcePtr,
       ^
    largeVis.cpp:34:3: note: no known conversion for argument 1 from 'int*' to 'vertexidxtype* {aka long long int*}'
    make: *** [largeVis.o] Error 1
    Warning: running command 'make -f "Makevars" -f "D:/R/R-33~1.0/etc/i386/Makeconf" -f "D:/R/R-33~1.0/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="largeVis.dll" OBJECTS="RcppExports.o dbscan.o denseneighbors.o distance.o edgeweights.o gradients.o largeVis.o sparse.o"' had status 2
    ERROR: compilation failed for package 'largeVis'

    • removing 'D:/R/R-3.3.0/library/largeVis'
    opened by gryang11 26
  • hdbscan-non-numeric argument to binary operator


    library(largeVis)
    set.seed(123)
    ts_matrix_elec <- elect_data %>% scale() %>% t()
    visObject <- largeVis(ts_matrix_elec, n_trees = 50, K = 10)
    plot(t(visObject$coords))

    clusters <- hdbscan(visObject, verbose = FALSE)  # fails:
    # Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
    #   non-numeric argument to binary operator

    gplot(clusters, t(visObject$coords))

    What happened? Are there any suggestions?

    opened by bifeng 24
  • testcfunctions.cpp fails (Win)... what is it supposed to do?


    On a github clone with R-3.4.1, the tests from testcfunctions.cpp fail:

    testthat results ================================================================
    OK: 147 SKIPPED: 1 FAILED: 1
    1. Failure: Catch unit tests pass (@test-cpp.R#6) 
    
    

    and above:

    testcfunctions.cpp:9
    ...............................................................................
    
    testcfunctions.cpp:17: FAILED:
      CATCH_CHECK( testAlias() == 71 )
    with expansion:
      83 == 71
    
    testcfunctions.cpp:18: FAILED:
      CATCH_CHECK( testAlias() == 74 )
    with expansion:
      97 == 74
    
    testcfunctions.cpp:19: FAILED:
      CATCH_CHECK( testAlias() == 70 )
    with expansion:
      68 == 70
    
    testcfunctions.cpp:20: FAILED:
      CATCH_CHECK( testAlias() == 90 )
    with expansion:
      88 == 90
    

    etc. But I don't really understand what these tests are supposed to verify. It looks like you are seeding the RNG and expecting a certain output?

    Is this expected to fail on Windows possibly?

    opened by meowcat 19
  • randomProjectionTreeSearch gets stuck and never returns


    Apologies in advance if this is not a "bug" but just something I am doing wrong.

    I have a data set of 423K rows and 225 dimensions. I am running the different largeVis steps separately to debug ("randomProjectionTreeSearch", "buildEdgeMatrix", "buildWijMatrix", "projectKNNs"). The first step runs at full speed for a couple of seconds, then settles into a single-threaded load (15% on a 4-core machine) and never returns.

    I have had the same behaviour with a similar dataset (also 423K rows, but with 500 dimensions). In that case, changing the "K" parameter prevented the issue. I have gone over the various hyperparameters but have not been able to find a setting that works for my set of 225 dimensions.

    Is there any way I can debug this, rather than randomly searching the hyperparameter space? I have tried setting options(verbose = TRUE), but this does not output anything.

    Any help would be appreciated. In any case, thanks for your wonderful package!

    Spec:

    • Windows 10 pro
    • 16 GB RAM, Core i7-6700HQ (4 cores)
    • largeVis 0.1.10 x64 (compiled from GitHub source, though I have also tried the CRAN 32-bit version)
    • R 3.3.2 x86_64-w64-mingw32
    opened by avanwouwe 16
  • Why is largeVis found to be slower than Rtsne?


    Hi,

    first of all, thank you very much for this implementation of largeVis. I gave it a try and it worked fine. I had to follow the instructions at http://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error/ for Mac OS X El Capitan to install it first, though.

    The dataset that I tested is comprised of ~15k points with 512D. Running it with default parameters and a single thread took about 1200 seconds while running Rtsne() took about 212 seconds.

    The clusters looked much tighter and mostly better separated than those from Rtsne().

    The longer runtime was a bit unexpected after I had read

    It has been benchmarked at more than 30x faster than Barnes-Hut on datasets of approximately 1-million rows, and scaled linearly as long as there is sufficient RAM.

    in your README.

    Hence, I am curious where this difference comes from and was wondering if you could maybe provide some clarifications here.

    TIA.

    Best,

    Cedric

    opened by claczny 16
  • BuildWijMatrix fails: invalid row or column index


    R 3.2.5; latest version of largeVis; Ubuntu 12.04

    Running largeVis, both by itself and via its step-by-step components, doesn't work. I narrowed it down to the buildWijMatrix step failing, but don't know how to proceed.

    > sessionInfo()
    R version 3.2.5 (2016-04-14)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu precise (12.04.5 LTS)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] largeVis_0.2 Matrix_1.2-8 Rcpp_0.12.10

    loaded via a namespace (and not attached):
    [1] colorspace_1.3-2 scales_0.4.1     assertthat_0.1   lazyeval_0.2.0
    [5] plyr_1.8.4       tools_3.2.5      gtable_0.2.0     tibble_1.2
    [9] ggplot2_2.2.1    grid_3.2.5       munsell_0.4.3    lattice_0.20-34

    > dim(as.matrix(seurat_sw480@data))
    [1] 15843  1691
    > neighbors <- randomProjectionTreeSearch(as.matrix(seurat_sw480@data), n_trees = 5, max_iter = 1, verbose = T)
    Searching for neighbors.
      0%   10   20   30   40   50   60   70   80   90   100%
      |----|----|----|----|----|----|----|----|----|----|
      **************************************************|
    > edges <- buildEdgeMatrix(data = as.matrix(seurat_sw480@data), neighbors = neighbors)
    > gc()
                used  (Mb) gc trigger   (Mb)  max used   (Mb)
    Ncells   1124907  60.1    1770749   94.6   1770749   94.6
    Vcells 118347552 903.0  196989268 1503.0 187146708 1427.9
    > rm(neighbors)
    > wij <- buildWijMatrix(edges)

    error: SpMat::SpMat(): invalid row or column index
    Error in referenceWij(is, x@i, x@x^2, as.integer(threads), perplexity) :
      SpMat::SpMat(): invalid row or column index

    opened by billytcl 15
  • meaning of tree failure.


    Thank you for this great work. I have a question: some datasets lead to a "tree failure" exception in the function "copyHeapToMatrix". What does it mean, and how can I avoid it when preparing the dataset?

    **************************************************
    terminate called after throwing an instance of 'Rcpp::exception'
      what():  Tree failure.
    Aborted
    
    opened by sparktsao 15
  • Error in newest R package version


    Hi, I get an error, when I load the package:

    > library(largeVis)
    Loading required package: Matrix
    Error : object ‘opticsXi’ is not exported by 'namespace:dbscan'
    Error: package or namespace load failed for ‘largeVis’

    My sessionInfo() is below. When I google it, I find this, so somebody else has had this error too. They point to the CRAN check page, where this error appears as well: https://cran.r-project.org/web/checks/check_results_largeVis.html

    Is there an issue with the newest version? Thank you!

    R version 3.3.1 (2016-06-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] Matrix_1.2-6

    loaded via a namespace (and not attached):
    [1] tools_3.3.1     Rcpp_0.12.7     grid_3.3.1      dbscan_1.0-0    lattice_0.20-33

    opened by rwarnung 12
  • Error when installing largeVis on Ubuntu


    Hi, I get this error when I try to install your package with install_github("elbamos/largeVis"):

    ** R
    ** inst
    ** byte-compile and prepare package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded from temporary location
    Error: package or namespace load failed for ‘largeVis’:
     .onAttach failed in attachNamespace() for 'largeVis', details:
      call: checkBits()
      error: the function 'enterRNGScope' does not exist in package 'Rcpp'
    Error: loading failed
    Execution halted
    ERROR: loading failed
    * removing ‘/usr/local/lib/R/site-library/largeVis’
    Error: Failed to install 'largeVis' from GitHub:
      (converted from warning) installation of package ‘/tmp/RtmpptvoQp/file40d3768bc12b/largeVis_0.2.2.tar.gz’ had non-zero exit status
    
    opened by jsaintvanne 10
  • Errors installing largeVis from GitHub


    Hi, I was trying to install the largeVis package using the R function install_github(), and got the error message below. Can anyone help me solve this?

    In file included from RcppExports.cpp:4:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppArmadillo/include/RcppArmadillo.h:31:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppArmadillo/include/RcppArmadilloForward.h:26:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/RcppCommon.h:29:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/r/headers.h:67:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
    In file included from /usr/local/clang4/bin/../include/c++/v1/cmath:305:
    /usr/local/clang4/bin/../include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
    #include_next <math.h>
                  ^~~~~~~~
    1 error generated.
    make: *** [RcppExports.o] Error 1
    ERROR: compilation failed for package ‘largeVis’
    * removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/largeVis’
    Error: Failed to install 'largeVis' from GitHub:
      (converted from warning) installation of package ‘/var/folders/9w/9grv0t81461bxp0r26p689b40000gn/T//Rtmpzoqfvg/file3e849b7844c/largeVis_0.2.tar.gz’ had non-zero exit status
    opened by Shawnmhy 9
  • error: SpMat::SpMat(): invalid row or column index


    Hi there, I am trying to use largeVis for clustering. I have about 200 datasets; each has roughly 1,000-100,000 samples with 2 features (the feature count is consistent). While the largeVis function works for almost all of my datasets, I still get this error message for one of them:

    
    error: SpMat::SpMat(): invalid row or column index
    Error in referenceWij(is, x@i, x@x^2, as.integer(threads), perplexity) : 
      SpMat::SpMat(): invalid row or column index
    In addition: Warning message:
    In largeVis(t(as.matrix(memberships[, c("X", "Y")])), dim = 2, K = K,  :
      The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
    
    

    I realized that someone had this problem before, and the solution was to install the 'hotfix/twobugs' branch. I successfully installed that version as well, but no luck. Any ideas? Thanks!

    The dataset is here: data.csv

    The function call I run is:

    largeVis(t(as.matrix(data[, c('X', 'Y')])), dim = 2, K = K, tree_threshold = 100, max_iter = 5, sgd_batches = 1, threads = 1)

    opened by Shawnmhy 3