An R implementation of the largeVis algorithm for visualizing large, high-dimensional datasets

Overview

largeVis


This is an implementation of the largeVis algorithm described in Tang et al. (2016) (https://arxiv.org/abs/1602.00370). It also incorporates:

  • A very fast algorithm for estimating k-nearest neighbors, implemented in C++ with Rcpp and OpenMP. See the Benchmarks file for performance details.
  • Efficient implementations of the clustering algorithms:
    • HDBSCAN
    • OPTICS
    • DBSCAN
  • Functions for visualizing manifolds.
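
A minimal sketch of how the pieces fit together (hedged: `dat` and the parameter values here are illustrative only; the function names follow the package's exported API as it is used elsewhere on this page):

```r
library(largeVis)

# largeVis expects one observation per column, hence the transpose
dat <- t(as.matrix(iris[, 1:4]))

# One-call interface: neighbor search, weight estimation, and embedding
vis <- largeVis(dat, K = 10, n_trees = 50)
plot(t(vis$coords))

# Or run the stages individually:
neighbors <- randomProjectionTreeSearch(dat, K = 10, n_trees = 50)
edges     <- buildEdgeMatrix(data = dat, neighbors = neighbors)
wij       <- buildWijMatrix(edges)
coords    <- projectKNNs(wij)
```

The step-by-step form is useful for debugging, since several of the issues below involve a failure in one specific stage.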

News Highlights

  • Version 0.1.10 re-adds clustering, and also adds momentum training to largeVis, as well as a host of other features and improvements.
  • Version 0.1.9.1 has been accepted by CRAN. Much gratitude to Uwe Ligges and Kurt Hornik for their assistance, advice, and patience.

Some Examples

MNIST

Wiki Words

Clustering With HDBSCAN

Visualize Embeddings


Building Notes

  • Note on R 3.4: Before R 3.4, the CRAN binaries were likely to have been compiled without OpenMP, and getting OpenMP to work on Mac OS X was somewhat tricky. This should all have changed for the better with R 3.4, which uses clang 4.0 natively by default. Since R 3.4 is new, I'm not yet able to provide advice, but I am interested in hearing of any issues and any workarounds you may discover.
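
For reference, a common way to enable OpenMP on older macOS toolchains was a `~/.R/Makevars` pointing R at an OpenMP-capable compiler. This is a hypothetical sketch assuming llvm installed via Homebrew (`brew install llvm`); the paths and flags are illustrative and will differ per machine:

```make
# ~/.R/Makevars (hypothetical; adjust paths for your setup)
CC      = /usr/local/opt/llvm/bin/clang -fopenmp
CXX     = /usr/local/opt/llvm/bin/clang++ -fopenmp
LDFLAGS = -L/usr/local/opt/llvm/lib
```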
Comments
  • caught segfault : 'memory not mapped'


    I got the following segfault error with largeVis. Since the 'bench' branch was not available, I recompiled from github/master without OpenMP, as suggested here.

    To compile without OpenMP, I wrote the Makevars file as follows:

    PKG_LIBS = $(FLIBS) $(LAPACK_LIBS) $(BLAS_LIBS)
    PKG_CXXFLAGS = -DARMA_64BIT_WORD -DNDEBUG
    CXX_STD=CXX11
    LDFLAGS = $(LDFLAGS)
    

    and compiled as

    R-3.3.1 CMD INSTALL largeVis-master
    

    The error message :

    > library(largeVis)
    Loading required package: Rcpp
    Loading required package: Matrix
    
    Attaching package: ‘Matrix’
    
    The following object is masked from ‘package:tidyr’:
    
        expand
    
    largeVis was compiled without OpenMP support.
    > neig<-randomProjectionTreeSearch(t(dat.small.matrix), K=10, tree_threshold = 100, max_iter = 15, n_trees = 10)
    
     *** caught segfault ***
    address 0x75a8, cause 'memory not mapped'
    
    Traceback:
     1: .Call("largeVis_searchTrees", PACKAGE = "largeVis", threshold,     n_trees, K, maxIter, data, distMethod, seed, threads, verbose)
     2: searchTrees(threshold = as.integer(tree_threshold), n_trees = as.integer(n_trees),     K = as.integer(K), maxIter = as.integer(max_iter), data = x,     distMethod = as.character(distance_method), seed = seed,     threads = threads, verbose = as.logical(verbose))
     3: randomProjectionTreeSearch.matrix(t(dat.small.matrix), K = 10,     tree_threshold = 100, max_iter = 15, n_trees = 10)
     4: randomProjectionTreeSearch(t(dat.small.matrix), K = 10, tree_threshold = 100,     max_iter = 15, n_trees = 10)
    
    

    Maybe I didn't compile it properly since the error still occurs in the 'multiprocessing step'.

    opened by NagaComBio 30
  • Installation error


    I'm using Rtools to install the source files (through R in Windows 7). The R version is 3.3.0 and gcc version is 4.9.3. The command I use to install in R is install.packages("largeVis", repos = NULL, type="source", verbose = T, quiet = F)

    The error information is as follows. (For convenience, I've picked out the key line: largeVis.cpp:58:6: error: cannot convert 'int*' to 'vertexidxtype* {aka long long int*}' in assignment)

    d:/R/Rtools/mingw_32/bin/g++ -std=c++0x -I"D:/R/R-33~1.0/include" -DNDEBUG -I"D:/R/R-3.3.0/library/Rcpp/include" -I"D:/R/R-3.3.0/library/RcppProgress/include" -I"D:/R/R-3.3.0/library/RcppArmadillo/include" -I"D:/R/R-3.3.0/library/testthat/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -fopenmp -DARMA_64BIT_WORD -O2 -Wall -mtune=core2 -c gradients.cpp -o gradients.o
    d:/R/Rtools/mingw_32/bin/g++ -std=c++0x -I"D:/R/R-33~1.0/include" -DNDEBUG -I"D:/R/R-3.3.0/library/Rcpp/include" -I"D:/R/R-3.3.0/library/RcppProgress/include" -I"D:/R/R-3.3.0/library/RcppArmadillo/include" -I"D:/R/R-3.3.0/library/testthat/include" -I"d:/Compiler/gcc-4.9.3/local330/include" -fopenmp -DARMA_64BIT_WORD -O2 -Wall -mtune=core2 -c largeVis.cpp -o largeVis.o
    largeVis.cpp: In member function 'void Visualizer::initAlias(arma::ivec&, const vec&, const ivec&, Rcpp::Nullable<Rcpp::Vector<14, Rcpp::PreserveStorage> >)':
    largeVis.cpp:58:6: error: cannot convert 'int*' to 'vertexidxtype* {aka long long int*}' in assignment
       ps = newps.memptr();
          ^
    largeVis.cpp: In function 'arma::mat sgd(arma::mat, arma::ivec&, arma::ivec&, arma::ivec&, arma::vec&, double, double, long long int, int, double, Rcpp::Nullable<Rcpp::Vector<14, Rcpp::PreserveStorage> >, bool)':
    largeVis.cpp:173:41: error: no matching function for call to 'Visualizer::Visualizer(int*, int*, const uword&, coordinatetype*, const int&, double, long long int)'
         (iterationtype) n_samples);
                                 ^
    largeVis.cpp:173:41: note: candidate is:
    largeVis.cpp:34:3: note: Visualizer::Visualizer(vertexidxtype*, vertexidxtype*, dimidxtype, coordinatetype*, int, distancetype, iterationtype)
       Visualizer(vertexidxtype * sourcePtr,
       ^
    largeVis.cpp:34:3: note: no known conversion for argument 1 from 'int*' to 'vertexidxtype* {aka long long int*}'
    make: *** [largeVis.o] Error 1
    Warning: running command 'make -f "Makevars" -f "D:/R/R-33~1.0/etc/i386/Makeconf" -f "D:/R/R-33~1.0/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="largeVis.dll" OBJECTS="RcppExports.o dbscan.o denseneighbors.o distance.o edgeweights.o gradients.o largeVis.o sparse.o"' had status 2
    ERROR: compilation failed for package 'largeVis'

    • removing 'D:/R/R-3.3.0/library/largeVis'
    opened by gryang11 26
  • hdbscan-non-numeric argument to binary operator


    library(largeVis)
    set.seed(123)
    ts_matrix_elec <- elect_data %>% scale() %>% t()
    visObject <- largeVis(ts_matrix_elec, n_trees = 50, K = 10)
    plot(t(visObject$coords))

    clusters <- hdbscan(visObject, verbose = FALSE)  # fails:
    # Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
    #   non-numeric argument to binary operator

    gplot(clusters, t(visObject$coords))

    What happened? Are there any suggestions?

    opened by bifeng 24
  • testcfunctions.cpp fails (Win)... what is it supposed to do?


    On a github clone with R-3.4.1, the tests from testcfunctions.cpp fail:

    testthat results ================================================================
    OK: 147 SKIPPED: 1 FAILED: 1
    1. Failure: Catch unit tests pass (@test-cpp.R#6) 
    
    

    and above:

    testcfunctions.cpp:9
    ...............................................................................
    
    testcfunctions.cpp:17: FAILED:
      CATCH_CHECK( testAlias() == 71 )
    with expansion:
      83 == 71
    
    testcfunctions.cpp:18: FAILED:
      CATCH_CHECK( testAlias() == 74 )
    with expansion:
      97 == 74
    
    testcfunctions.cpp:19: FAILED:
      CATCH_CHECK( testAlias() == 70 )
    with expansion:
      68 == 70
    
    testcfunctions.cpp:20: FAILED:
      CATCH_CHECK( testAlias() == 90 )
    with expansion:
      88 == 90
    

    etc. But I don't really understand what these tests are supposed to verify. It looks like you are seeding the RNG and expecting a certain output?

    Is this expected to fail on Windows possibly?

    opened by meowcat 19
  • randomProjectionTreeSearch gets stuck and never returns


    Apologies in advance if this is not a "bug" but just something I am doing wrong.

    I have a data set of 423K rows and 225 dimensions. I am running the different largeVis steps separately to debug ("randomProjectionTreeSearch", "buildEdgeMatrix", "buildWijMatrix", "projectKNNs"). The first step runs at full speed for a couple of seconds, then settles into a single-threaded load (15% on a 4-core machine) and never returns.

    I have had the same behaviour with a similar dataset (also 423K rows, but with 500 dimensions). In that case, changing the "K" parameter prevented the issue. I have gone over the various hyperparameters but have not been able to find a setting that works for my set of 225 dimensions.

    Is there any way I can debug this, rather than randomly searching the hyperparameter space? I have tried setting options(verbose = TRUE), but this does not output anything.

    Any help would be appreciated. In any case, thanks for your wonderful package!

    Spec:

    • Windows 10 pro
    • 16 GB RAM, Core i7-6700HQ (4 cores)
    • largeVis 0.1.10 x64 (compiled from GitHub source, though I have also tried the CRAN 32-bit version)
    • R 3.3.2 x86_64-w64-mingw32
    opened by avanwouwe 16
  • Why is largeVis found to be slower than Rtsne?


    Hi,

    first of all, thank you very much for this implementation of largeVis. I gave it a try and it worked fine. I had to follow the instructions at http://thecoatlessprofessor.com/programming/rcpp-rcpparmadillo-and-os-x-mavericks-lgfortran-and-lquadmath-error/ for Mac OS X El Capitan to install it first, though.

    The dataset that I tested is comprised of ~15k points with 512D. Running it with default parameters and a single thread took about 1200 seconds while running Rtsne() took about 212 seconds.

    The clusters looked much tighter and mostly better separated than those from Rtsne().

    The longer runtime was a bit unexpected after I had read

    It has been benchmarked at more than 30x faster than Barnes-Hut on datasets of approximately 1-million rows, and scaled linearly as long as there is sufficient RAM.

    in your README.

    Hence, I am curious where this difference comes from and was wondering if you could maybe provide some clarifications here.

    TIA.

    Best,

    Cedric

    opened by claczny 16
  • BuildWijMatrix fails: invalid row or column index


    R 3.2.5; latest version of largeVis; Ubuntu 12.04

    Running largeVis, both by itself and via its step-by-step components, doesn't work. I narrowed it down to the buildWijMatrix step failing, but don't know how to proceed.

    > sessionInfo()
    R version 3.2.5 (2016-04-14)
    Platform: x86_64-pc-linux-gnu (64-bit)
    Running under: Ubuntu precise (12.04.5 LTS)

    locale:
     [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
     [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
     [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
     [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
     [9] LC_ADDRESS=C               LC_TELEPHONE=C
    [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] largeVis_0.2 Matrix_1.2-8 Rcpp_0.12.10

    loaded via a namespace (and not attached):
    [1] colorspace_1.3-2 scales_0.4.1     assertthat_0.1   lazyeval_0.2.0
    [5] plyr_1.8.4       tools_3.2.5      gtable_0.2.0     tibble_1.2
    [9] ggplot2_2.2.1    grid_3.2.5       munsell_0.4.3    lattice_0.20-34

    > dim(as.matrix(seurat_sw480@data))
    [1] 15843  1691
    > neighbors <- randomProjectionTreeSearch(as.matrix(seurat_sw480@data), n_trees = 5, max_iter = 1, verbose = T)
    Searching for neighbors.
      0%   10   20   30   40   50   60   70   80   90   100%
      |----|----|----|----|----|----|----|----|----|----|
      **************************************************|
    > edges <- buildEdgeMatrix(data = as.matrix(seurat_sw480@data), neighbors = neighbors)
    > gc()
                used  (Mb) gc trigger   (Mb)  max used   (Mb)
    Ncells   1124907  60.1    1770749   94.6   1770749   94.6
    Vcells 118347552 903.0  196989268 1503.0 187146708 1427.9
    > rm(neighbors)
    > wij <- buildWijMatrix(edges)

    error: SpMat::SpMat(): invalid row or column index
    Error in referenceWij(is, x@i, x@x^2, as.integer(threads), perplexity) :
      SpMat::SpMat(): invalid row or column index

    opened by billytcl 15
  • meaning of tree failure.


    Thank you for this great work. I have a question: some datasets lead to a "tree failure" exception in the function "copyHeapToMatrix". What does it mean, and how can I avoid it when preparing the dataset?

    **************************************************
    terminate called after throwing an instance of 'Rcpp::exception'
      what():  Tree failure.
    Aborted
    
    opened by sparktsao 15
  • Error in newest R package version


    Hi, I get an error, when I load the package:

    > library(largeVis)
    Loading required package: Matrix
    Error : object ‘opticsXi’ is not exported by 'namespace:dbscan'
    Error: package or namespace load failed for ‘largeVis’

    My sessionInfo() is below. When I google it, I find this, so somebody else has had this error too. They point to the CRAN check page, where this error appears as well: https://cran.r-project.org/web/checks/check_results_largeVis.html

    Is there an issue with the newest version? Thank you!

    R version 3.3.1 (2016-06-21)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 7 x64 (build 7601) Service Pack 1

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252
    [3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
    [5] LC_TIME=English_United States.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] Matrix_1.2-6

    loaded via a namespace (and not attached):
    [1] tools_3.3.1     Rcpp_0.12.7     grid_3.3.1      dbscan_1.0-0    lattice_0.20-33

    opened by rwarnung 12
  • Error when installing largeVis on Ubuntu


    Hi, I get this error when I try to install your package with install_github("elbamos/largeVis"):

    ** R
    ** inst
    ** byte-compile and prepare package for lazy loading
    ** help
    *** installing help indices
    ** building package indices
    ** installing vignettes
    ** testing if installed package can be loaded from temporary location
    Error: package or namespace load failed for ‘largeVis’:
     .onAttach failed in attachNamespace() for 'largeVis', details:
      call: checkBits()
      error: the function 'enterRNGScope' does not exist in package 'Rcpp'
    Error: loading failed
    Execution halted
    ERROR: loading failed
    * removing ‘/usr/local/lib/R/site-library/largeVis’
    Error: Failed to install 'largeVis' from GitHub:
      (converted from warning) installation of package ‘/tmp/RtmpptvoQp/file40d3768bc12b/largeVis_0.2.2.tar.gz’ had non-zero exit status
    
    opened by jsaintvanne 10
  • Errors installing largeVis from GitHub


    Hi, I was trying to install the largeVis package using the R function install_github(), and got the error message below. Can anyone help me solve this?

    In file included from RcppExports.cpp:4:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppArmadillo/include/RcppArmadillo.h:31:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/RcppArmadillo/include/RcppArmadilloForward.h:26:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/RcppCommon.h:29:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/r/headers.h:67:
    In file included from /Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include/Rcpp/platform/compiler.h:100:
    In file included from /usr/local/clang4/bin/../include/c++/v1/cmath:305:
    /usr/local/clang4/bin/../include/c++/v1/math.h:301:15: fatal error: 'math.h' file not found
    #include_next <math.h>
                  ^~~~~~~~
    1 error generated.
    make: *** [RcppExports.o] Error 1
    ERROR: compilation failed for package ‘largeVis’
    * removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/largeVis’
    Error: Failed to install 'largeVis' from GitHub:
      (converted from warning) installation of package ‘/var/folders/9w/9grv0t81461bxp0r26p689b40000gn/T//Rtmpzoqfvg/file3e849b7844c/largeVis_0.2.tar.gz’ had non-zero exit status
    opened by Shawnmhy 9
  • error: SpMat::SpMat(): invalid row or column index


    Hi there, I am trying to use largeVis for clustering. I have about 200 datasets; each has roughly 1,000-100,000 samples with 2 features (the feature count is consistent). While the largeVis function works for almost all of my datasets, I still get this error message for one of them:

    
    error: SpMat::SpMat(): invalid row or column index
    Error in referenceWij(is, x@i, x@x^2, as.integer(threads), perplexity) : 
      SpMat::SpMat(): invalid row or column index
    In addition: Warning message:
    In largeVis(t(as.matrix(memberships[, c("X", "Y")])), dim = 2, K = K,  :
      The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
    
    

    I realized that someone had this problem before, and the solution was to install the 'hotfix/twobugs' branch. I successfully installed that version as well, but no luck. Any ideas? Thanks!

    The dataset is here: data.csv

    The function call I run is:

    largeVis(t(as.matrix(data[, c('X', 'Y')])), dim = 2, K = K, tree_threshold = 100, max_iter = 5, sgd_batches = 1, threads = 1)

    opened by Shawnmhy 3