Subpopulation detection in high-dimensional single-cell data

Dana Pe'er Lab

Last update: Sep 5, 2022


Overview

PhenoGraph for Python3

PhenoGraph is a clustering method designed for high-dimensional single-cell data. It works by creating a graph ("network") representing phenotypic similarities between cells and then identifying communities in this graph.

This software package includes compiled binaries that run community detection based on C++ code written by E. Lefebvre and J.-L. Guillaume in 2008 ("Louvain method"). The code has been altered to interface more efficiently with the Python code here. It should work on reasonably current Linux, Mac and Windows machines.

To install PhenoGraph, use pip:

pip install PhenoGraph

Expected use is within a script or interactive kernel running Python 3.x. Data are expected to be passed as a numpy.ndarray. When applicable, the code uses CPU multicore parallelism via multiprocessing.

To run basic clustering:

import phenograph
communities, graph, Q = phenograph.cluster(data)

For a dataset of N rows, communities will be a length-N vector of integers specifying a community assignment for each row in the data. Any rows assigned -1 were identified as outliers and should not be considered members of any community. graph is an N x N scipy.sparse matrix representing the weighted graph used for community detection. Q is the modularity score for communities as applied to graph.
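As a rough illustration of how these return values might be handled, the sketch below counts the rows in each community and the rows flagged as outliers. The helper names summarize_communities and cluster_and_summarize are hypothetical and are not part of PhenoGraph; only phenograph.cluster comes from the package.

```python
import numpy as np


def summarize_communities(communities):
    """Return ({label: size}, n_outliers) for a PhenoGraph-style label vector.

    Hypothetical helper: rows labelled -1 are counted as outliers and are
    left out of the per-community sizes.
    """
    communities = np.asarray(communities)
    is_outlier = communities == -1
    labels, counts = np.unique(communities[~is_outlier], return_counts=True)
    sizes = {int(label): int(count) for label, count in zip(labels, counts)}
    return sizes, int(is_outlier.sum())


def cluster_and_summarize(data):
    """Run PhenoGraph on `data` and summarize the result (hypothetical wrapper)."""
    import phenograph  # imported lazily so the helper above works without it

    communities, graph, Q = phenograph.cluster(np.asarray(data))
    sizes, n_outliers = summarize_communities(communities)
    return communities, sizes, n_outliers, Q
```

Filtering with communities != -1 before any downstream per-cluster statistics keeps outliers from being treated as a cluster of their own.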

If you use PhenoGraph in work you publish, please cite our publication:

@article{Levine_PhenoGraph_2015,
  doi = {10.1016/j.cell.2015.05.047},
  url = {http://dx.doi.org/10.1016/j.cell.2015.05.047},
  year  = {2015},
  month = {jul},
  publisher = {Elsevier {BV}},
  volume = {162},
  number = {1},
  pages = {184--197},
  author = {Jacob H. Levine and Erin F. Simonds and Sean C. Bendall and Kara L. Davis and El-ad D. Amir and Michelle D. Tadmor and Oren Litvin and Harris G. Fienberg and Astraea Jager and Eli R. Zunder and Rachel Finck and Amanda L. Gedman and Ina Radtke and James R. Downing and Dana Pe'er and Garry P. Nolan},
  title = {Data-Driven Phenotypic Dissection of {AML} Reveals Progenitor-like Cells that Correlate with Prognosis},
  journal = {Cell}
}

Release Notes

Version 1.5.7

Updated leidenalg and scipy version requirements, revised parallel jaccard to support scipy==1.5.1, and created a test collection for use with pytest (see below).
Added PhenoGraph clustering tutorial with PBMC3K dataset from 10X Genomics (dataset included).

Version 1.5.6

Fixed the multiprocessing code so that the pool is properly closed and joined.

Version 1.5.5

Exposed parameter n_iterations for Leiden, along with minor fixes to manage sorting parallelism, and updated documentation of the clustering and sorting methods.

Version 1.5.4

Faster and more efficient sorting of clusters by size for large nearest neighbours graphs, using multiprocessing and faster sorting methods.
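To make the idea of sorting clusters by size concrete, here is a minimal serial sketch that relabels communities so that label 0 is the largest. It assumes that outliers are marked -1 and should stay untouched; the function name relabel_by_size is hypothetical, and this is not PhenoGraph's own (multiprocessing-based) implementation.

```python
import numpy as np


def relabel_by_size(communities):
    """Relabel communities so 0 is the largest, 1 the next largest, and so on.

    Hypothetical serial sketch; entries equal to -1 (outliers) are preserved.
    """
    communities = np.asarray(communities)
    relabeled = communities.copy()
    assigned = communities >= 0
    labels, counts = np.unique(communities[assigned], return_counts=True)
    # Largest community first; a stable sort keeps ties in original label order.
    order = np.argsort(-counts, kind="stable")
    mapping = {int(old): new for new, old in enumerate(labels[order])}
    relabeled[assigned] = [mapping[int(c)] for c in communities[assigned]]
    return relabeled
```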

Version 1.5.3

PhenoGraph now supports the Leiden algorithm for community detection. The new feature can be called from phenograph.cluster by choosing leiden as the clustering algorithm.
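A call selecting Leiden might look like the sketch below. The keyword names clustering_algo, n_iterations and seed appear in the phenograph.cluster signature; the wrapper functions leiden_kwargs and cluster_with_leiden are hypothetical conveniences, and the string value "leiden" is assumed from the note above.

```python
def leiden_kwargs(n_iterations=None, seed=None):
    """Build keyword arguments for a Leiden run (hypothetical helper).

    Options left as None are omitted so PhenoGraph's own defaults apply.
    """
    kwargs = {"clustering_algo": "leiden"}
    if n_iterations is not None:
        kwargs["n_iterations"] = n_iterations
    if seed is not None:
        kwargs["seed"] = seed
    return kwargs


def cluster_with_leiden(data, **options):
    """Run phenograph.cluster with Leiden selected (hypothetical wrapper)."""
    import phenograph  # imported lazily; requires PhenoGraph and leidenalg

    return phenograph.cluster(data, **leiden_kwargs(**options))
```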

Version 1.5.2

Include simple parallel implementation of brute force nearest neighbors search using scipy's cdist and multiprocessing. This may be more efficient than kdtree on very large high-dimensional data sets and avoids memory issues that arise in sklearn's implementation.
Refactor parallel_jaccard_kernel to remove unnecessary use of ctypes and multiprocessing.Array.
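The brute-force approach can be sketched as follows: distances are computed with scipy's cdist one chunk of rows at a time, so that the full N x N distance matrix is never held in memory. This is a serial illustration under assumed names (brute_force_knn, chunk_size); a parallel version would presumably hand the chunks to a multiprocessing pool, and PhenoGraph's actual implementation may differ.

```python
import numpy as np
from scipy.spatial.distance import cdist


def brute_force_knn(data, k, chunk_size=1000, metric="euclidean"):
    """Return (distances, indices) of the k nearest neighbours of each row.

    Hypothetical serial sketch; requires k < number of rows. Each point is
    excluded from its own neighbour list.
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    idx = np.empty((n, k), dtype=np.int64)
    dist = np.empty((n, k), dtype=float)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Distances from this chunk of rows to every row in the data set.
        d = cdist(data[start:stop], data, metric=metric)
        # Mask self-distances so a point is never its own neighbour.
        d[np.arange(stop - start), np.arange(start, stop)] = np.inf
        nearest = np.argsort(d, axis=1)[:, :k]
        idx[start:stop] = nearest
        dist[start:stop] = np.take_along_axis(d, nearest, axis=1)
    return dist, idx
```

Memory use is governed by chunk_size: each chunk needs a chunk_size x N distance block.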

Version 1.5.1

Make louvain_time_limit a parameter to phenograph.cluster.

Version 1.5

phenograph.cluster can now take as input a square sparse matrix, which will be interpreted as a k-nearest neighbor graph. Note that this graph must have uniform degree (i.e. the same value of k at every point).
The default time_limit for Louvain iterations has been increased to a more generous 2000 seconds (about 33 minutes).
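Before passing a precomputed graph, it may help to confirm the uniform-degree requirement. The check below is a sketch with a hypothetical function name (uniform_degree); it counts stored entries per row, so explicitly stored zeros would be counted as neighbours.

```python
import numpy as np
from scipy import sparse


def uniform_degree(graph):
    """Return k if every row of a square sparse graph stores exactly k entries.

    Hypothetical helper: returns None when the degrees differ and raises
    ValueError when the matrix is not square.
    """
    graph = sparse.csr_matrix(graph)
    if graph.shape[0] != graph.shape[1]:
        raise ValueError("kNN graph must be square")
    degrees = np.diff(graph.indptr)  # number of stored entries in each row
    if degrees.size == 0 or np.any(degrees != degrees[0]):
        return None
    return int(degrees[0])
```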

Version 1.4.1

After observing inconsistent behavior of sklearn.NearestNeighbors with respect to inclusion of self-neighbors, the code now checks that self-neighbors have been included before deleting those entries.
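The kind of check described here can be illustrated with a small sketch. It assumes a neighbour index array of shape (N, k) in which a self-neighbour, when present, occupies the first column; the function name drop_self_neighbors is hypothetical and the logic is not copied from PhenoGraph.

```python
import numpy as np


def drop_self_neighbors(idx):
    """Remove the first column of a kNN index array only if it holds self-neighbours.

    Hypothetical sketch: row i is considered to include itself when idx[i, 0] == i.
    The array is returned unchanged unless that holds for every row.
    """
    idx = np.asarray(idx)
    n = idx.shape[0]
    if np.array_equal(idx[:, 0], np.arange(n)):
        return idx[:, 1:]
    return idx
```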

Version 1.4

The dependence on IPython and/or ipyparallel has been removed. Instead the native multiprocessing package is used.
Multiple CPUs are used by default for computation of nearest neighbors and Jaccard graph.
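For readers unfamiliar with the Jaccard graph, the following serial sketch shows the underlying idea: each kNN edge (i, j) is weighted by the overlap between the neighbour sets of i and j, and edges with no shared neighbours are dropped. The function name jaccard_edges is hypothetical, and PhenoGraph's parallel kernel may differ in detail.

```python
import numpy as np


def jaccard_edges(idx):
    """Weight each kNN edge by the Jaccard similarity of the two neighbour sets.

    Hypothetical serial sketch. `idx` is an (N, k) array of neighbour indices.
    Returns (rows, cols, weights) for edges whose neighbour sets overlap.
    """
    idx = np.asarray(idx)
    neighbor_sets = [set(row.tolist()) for row in idx]
    rows, cols, weights = [], [], []
    for i, row in enumerate(idx):
        for j in row.tolist():
            shared = len(neighbor_sets[i] & neighbor_sets[j])
            union = len(neighbor_sets[i] | neighbor_sets[j])
            if shared > 0:
                rows.append(i)
                cols.append(j)
                weights.append(shared / union)
    return np.array(rows), np.array(cols), np.array(weights)
```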

Version 1.3

Proper support for Linux.

Running the Unit Tests

Unit tests for assessing the functionality of each module are included in the 'tests' directory. In addition to the dependencies required by PhenoGraph, to run these tests, you must first install the pytest module.

If your system uses Python >= 3.8.0, install pytest with:

pip install pytest #Python >= 3.8.0

Otherwise, install pytest with:

pip install pytest==6.0.2 #Python < 3.8.0

Once pytest is installed, navigate to the 'PhenoGraph/' directory and run with:

pytest

All tests should pass with no warnings.

Troubleshooting

Notebook freezes after several attempts of running PhenoGraph using Jupyter Notebook

Running PhenoGraph repeatedly from a Jupyter Notebook on macOS Catalina (but not Mojave) with Python 3.7.6 causes a hang: the notebook becomes unresponsive, even for a basic matrix of nearest neighbors. The issue could not be reproduced from the command-line Python interpreter, without Jupyter Notebook, on either Catalina or Mojave.

Plotting principal components with `matplotlib.pyplot.scatter` in Jupyter Notebook was found to trigger the freeze, after which PhenoGraph stays unresponsive until the kernel is restarted. Removing that line of code restores normal behavior, and the notebook stops hanging on repeated runs of PhenoGraph.

Architecture related error

When attempting to process a very large nearest neighbours graph, e.g. a 2000000 x 2000000 kNN graph matrix with 300 nearest neighbours, a struct.error is raised:
```
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
```
This issue was reported on Stack Overflow and is related to the use of multiprocessing while building the Jaccard object.

The struct.error has been fixed in Python >= 3.8.0.
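The arithmetic behind this limit can be checked directly. The sketch below assumes that the failing step packs a payload size as a signed 32-bit integer (which is what the 'i' format in the message suggests) and that each stored value takes 8 bytes; both are assumptions, and the helper names are hypothetical.

```python
import struct

INT32_MAX = 2**31 - 1  # upper bound quoted in the error message


def fits_int32_header(n_bytes):
    """Return True if n_bytes can be packed with the signed 32-bit 'i' format."""
    try:
        struct.pack("!i", n_bytes)
    except struct.error:
        return False
    return True


def dense_payload_bytes(n_rows, k, bytes_per_value=8):
    """Rough size of an n_rows x k array (bytes_per_value=8 is an assumption)."""
    return n_rows * k * bytes_per_value
```

With 2000000 rows and 300 neighbours, this estimate gives 4.8 billion bytes for a single array, well above the 2147483647 limit.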

`leidenalg` inside conda environment

When using PhenoGraph inside a conda environment, Leiden takes longer to complete for larger samples than it does with the system Python.

`numpy.ufunc` runtime warning when running pytest

When running unit tests, pytest may emit the following warning: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. This is caused by an incompatibility between newer versions of the pytest module and an older version of Python. Check your Python version with:
```
python --version
```
If using Python >= 3.8.0, all versions of pytest are compatible. If using Python < 3.8.0, downgrade to pytest version 6.0.2 (see above).
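The version rule can also be expressed in a few lines of Python, for example in a setup or CI script. The function name pytest_requirement is hypothetical; the version threshold and the pinned pytest version are the ones stated in this README.

```python
import sys


def pytest_requirement(version_info=sys.version_info):
    """Return the pip requirement string for pytest given a Python version.

    Hypothetical helper: unpinned pytest for Python >= 3.8.0, otherwise 6.0.2.
    """
    if tuple(version_info[:3]) >= (3, 8, 0):
        return "pytest"
    return "pytest==6.0.2"
```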

Comments

Added tests and updated requirements

requirements.txt now includes update to scipy and leidenalg

new tests folder added with 7 unit tests for functions which rely on scipy/leidenalg

cluster function in cluster.py refactored to separate community detection functions for testing

Note: Maybe we should create a "version 1.5.7" branch as mentioned?

opened by andrewmoorman 6
Issue with multiprocessing

Hello, I have installed PhenoGraph 1.5.7 together with the required libraries. PhenoGraph breaks when trying to run the example notebook not "line by line" but in full (have a try with this: PhenoGraph_demo_random.zip). Indeed, PhenoGraph crashes during internal execution of sort_by_size function, starting an infinite loop in which 4 identical plots are created.

When I set n_jobs = 4 or I omit that parameter, the execution crashes during Neighbors computation:

Instead, if I set n_jobs = 1, the execution stops at sort_by_size function:

In both cases the error, related to multiprocessing, is the following:

How could we solve that?

opened by giacomos97 5

TypeError: Expected list, got numpy.ndarray

Numpy version is 1.18.5. Running phenograph.cluster(pca_df), where pca_df is either a np.ndarray or pd.DataFrame, gives the following error

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-127-8f4599a5b0ca> in <module>
      2 
      3 # Cluster and cluster centrolds
----> 4 communities, graph, Q = phenograph.cluster(pca_df.values)
      5 communities = pd.Series(communities, index=pca_df.index)

~/software/miniconda3/lib/python3.8/site-packages/phenograph/cluster.py in cluster(data, clustering_algo, k, directed, prune, min_cluster_size, jaccard, primary_metric, n_jobs, q_tol, louvain_time_limit, nn_method, partition_type, resolution_parameter, n_iterations, use_weights, seed, **kargs)
    243     else:
    244         del d
--> 245         graph = neighbor_graph(kernel, kernelargs)
    246         print(
    247             "Jaccard graph constructed in {} seconds".format(time.time() - subtic),

~/software/miniconda3/lib/python3.8/site-packages/phenograph/core.py in neighbor_graph(kernel, kernelargs)
     82     :return graph: n-by-n COO sparse matrix
     83     """
---> 84     i, j, s = kernel(**kernelargs)
     85     n, k = kernelargs["idx"].shape
     86     graph = sp.coo_matrix((s, (i, j)), shape=(n, n))

~/software/miniconda3/lib/python3.8/site-packages/phenograph/core.py in parallel_jaccard_kernel(idx)
    152         graph.data[i] = tup[1]
    153 
--> 154     i, j = graph.nonzero()
    155     s = graph.tocoo().data
    156     return i, j, s[s > 0]

~/software/miniconda3/lib/python3.8/site-packages/scipy/sparse/base.py in nonzero(self)
    774 
    775         # convert to COOrdinate format
--> 776         A = self.tocoo()
    777         nz_mask = A.data != 0
    778         return (A.row[nz_mask], A.col[nz_mask])

~/software/miniconda3/lib/python3.8/site-packages/scipy/sparse/base.py in tocoo(self, copy)
    904         the resultant coo_matrix.
    905         """
--> 906         return self.tocsr(copy=False).tocoo(copy=copy)
    907 
    908     def tolil(self, copy=False):

~/software/miniconda3/lib/python3.8/site-packages/scipy/sparse/lil.py in tocsr(self, copy)
    475         indices = np.empty(nnz, dtype=idx_dtype)
    476         data = np.empty(nnz, dtype=self.dtype)
--> 477         _csparsetools.lil_flatten_to_array(self.rows, indices)
    478         _csparsetools.lil_flatten_to_array(self.data, data)
    479 

_csparsetools.pyx in scipy.sparse._csparsetools.lil_flatten_to_array()

_csparsetools.pyx in scipy.sparse._csparsetools._lil_flatten_to_array_int32()

TypeError: Expected list, got numpy.ndarray

bug

opened by vincent6liu 5

Incompatibility scipy > 1.4.2

PhenoGraph seems to break down with the newest versions of scipy (1.5 and 1.6). When running the algorithm, the following error occurs:

TypeError: Expected list, got numpy.ndarray

Full error code can be found here: https://github.com/theislab/scanpy/issues/1407

Any ideas or updates on how to resolve this?

opened by prubbens 3
Best k value to use

Hi, I have a question about the parameter k, the number of nearest neighbors to use. Is there a way to choose the best k value, for example by comparing the modularity for each k and choosing based on some criterion? Thanks.

opened by dalide 2
Latest version PG 1.5.7 does not work with latest requirement Leiden 0.8.2

I updated PhenoGraph to 1.5.7 in order to work with the latest scipy version, which is required by the latest scanpy. Leiden >= 0.8.2 was installed as part of PhenoGraph's requirements. However, after testing, I believe some change in the latest Leiden leads to PhenoGraph failure.

Downgrading leiden to 0.8.1 works.

opened by YubinXie 2
no executable permission of Louvain/linux-* files

Hi there!

Just upgraded my PhenoGraph with pip install -U PhenoGraph, and after that I got an error when running PhenoGraph, stating that permission was denied to my [path_to_env]lib/python3.7/site-packages/phenograph/louvain/linux-convert file. Same for the other two linux- files in that folder. After changing permissions to those three files with chmod +x, the error was solved. Not sure if it was a glitch in my system or a problem with the PhenoGraph package, but maybe you guys could check :)
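The chmod +x workaround can also be applied from Python. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the location of the binaries (a louvain folder inside the installed phenograph package) is taken from the path quoted in this report.

```python
import stat
from pathlib import Path


def make_executable(directory, pattern="linux-*"):
    """Add execute permission to files in `directory` matching `pattern`.

    Hypothetical helper; returns the names of the files that were updated.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        mode = path.stat().st_mode
        path.chmod(mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        changed.append(path.name)
    return changed


def fix_phenograph_binaries():
    """Apply make_executable to the installed package's louvain folder."""
    import phenograph  # imported lazily; folder layout assumed from the report

    return make_executable(Path(phenograph.__file__).parent / "louvain")
```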

opened by LisaSikkema 2
Fixed the type error from upgraded Scipy (> 1.4.1)

Fixed the type error that results from upgrading to Scipy > 1.4.1. The error came from the way the sparse lil graph was constructed: its rows were filled with ndarray values, whereas scipy requires lists.
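The nature of the fix can be illustrated with a small sketch that fills a lil_matrix row by row using plain Python lists. The function name lil_from_neighbors is hypothetical and the code is not taken from the actual patch; it only shows the list-based construction that newer scipy versions accept.

```python
import numpy as np
from scipy import sparse


def lil_from_neighbors(idx, weights):
    """Build an N x N lil_matrix from (N, k) neighbour indices and weights.

    Hypothetical sketch: each row is stored as Python lists (not ndarrays),
    sorted by column index, so later format conversions succeed.
    """
    idx = np.asarray(idx)
    weights = np.asarray(weights, dtype=float)
    n = idx.shape[0]
    graph = sparse.lil_matrix((n, n), dtype=float)
    for i in range(n):
        order = np.argsort(idx[i])
        graph.rows[i] = idx[i][order].tolist()  # list, not ndarray
        graph.data[i] = weights[i][order].tolist()
    return graph
```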

opened by andrewmoorman 2
Running PhenoGraph from Jupyter Notebook sometimes hangs
Environment:

macOS Catalina

Python 3.7.6

Jupyter Notebook

matplotlib

How to Reproduce:

Create a new Jupyter Notebook

Draw a plot using matplotlib (e.g. scatter plot)

Run PhenoGraph multiple times

Tentative Solution:

Unfortunately, there is no proper solution to this problem yet. We will revisit this once the following issue is resolved on the matplotlib side: https://github.com/matplotlib/matplotlib/issues/15410

In the meantime, you can avoid this problem by doing one of the following:

Do not upgrade to macOS Catalina. Also, it seems that the problem doesn't happen to Linux users (e.g. HPC, AWS, GCP)

Upgrade to Python 3.8.2.

Set n_jobs to 1. (Note that this disables multiprocessing, so PhenoGraph will run slowly on a huge matrix, but the performance difference is usually negligible if you work with relatively small matrices.)

Use a different backend engine for matplotlib (e.g. no inline plotting)

If possible, run PhenoGraph first before importing the matplotlib library.
opened by hisplan 2
Assign new points to defined clusters

Hi,

I often work with large amounts of data (i.e., number of samples >>> number of features). So, I conduct clustering analysis on a selected subset to determine the desired number of clusters. Then, I query the remaining data to assign unseen points to their respective clusters.

I was wondering if the graph produced by the PhenoGraph package can be used to mimic the process outlined above. If so, could you provide guidelines on how to accomplish this?

Many thanks,

Ivan

opened by ivan-marroquin 0
`phenograph.cluster()` function doesn't seem to work correctly on Windows Subsystem for Linux

Hi, I ran the PhenoGraph tutorial "tutorial_pbmc3k.ipynb", but the phenograph.cluster() function can't successfully calculate the communities: all the communities are zeros, with no error or warning. Is there something wrong with this function?

Versions: IPython 7.20.0, jupyter_client 6.1.11, jupyter_core 4.7.1, jupyterlab 3.0.7, notebook 6.2.0, scanpy==1.6.1, anndata==0.7.5, umap==0.4.6, numpy==1.19.2, scipy==1.4.1, pandas==1.2.1, scikit-learn==0.24.1, statsmodels==0.12.1, python-igraph==0.8.3, louvain==0.7.0, leidenalg==0.8.1

Thanks Dan

opened by danli349 12

Releases(v1.5.7)

v1.5.7(Oct 5, 2020)
Revised parallel Jaccard to support scipy==1.5.1 (previously failed with scipy=1.4.1 / https://github.com/dpeerlab/PhenoGraph/issues/4)

Updated leidenalg and scipy version requirements.

Added PhenoGraph clustering tutorial with PBMC3K dataset from 10X Genomics.

Created a test collection for use with pytest.

v1.5.6(May 3, 2020)
Fix the multiprocessing code that doesn't close/join the pool.

v1.5.5(May 1, 2020)

v1.5.4(Apr 29, 2020)

Fix to sort_by_size: faster and more efficient sorting by size of clusters
v1.5.3(Feb 28, 2020)

PhenoGraph now supports the Leiden algorithm for community detection. The new feature can be called from phenograph.cluster by choosing leiden as the clustering algorithm. A few adjustments to the code, including black formatting.
v1.5.2(Sep 19, 2019)
Include simple parallel implementation of brute force nearest neighbors search using scipy's cdist and multiprocessing. This may be more efficient than kdtree on very large high-dimensional data sets and avoids memory issues that arise in sklearn's implementation.

Refactor parallel_jaccard_kernel to remove unnecessary use of ctypes and multiprocessing.Array.


Owner

Dana Pe'er Lab

Computational Biology @ Memorial Sloan Kettering Cancer Center

