Single-Cell Analysis in Python. Scales to >1M cells.

Overview


Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
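
As a quick orientation, here is a minimal sketch of such a workflow using standard Scanpy calls (parameter values are illustrative only, and sc.tl.leiden requires the optional leidenalg dependency):

    import scanpy as sc

    adata = sc.datasets.pbmc3k()                      # small example 10x dataset
    sc.pp.filter_cells(adata, min_genes=200)          # basic QC filtering
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.normalize_total(adata, target_sum=1e4)      # depth normalization
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)                            # kNN graph used by UMAP and clustering
    sc.tl.umap(adata)
    sc.tl.leiden(adata)                               # needs the optional leidenalg package
    sc.pl.umap(adata, color='leiden')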

Discuss usage on Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contributing guide. If Scanpy is useful for your research, consider citing Genome Biology (2018).

Comments
  • Normalization and gene selection by analytical Pearson residuals

    Hi everyone,

    A while back, @giovp asked me to create a pull request that integrates analytical Pearson residuals into scanpy. We already discussed with @LuckyMD and @ivirshup over at berenslab/umi-normalization#1 how to structure it, and now I have a version that should be ready for review.

    As discussed earlier, this pull request implements two core methods:

    • sc.pp.normalize_pearson_residuals(), which applies the method to adata.X. Overall, the function is very similar in structure to sc.pp.normalize_total() (support for layers, in-place operation, etc.).
    • sc.pp.highly_variable_genes(flavor='pearson_residuals'), which selects genes based on Pearson residual variance. The "inner" function _highly_variable_pearson_residuals() is structured similarly to _highly_variable_seurat_v3() (support for multiple batches, median ranks for tie breaking). It includes the chunksize argument to allow for memory-efficient computation of the residual variance.
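
    A minimal usage sketch of these two core methods, using the names and the flavor described above (the signatures are as proposed in this PR and may still change during review; n_top_genes is an assumed parameter here):

    import scanpy as sc

    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X.copy()   # keep the raw counts around

    # gene selection based on Pearson residual variance
    sc.pp.highly_variable_genes(adata, flavor='pearson_residuals', n_top_genes=2000)

    # normalize adata.X with analytical Pearson residuals
    sc.pp.normalize_pearson_residuals(adata)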

    We discussed quite a lot how to implement a third function that would bundle gene selection, normalization by analytical residuals and PCA. This PR includes the two options that emerged at the end of that discussion, so now we have to choose ;)

    • sc.pp.recipe_pearson_residuals(), which performs both HVG selection and normalization via Pearson residuals prior to PCA
    • sc.pp.normalize_pearson_residuals_pca(), which applies an existing HVG selection if the user previously added one to the adata object, then normalizes via Pearson residuals and runs PCA.

    Both functions retain the raw input counts as adata.X and add fields for PCA/Normalization/HVG selection results (or return them) as applicable, most importantly the X_pca in adata.obsm['pearson_residuals_X_pca']. I hope this addresses some of the issues we discussed over at the other repo in a scanpy-y way.
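
    For illustration, option A might then be used like this (a sketch based on the description above; the keyword arguments n_top_genes and n_comps are assumptions, not necessarily the PR's exact signature):

    # HVG selection + normalization via Pearson residuals, followed by PCA;
    # the raw counts stay in adata.X
    sc.pp.recipe_pearson_residuals(adata, n_top_genes=2000, n_comps=50)

    X_pca = adata.obsm['pearson_residuals_X_pca']   # embedding field described above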

    Let me know what you think and where you think improvements are needed!

    Cheers, Jan.

    Enhancement ✨ 
    opened by jlause 64
  • PCA for sparse data (v2)

    I know this (quite ancient) pull request (#403) has been open, but I wasn't sure of its status. I think the consensus was to wait for sklearn to integrate the necessary changes? If that's still the case, then please feel free to remove this PR.

    Here I make use of scipy's extremely nifty LinearOperator class to customize the dot product functions for an input sparse matrix. In this case, the 'custom' dot product performs implicit mean centering.

    In my benchmarks, performing implicit mean centering in this way does not affect the runtime whatsoever. However, this approach has to use svds, for which randomized SVD is not implemented, so we have to use 'arpack', which can be significantly slower (but not intractably so: in my hands, I could still do PCA on datasets of 200k+ cells in minutes, and it sure beats densifying the data). If you want more thorough benchmarks, I am happy to generate them!
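
    For readers unfamiliar with the trick, here is a small self-contained sketch of the idea (my own illustration, not the PR's code; the function name is made up):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, svds

    def pca_implicit_centering(X, n_comps=50):
        """PCA of a sparse matrix without densifying it: the mean centering is
        folded into the matrix-vector products of a LinearOperator."""
        mu = np.asarray(X.mean(axis=0)).ravel()   # per-gene means

        def matvec(v):            # (X - 1 mu^T) @ v
            v = np.ravel(v)
            return X @ v - mu @ v

        def rmatvec(v):           # (X - 1 mu^T).T @ v
            v = np.ravel(v)
            return X.T @ v - mu * v.sum()

        Xc = LinearOperator(shape=X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
        u, s, vt = svds(Xc, k=n_comps)            # ARPACK; no randomized solver available here
        order = np.argsort(-s)                    # svds returns singular values in ascending order
        u, s, vt = u[:, order], s[order], vt[order]
        return u * s, vt                          # cell embedding (X_pca) and component loadings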

    The way I incorporated this functionality into scanpy/preprocessing/_simple.py might be questionable, and I would love any suggestions or advice on how to better integrate it if there is interest in pushing this PR through. Let me know!

    opened by atarashansky 38
  • Type annotation considerations

    @falexwolf wrote:

    Wow, this looks great! One remark for future PRs: We’re migrating to a new documentation style using type annotations

    I'm still not convinced that we should use type annotations for Scanpy's top-level functions. People use Scanpy in Jupyter Lab and notebooks, not in PyCharm. Hence, there is no gain from the annotations; by contrast, the annotated signatures simply look super complicated, and it's no longer feasible to grasp at first sight what's going on. The same goes for the output of help in Jupyter Lab and notebooks.

    So, while I think that type annotations may make sense for AnnData and everything in the background for a few developers (not for me, as I'm doing everything on remote servers using emacs), they don't make sense for the Scanpy user.

    Also, all the other big packages I work with all the time simply don't have them (numpy, seaborn, pandas, tensorflow), and they make it harder and lengthier for contributors to contribute if they need to go through them.

    Finally, I'm still not happy about how the automatically generated docs from the type annotations look (screenshots omitted here): the automatically generated line with Union[...] is just way too complicated for a human to make sense of. The mix of auto-generated types and manual annotations in the docs also looks inhomogeneous.

    So, please let's stay away from having more type annotations and corresponding docstrings at this stage and let's simply continue imitating what all the major packages are doing.

    Also, regarding your comment about the use of ``…`` vs. `…` in the docs: again, I think it leads to an inhomogeneous appearance to have *two* types of markup for code-related things. I agree that the read-the-docs italicized default style for `…` might be suboptimal, and I'll work on that if there is some time. But in general, I think there should be essentially one markup for code, as it's done in the tensorflow docs and a couple of other examples.

    Happy to also discuss offline, @flying-sheep ;).

    Question 
    opened by flying-sheep 38
  • Switch to flit

    What stays the same:

    • pip install scanpy
    • pip install .
    • pip install git+https://...
    • you can install your deps with conda
    • you can do a dev install

    What changes:

    • Please check the install docs, in short:
      • pip install -e .[every,single,extra] → flit install -s for dev installs
      • beni pyproject.toml > environment.yml for conda
    • Extremely simple flit build and flit publish. Maybe install keyring to store your publish password, and that's everything you need to know.
    • flit build doesn’t clutter your dev directory with build/ and *.egg-info/ junk, it just creates dist/scanpy-*{.whl,.tar.gz}.
    • No more obscure stuff nobody understands (MANIFEST.in, package_data, …)
    • Centralized setup configuration in pyproject.toml instead of spread over multiple files
    opened by flying-sheep 35
  • Simpler scatter functions

    I tried to collect in one file the code used by plotting functions that are based on matplotlib scatter, like sc.pl.tsne, sc.pl.pca, sc.pl.umap and others.

    Also, I tried to annotate the code and improve the readability.

    Currently, the code is in a separate file called scatter.py and not integrated into the API, as this facilitates comparison with the previous code.

    Besides readability the proposed code can:

    • Plot a large number of plots in multiple columns (instead of one long row of plots)
    • Pass arguments directly to matplotlib.pyplot.scatter, like vmax and vmin, to adjust the color scale (when plotting multiple panels, this is useful to keep a consistent range of values); see the extra example below

    See cells 15 and 15 in this example: https://gist.github.com/fidelram/8b43f786e7519bcfb7ffc0d5ccdbb0fe
    If the admins would like to merge these changes, I can replace the previous functions.

    An example on how to use the code:

    import scanpy.plotting.tools.scatter as spl
    spl.tsne(adata, color='louvain')
    


    Further examples here
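
    And a hedged sketch of the multi-column / color-scale options mentioned above (the parameter names ncols, vmin and vmax follow current scanpy embedding plots; the PR's scatter.py may name them differently):

    # two gene panels laid out in columns, sharing one color scale
    spl.tsne(adata, color=['CST3', 'NKG7'], ncols=2, vmin=0, vmax=5)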

    opened by fidelram 34
  • Ingest class

    As discussed with @falexwolf this PR introduces a new Ingest class to process new small pieces of data.

    sc.pp.neighbors(adata)  # adata is huge, with 1M observations

    # Ingest represents the existing data, its learned annotations and structure,
    # and exposes functionality that allows ingesting new data very quickly.
    ingest = sc.Ingest(adata)

    adata_small.obsm['X_model'] = model(adata_small.X)

    ingest.neighbors(adata_small)  # adata_small has just 1000 observations

    # Now we have the updated neighbors graph with 1,001,000 observations and
    # want to do the same things as always.

    # Map the new data into the embedding (umap) by leveraging the neighbors of
    # the new data within the old data, computing just a correction to the
    # existing embedding: a new data point gets the mean position of its
    # k nearest neighbors.
    ingest.umap(adata_small)

    # Update the clustering (mapping the 1000 observations into the existing
    # clusters): a new data point maps into a cluster if the majority of its
    # neighbors are members of that cluster.
    ingest.louvain(adata_small)

    opened by Koncopd 29
  • Adding Moran's I calculation to Scanpy

    Could you add a Moran's I calculation to Scanpy? It could be used in scIB and also to find variable genes across an embedding (it could be an alternative to SEMITONES, which takes a while to compute).
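
    For reference, Moran's I over a neighbors graph is cheap to compute; a minimal sketch (my own illustration, not an existing scanpy function):

    import numpy as np

    def morans_i(W, x):
        """Moran's I of one feature x over a (sparse) weight/connectivity matrix W."""
        x = np.asarray(x, dtype=float)
        xc = x - x.mean()
        num = xc @ (W @ xc)         # sum_ij w_ij (x_i - mean)(x_j - mean)
        denom = (xc * xc).sum()
        s0 = W.sum()                # total edge weight
        return (len(x) / s0) * (num / denom)

    # e.g. after sc.pp.neighbors(adata):
    #   W = adata.obsp['connectivities']
    #   score = morans_i(W, adata[:, 'GENE'].X.toarray().ravel())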

    Enhancement ✨ 
    opened by Hrovatin 28
  • Leiden restrict_to parameter

    Added a restrict_to parameter to leiden, using the louvain code as a template. Tests are not yet provided.

    A simple example of execution and checks:

    # First split on cluster 4
    sc.tl.leiden(adata, restrict_to=('leiden_res0.4', ['4']), resolution=0.6,
        key_added='leiden_res0.4_4_sub')
    
    # Additional split
    sc.tl.leiden(adata, restrict_to=('leiden_res0.4_4_sub', ['1', '2', '3', '4,4']),
        resolution=0.6, key_added='leiden_res0.4_4_add_sub')
    
    # All partitions together
    sc.pl.tsne(adata, color=['leiden_res0.4', 'leiden_res0.4_4_sub',
        'leiden_res0.4_4_add_sub'])
    
    # Partition size check
    ## Original size of clusters
    adata.obs['leiden_res0.4'].value_counts()
    0    932
    1    853
    3    676
    2    676
    4    338
    5     57
    Name: leiden_res0.4, dtype: int64
    
    # Check if first split is correct (can be iterated for subsequent splits)
    ## Assignment of samples in original clusters to subsplit clusters
    adata.obs.loc[(adata.obs['leiden_res0.4'].isin(['4'])),
        'leiden_res0.4_4_sub'].value_counts()
    4,0    103
    4,1     68
    4,2     66
    4,3     57
    4,4     44
    5        0
    3        0
    2        0
    1        0
    0        0
    Name: leiden_res0.4_4_sub, dtype: int64
    ## Assignment of samples not in original clusters to subsplit clusters
    adata.obs.loc[~(adata.obs['leiden_res0.4'].isin(['4'])),
        'leiden_res0.4_4_sub'].value_counts()
    0      932
    1      853
    3      676
    2      676
    5       57
    4,4      0
    4,3      0
    4,2      0
    4,1      0
    4,0      0
    Name: leiden_res0.4_4_sub, dtype: int64
    
    ...
    


    opened by fbrundu 28
  • Clustering with leidenalg

    Hello,

    It would appear that louvain-igraph has been obsoleted in favour of leidenalg, and the author makes a persuasive case as to the superiority of the new approach. To my untrained eye, the algorithm is conceptually similar to the Louvain modification used by Seurat, but introduces an extra collapsed network refinement step.

    It should be easy to support this in Scanpy - the syntax appears to be identical to the old louvain innards, and I was able to construct a very minimal dummy function for testing by taking the key bits of sc.tl.louvain() and replacing louvain. with leidenalg.:

    import leidenalg
    import numpy as np
    import pandas as pd
    from scanpy import utils
    from natsort import natsorted
    
    def leiden(adata, use_weights=False, resolution=1, iterations=-1):
    	g = utils.get_igraph_from_adjacency(adata.uns['neighbors']['connectivities'], directed=True)
    	weights = None
    	if use_weights:
    		weights = np.array(g.es["weight"]).astype(np.float64)
    	part = leidenalg.find_partition(
    		g, leidenalg.RBConfigurationVertexPartition, 
    		resolution_parameter = resolution, weights = weights, 
    		n_iterations = iterations,
    	)
    	groups = np.array(part.membership)
    	adata.obs['louvain'] = pd.Categorical(
    		values=groups.astype('U'),
    		categories=natsorted(np.unique(groups).astype('U')),
    	)
    

    As such, replacing any louvain. with leidenalg. in sc.tl.louvain() would do most of the work. Probably the only new thing that would need support would be the n_iterations parameter in leidenalg.find_partition(). The default value is 2; positive values control how many passes of the algorithm are performed, and -1 just makes it run until it fails to improve the clustering.

    opened by ktpolanski 28
  • Where should `sc.datasets` put data?

    I'm adding that expression atlas downloader now (#489), and wondering where the files should go.

    pbmc68k_reduced and toggleswitch put the datasets relative to where scanpy is installed (via __file__). All other functions place the data relative to where the python process was started.
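
    For the downloading functions, the target location is controlled by sc.settings.datasetdir (as far as I can tell it defaults to a ./data/ directory under the working directory), so one way to avoid scattering copies is to point it at a single cache:

    import scanpy as sc
    from pathlib import Path

    # send all sc.datasets downloads to one fixed location instead of the cwd
    sc.settings.datasetdir = Path.home() / '.cache' / 'scanpy-data'

    adata = sc.datasets.pbmc3k()   # downloaded into (and re-used from) that directory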

    While I like not storing the same files all over a filesystem, I'm not sure the scanpy installation directory is the right place to be storing data.

    Thoughts?

    opened by ivirshup 26
  • Export expression matrix from h5ad

    Hi,

    I would like to extract the expression matrix (genes and counts) from an h5ad file. How can I do this? I have searched the documentation but couldn't find anything about this (maybe I missed it).
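
    One way to do this with the anndata API (the file names below are placeholders):

    import scanpy as sc

    adata = sc.read_h5ad('yourfile.h5ad')
    # adata.to_df() returns a cells-by-genes pandas DataFrame built from adata.X
    adata.to_df().to_csv('expression_matrix.csv')

    # if a raw (pre-filtering/pre-normalization) matrix was stored in .raw:
    # adata.raw.to_adata().to_df().to_csv('raw_expression_matrix.csv')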

    opened by cartal 26
• Multithreading for scanpy.tl.rank_genes_groups?

    • [x] Additional function parameters / changed functionality / changed defaults?

    Hi Scanpy team,

    I emailed @ivirshup but others should be involved I think.

    This function would be useful if we could specify the number of threads to use: https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html

    Based on the number of items in the "groupby" field, we could use a basic split-merge approach here: each thread takes several of these items (the calculations are entirely independent of one another), and when each is done we join and concatenate the results.
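
    A rough sketch of that split-merge idea with joblib (the groupby column 'cell_type' is hypothetical, and shipping the whole AnnData to every worker is wasteful; a real implementation inside scanpy would share the matrix instead):

    import scanpy as sc
    from joblib import Parallel, delayed

    def rank_one_group(adata, group, groupby='cell_type'):
        # run the test for a single group vs. rest on a copy, then pull the result table
        res = sc.tl.rank_genes_groups(
            adata, groupby=groupby, groups=[group], method='wilcoxon', copy=True
        )
        return group, sc.get.rank_genes_groups_df(res, group=group)

    groups = adata.obs['cell_type'].unique()
    # each group is independent, so the per-group tests run in separate workers
    # and the result tables are simply collected (joined) at the end
    results = dict(Parallel(n_jobs=4)(delayed(rank_one_group)(adata, g) for g in groups))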

    I'm happy to help write up a PR (or participate), but I'd like to hear whether this is something you'd be willing to prioritize. (It's related to a project where Fabian is the PI.)

    Best, Evan

    opened by evanbiederstedt 0
  • normalize_total affects layers

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

    Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

    Hi everyone: many users probably do not rely on pp.normalize_total for downstream analysis, but I found a strange default behavior that I think is worth mentioning. pp.normalize_total() normalized my .layers['counts'] as well. The documentation is a bit murky; I'm not sure whether that is the expected behavior when layer is unspecified, but such a default would undermine anyone who wishes to save the count information before normalization.

    Minimal code sample (that we can copy&paste without having any data)

    # Your code here
    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X
    cell = adata.obs.index[1]
    adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    print("Run 1: initial values after simple processing: ")
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum())
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    print("\nRun 2: after sc.pp.normalize_total: ")
    sc.pp.normalize_total(adata, target_sum=1e4)
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum()) # Note that this changed too
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X
    cell = adata.obs.index[1]
    adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    print("\nRun 3: normalization, specifing argument layer=None")
    sc.pp.normalize_total(adata, target_sum=1e4, layer = None)
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum())
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    #Output:
    Run 1: initial values after simple processing: 
    sum of count layer in designated cell:  4903.0
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  4903.0
    sum of count layer of MALAT1 in cell:    (0, 0)	142.0
    .X value of MALAT1 in cell:    (0, 0)	142.0
    
    Run 2: after sc.pp.normalize_total: 
    normalizing counts per cell
        finished (0:00:00)
    sum of count layer in designated cell:  10000.049
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  10000.049
    sum of count layer of MALAT1 in cell:    (0, 0)	289.61862
    .X value of MALAT1 in cell:    (0, 0)	289.61862
    
    Run 3: normalization, specifing argument layer=None
    normalizing counts per cell
        finished (0:00:00)
    sum of count layer in designated cell:  10000.049
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  10000.049
    sum of count layer of MALAT1 in cell:    (0, 0)	289.61862
    .X value of MALAT1 in cell:    (0, 0)	289.61862
    

    Versions


    anndata 0.8.0 scanpy 1.9.1

    PIL 9.2.0 anndata2ri 1.1 annoy NA backcall 0.2.0 backports NA bbknn NA beta_ufunc NA binom_ufunc NA cffi 1.15.1 cloudpickle 2.2.0 colorama 0.4.6 cycler 0.10.0 cython_runtime NA cytoolz 0.12.0 dask 2022.02.0 dateutil 2.8.2 debugpy 1.6.3 decorator 5.1.1 defusedxml 0.7.1 deprecated 1.2.13 entrypoints 0.4 fsspec 2022.11.0 future_fstrings NA google NA h5py 3.7.0 igraph 0.9.1 ipykernel 6.14.0 ipython_genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 jinja2 3.1.2 joblib 1.2.0 kiwisolver 1.4.4 leidenalg 0.8.10 llvmlite 0.39.1 louvain 0.7.2 markupsafe 2.1.1 matplotlib 3.5.3 matplotlib_inline 0.1.6 mpl_toolkits NA natsort 8.2.0 nbinom_ufunc NA numba 0.56.3 numpy 1.21.6 packaging 21.3 pandas 1.3.5 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.31 psutil 5.9.3 ptyprocess 0.7.0 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.8.0 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.13.0 pynndescent 0.5.7 pyparsing 3.0.9 pytz 2022.5 pytz_deprecation_shim NA rpy2 3.5.1 scib 1.0.4 scipy 1.7.3 seaborn 0.12.1 session_info 1.0.0 six 1.16.0 sklearn 1.0.2 statsmodels 0.13.2 storemagic NA texttable 1.6.4 threadpoolctl 3.1.0 tlz 0.12.0 toolz 0.12.0 tornado 6.2 tqdm 4.64.1 traitlets 5.5.0 typing_extensions NA tzlocal NA umap 0.5.3 wcwidth 0.2.5 wrapt 1.14.1 yaml 6.0 zipp NA zmq 24.0.1 zope NA

    IPython 7.33.0 jupyter_client 7.4.4 jupyter_core 4.11.1 notebook 6.5.1

    Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0] Linux-5.4.0-131-generic-x86_64-with-debian-buster-sid

    Session information updated at 2022-12-28 13:52

    opened by erieslee 1
  • `sc.external.pp.magic`: what is `n_jobs`'s default _really_?

    scanpy docs:

    n_jobs : Optional[int] (default: None) Number of threads to use in training. All cores are used by default.

    magic docs:

    n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.

    scanpy code:

    https://github.com/scverse/scanpy/blob/536ed15bc73ab5d1131c0d530dd9d4f2dc9aee36/scanpy/external/pp/_magic.py#L164

    https://github.com/scverse/scanpy/blob/536ed15bc73ab5d1131c0d530dd9d4f2dc9aee36/scanpy/_settings.py#L82

    I'm guessing the scanpy docs are wrong when they say "All cores are used by default"?

    opened by alexlenail 0
  • sc.tl.dendrogram 'var_names' -parameter bug fix

    Bug: the var_names parameter of the sc.tl.dendrogram function is not used properly.

    Currently:

    • Hierarchical clustering is calculated on all genes, even when var_names is not None

    Fix:

    • The subset of genes defined by var_names is now used

    In addition:

    • When all values in some row of rep_df (or mean_df) are equal, df.T.corr() is not defined for that row, resulting in NaNs in the correlation matrix.
    • This is quite common with a subset of genes var_names, e.g. genes that are 0 in all cells (the cells have already passed quality control/filtering at this point of the downstream analysis).
    • This throws an error in distance.squareform(1-corr_matrix): ValueError: Distance matrix 'X' must be symmetric.
    • Fix: in this case, add a 'dummy' feature (rep_df["dummy"] = -1) to make sure that at least one feature in each row is distinct; see the sketch after this list.
    • Note that this addition affects (increases) the correlation between rows. However, it affects all rows equally, so the hierarchy stays as-is.
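
    A toy reproduction of the NaN problem and the dummy-feature workaround described above (made-up numbers, just to illustrate the mechanism):

    import pandas as pd

    rep_df = pd.DataFrame(
        {'geneA': [0.0, 1.0, 2.0], 'geneB': [0.0, 2.0, 1.0], 'geneC': [0.0, 3.0, 0.5]},
        index=['cluster0', 'cluster1', 'cluster2'],
    )
    # cluster0 is constant (all zeros), so its Pearson correlation is undefined
    print(rep_df.T.corr())        # row/column 'cluster0' is NaN

    rep_df['dummy'] = -1          # the workaround: no row is constant anymore
    print(rep_df.T.corr())        # fully defined, symmetric correlation matrix
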
    opened by lutrarutra 1
  • write anndata failed, pearson_residuals_df header message is too large

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

    Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

    Minimal code sample (that we can copy&paste without having any data)

    Write any AnnData object with Pearson residuals in .uns:

    ad_all.write(filename='output/10x_h5/ad_all_2cello.h5ad')
    

    The pearson_residuals_df looks like this, with 38291 rows (obs) and 5000 columns (features):

    {'theta': 100,
     'clip': None,
     'computed_on': 'adata.X',
     'pearson_residuals_df': gene_name                             A2M  AADACL2-AS1      AAK1     ABCA1  \
     barcode                                                                      
     GAACGTTCACACCGAC-1-placenta_81  -1.125285    -1.159130 -3.921314 -2.533474   
     TATACCTGTTAGCTAC-1-placenta_81  -1.091364     3.267127 -1.806667 -2.109586   
     CTCAAGAGTGACTGTT-1-placenta_81  -1.074943    12.272920 -1.948798 -2.735791   
     TTCATTGTCACGAACT-1-placenta_81  -1.098699    -1.131765  3.481171  4.472371   
     TATCAGGCAGCTCATA-1-placenta_81  -1.107734    -1.141064 -0.571775 -2.813671   
     ...                                   ...          ...       ...       ...   
     CACAACATCGGCGATC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
     AGCCAGCGTGCCCAGT-1-placenta_314 -0.097424    -0.100394 -0.366482 -0.256219   
     CCGGTGAGTGTTCGAT-1-placenta_314 -0.110334    -0.113696 -0.414971 -0.290148   
     AGGTCATAGCCTGACC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
     TTTATGCCAAAGGGTC-1-placenta_314 -0.112876    -0.116316 -0.424515 -0.296827 
    
    Unable to create attribute (object header message is too large)
    
    Above error raised while writing key 'pearson_residuals_df' of <class 'h5py._hl.group.Group'> to /
    

    Versions

    scanpy==1.9.1 anndata==0.8.0 umap==0.5.2 numpy==1.21.5 scipy==1.8.0 pandas==1.4.1 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.9 pynndescent==0.5.6

    opened by brainfo 5
  • adata.var.sort_index(inplace=True) does not sort adata.X

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [x] (optional) I have confirmed this bug exists on the master branch of scanpy.

    When sorting the adata object using adata.var.sort_index(inplace=True), adata.X is not sorted accordingly (see the example below).


    
    import numpy as np
    import pandas as pd
    import scanpy as sc
    import matplotlib.pyplot as plt
    sc.settings.verbosity = 3
    sc.logging.print_header()
    
    adata = sc.datasets.pbmc3k()
    adata.to_df()['MPO'].sum()            # MPO counts before sorting .var
    adata.var.sort_index(inplace=True)
    adata.to_df()['MPO'].sum()            # different value: adata.X was not reordered along with .var
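
    For reference, a reordering that keeps adata.X aligned can be done by indexing the AnnData object itself instead of sorting .var in place (a sketch, not necessarily the fix the maintainers would recommend):

    # indexing the AnnData reorders X, layers and var together
    adata = adata[:, adata.var_names.sort_values()].copy()
    adata.to_df()['MPO'].sum()            # now matches the value from before reordering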
    
    

    Versions


    anndata 0.8.0 scanpy 1.9.1

    PIL 9.2.0 asttokens NA backcall 0.2.0 beta_ufunc NA binom_ufunc NA cffi 1.15.1 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 entrypoints 0.4 executing 0.8.3 google NA h5py 3.7.0 hypergeom_ufunc NA igraph 0.9.11 ipykernel 6.17.1 ipython_genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 joblib 1.2.0 kiwisolver 1.4.4 leidenalg 0.8.10 llvmlite 0.39.1 loompy 3.0.7 matplotlib 3.6.0 matplotlib_inline NA mpl_toolkits NA natsort 8.2.0 nbinom_ufunc NA ncf_ufunc NA numba 0.56.2 numpy 1.23.3 numpy_groupies 0.9.19 packaging 21.3 pandas 1.4.4 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 2.6.0 prompt_toolkit 3.0.20 psutil 5.9.4 ptyprocess 0.7.0 pure_eval 0.2.2 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.11.2 pynndescent 0.5.7 pyparsing 3.0.9 pytz 2022.2.1 scipy 1.9.1 scvelo 0.2.4 session_info 1.0.0 setuptools_scm NA six 1.16.0 sklearn 1.1.2 stack_data 0.2.0 statsmodels 0.13.2 texttable 1.6.4 threadpoolctl 3.1.0 tornado 6.2 tqdm 4.64.1 traitlets 5.6.0 typing_extensions NA umap 0.5.3 wcwidth 0.2.5 yaml 5.4.1 zipp NA zmq 24.0.1

    IPython 8.4.0 jupyter_client 7.4.8 jupyter_core 5.1.0 notebook 6.5.2

    Python 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] Linux-5.4.0-1092-aws-x86_64-with-glibc2.27

    opened by sandrav-CGEN 0
Owner: Theis Lab, Institute of Computational Biology