Single-Cell Analysis in Python. Scales to >1M cells.

Overview


Scanpy – Single-Cell Analysis in Python

Scanpy is a scalable toolkit for analyzing single-cell gene expression data built jointly with anndata. It includes preprocessing, visualization, clustering, trajectory inference and differential expression testing. The Python-based implementation efficiently deals with datasets of more than one million cells.
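
As a quick orientation, here is a minimal sketch of such a workflow using standard Scanpy calls (parameter values are illustrative only, and sc.tl.leiden requires the optional leidenalg dependency):

    import scanpy as sc

    adata = sc.datasets.pbmc3k()                      # small example 10x dataset
    sc.pp.filter_cells(adata, min_genes=200)          # basic QC filtering
    sc.pp.filter_genes(adata, min_cells=3)
    sc.pp.normalize_total(adata, target_sum=1e4)      # depth normalization
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000)
    sc.pp.pca(adata)
    sc.pp.neighbors(adata)                            # kNN graph used by UMAP and clustering
    sc.tl.umap(adata)
    sc.tl.leiden(adata)                               # needs the optional leidenalg package
    sc.pl.umap(adata, color='leiden')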

Discuss usage on Discourse. Read the documentation. If you'd like to contribute by opening an issue or creating a pull request, please take a look at our contributing guide. If Scanpy is useful for your research, consider citing Genome Biology (2018).

Comments
  • Normalization and gene selection by analytical Pearson residuals

    Hi everyone,

    A while back, @giovp asked me to create a pull request that integrates analytical Pearson residuals into scanpy. We already discussed with @LuckyMD and @ivirshup over at berenslab/umi-normalization#1 how to structure it, and now I have a version that should be ready for review.

    As discussed earlier, this pull request implements two core methods:

    • sc.pp.normalize_pearson_residuals(), which applies the method to adata.X. Overall, the function is very similar in structure to sc.pp.normalize_total() (support for layers, in-place operation, etc.).
    • sc.pp.highly_variable_genes(flavor='pearson_residuals'), which selects genes based on Pearson residual variance. The "inner" function _highly_variable_pearson_residuals() is structured similarly to _highly_variable_seurat_v3() (support for multiple batches, median ranks for tie breaking). It includes the chunksize argument to allow for memory-efficient computation of the residual variance.
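
    A minimal usage sketch of these two core methods, using the names and the flavor described above (the signatures are as proposed in this PR and may still change during review; n_top_genes is an assumed parameter here):

    import scanpy as sc

    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X.copy()   # keep the raw counts around

    # gene selection based on Pearson residual variance
    sc.pp.highly_variable_genes(adata, flavor='pearson_residuals', n_top_genes=2000)

    # normalize adata.X with analytical Pearson residuals
    sc.pp.normalize_pearson_residuals(adata)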

    We discussed quite a lot how to implement a third function that would bundle gene selection, normalization by analytical residuals and PCA. This PR includes the two options that emerged at the end of that discussion, so now we have to choose ;)

    • sc.pp.recipe_pearson_residuals(), which performs both HVG selection and normalization via Pearson residuals prior to PCA
    • sc.pp.normalize_pearson_residuals_pca(), which applies an existing HVG selection if the user previously added one to the adata object, then normalizes via Pearson residuals and runs PCA.

    Both functions retain the raw input counts as adata.X and add fields for PCA/Normalization/HVG selection results (or return them) as applicable, most importantly the X_pca in adata.obsm['pearson_residuals_X_pca']. I hope this addresses some of the issues we discussed over at the other repo in a scanpy-y way.
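
    For illustration, option A might then be used like this (a sketch based on the description above; the keyword arguments n_top_genes and n_comps are assumptions, not necessarily the PR's exact signature):

    # HVG selection + normalization via Pearson residuals, followed by PCA;
    # the raw counts stay in adata.X
    sc.pp.recipe_pearson_residuals(adata, n_top_genes=2000, n_comps=50)

    X_pca = adata.obsm['pearson_residuals_X_pca']   # embedding field described above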

    Let me know what you think and where you think improvements are needed!

    Cheers, Jan.

    Enhancement ✨ 
    opened by jlause 64
  • PCA for sparse data (v2)

    I know this (quite ancient) pull request (#403) has been open, but I wasn't sure of its status. I think the consensus was to wait for sklearn to integrate the necessary changes? If that's still the case, then please feel free to remove this PR.

    Here I make use of scipy's extremely nifty LinearOperator class to customize the dot product functions for an input sparse matrix. In this case, the 'custom' dot product performs implicit mean centering.

    In my benchmarks, performing implicit mean centering in this way does not affect the runtime whatsoever. However, this approach has to use svds, for which randomized SVD is not implemented, so we have to use 'arpack', which can be significantly slower (but not intractably so: in my hands, I could still do PCA on datasets of 200k+ cells in minutes, and it sure beats densifying the data). If you want more thorough benchmarks, I am happy to generate them!
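
    For readers unfamiliar with the trick, here is a small self-contained sketch of the idea (my own illustration, not the PR's code; the function name is made up):

    import numpy as np
    from scipy.sparse.linalg import LinearOperator, svds

    def pca_implicit_centering(X, n_comps=50):
        """PCA of a sparse matrix without densifying it: the mean centering is
        folded into the matrix-vector products of a LinearOperator."""
        mu = np.asarray(X.mean(axis=0)).ravel()   # per-gene means

        def matvec(v):            # (X - 1 mu^T) @ v
            v = np.ravel(v)
            return X @ v - mu @ v

        def rmatvec(v):           # (X - 1 mu^T).T @ v
            v = np.ravel(v)
            return X.T @ v - mu * v.sum()

        Xc = LinearOperator(shape=X.shape, matvec=matvec, rmatvec=rmatvec, dtype=np.float64)
        u, s, vt = svds(Xc, k=n_comps)            # ARPACK; no randomized solver available here
        order = np.argsort(-s)                    # svds returns singular values in ascending order
        u, s, vt = u[:, order], s[order], vt[order]
        return u * s, vt                          # cell embedding (X_pca) and component loadings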

    The way I incorporated this functionality into scanpy/preprocessing/_simple.py might be questionable, and I would love any suggestions or advice on how to better integrate it if there is interest in pushing this PR through. Let me know!

    opened by atarashansky 38
  • Type annotation considerations

    @falexwolf wrote:

    Wow, this looks great! One remark for future PRs: We’re migrating to a new documentation style using type annotations

    I'm still not convinced that we should use type annotations for Scanpy's top-level functions. People use Scanpy in Jupyter Lab and notebooks, not in PyCharm. Hence, there is no gain from the annotations; by contrast, the annotated signatures simply look super complicated, and it's no longer feasible to grasp at first sight what's going on. The same goes for the output of help in Jupyter Lab and notebooks.

    So, while I think that type annotations may make sense for AnnData and everything in the background for a few developers (not for me, as I'm doing everything on remote servers using emacs), they don't make sense for the Scanpy user.

    Also, all the other big packages I work with all the time simply don't have them (numpy, seaborn, pandas, tensorflow), and they make it harder and lengthier for contributors to contribute if they need to go through them.

    Finally, I'm still not happy about how the automatically generated docs from the type annotations look (screenshots omitted here): the automatically generated line with Union[...] is just way too complicated for a human to make sense of. The mix of auto-generated types and manual annotations in the docs also looks inhomogeneous.

    So, please let's stay away from having more type annotations and corresponding docstrings at this stage and let's simply continue imitating what all the major packages are doing.

    Also, regarding your comment about the use of ``…`` vs. `…` in the docs: again, I think it leads to an inhomogeneous appearance to have *two* types of markup for code-related things. I agree that the read-the-docs italicized default style for `…` might be suboptimal, and I'll work on that if there is some time. But in general, I think there should be essentially one markup for code, as it's done in the tensorflow docs and a couple of other examples.

    Happy to also discuss offline, @flying-sheep ;).

    Question 
    opened by flying-sheep 38
  • Switch to flit

    What stays the same:

    • pip install scanpy
    • pip install .
    • pip install git+https://...
    • you can install your deps with conda
    • you can do a dev install

    What changes:

    • Please check the install docs, in short:
      • pip install -e .[every,single,extra] → flit install -s for dev installs
      • beni pyproject.toml > environment.yml for conda
    • Extremely simple flit build and flit publish. Maybe install keyring to store your publish password, and that's everything you need to know.
    • flit build doesn’t clutter your dev directory with build/ and *.egg-info/ junk, it just creates dist/scanpy-*{.whl,.tar.gz}.
    • No more obscure stuff nobody understands (MANIFEST.in, package_data, …)
    • Centralized setup configuration in pyproject.toml instead of spread over multiple files
    opened by flying-sheep 35
  • Simpler scatter functions

    I tried to collect in one file the code used by plotting functions that are based on matplotlib scatter, like sc.pl.tsne, sc.pl.pca, sc.pl.umap and others.

    Also, I tried to annotate the code and improve the readability.

    Currently, the code is in a separate file called scatter.py and not integrated into the API, as this facilitates comparison with the previous code.

    Besides readability the proposed code can:

    • Plot a large number of plots in multiple columns (instead of one long row of plots)
    • Pass arguments directly to matplotlib.pyplot.scatter, like vmax and vmin, to adjust the color scale (when plotting multiple panels, this is useful to keep a consistent range of values); see the extra example below

    See cells 15 and 15 in this example: https://gist.github.com/fidelram/8b43f786e7519bcfb7ffc0d5ccdbb0fe
    If the admins would like to merge these changes, I can replace the previous functions.

    An example on how to use the code:

    import scanpy.plotting.tools.scatter as spl
    spl.tsne(adata, color='louvain')
    


    Further examples here
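
    And a hedged sketch of the multi-column / color-scale options mentioned above (the parameter names ncols, vmin and vmax follow current scanpy embedding plots; the PR's scatter.py may name them differently):

    # two gene panels laid out in columns, sharing one color scale
    spl.tsne(adata, color=['CST3', 'NKG7'], ncols=2, vmin=0, vmax=5)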

    opened by fidelram 34
  • Ingest class

    As discussed with @falexwolf this PR introduces a new Ingest class to process new small pieces of data.

    sc.pp.neighbors(adata)  # adata is huge, with 1M observations

    # Ingest represents the existing data, its learned annotations and structure,
    # and exposes functionality that allows ingesting new data very quickly.
    ingest = sc.Ingest(adata)

    adata_small.obsm['X_model'] = model(adata_small.X)

    ingest.neighbors(adata_small)  # adata_small has just 1000 observations

    # Now we have the updated neighbors graph with 1,001,000 observations and
    # want to do the same things as always.

    # Map the new data into the embedding (umap) by leveraging the neighbors of
    # the new data within the old data, computing just a correction to the
    # existing embedding: a new data point gets the mean position of its
    # k nearest neighbors.
    ingest.umap(adata_small)

    # Update the clustering (mapping the 1000 observations into the existing
    # clusters): a new data point maps into a cluster if the majority of its
    # neighbors are members of that cluster.
    ingest.louvain(adata_small)

    opened by Koncopd 29
  • Adding Moran's I calculation to Scanpy

    Could you add a Moran's I calculation to Scanpy? It could be used in scIB and also to find variable genes across an embedding (it could be an alternative to SEMITONES, which takes a while to compute).
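
    For reference, Moran's I over a neighbors graph is cheap to compute; a minimal sketch (my own illustration, not an existing scanpy function):

    import numpy as np

    def morans_i(W, x):
        """Moran's I of one feature x over a (sparse) weight/connectivity matrix W."""
        x = np.asarray(x, dtype=float)
        xc = x - x.mean()
        num = xc @ (W @ xc)         # sum_ij w_ij (x_i - mean)(x_j - mean)
        denom = (xc * xc).sum()
        s0 = W.sum()                # total edge weight
        return (len(x) / s0) * (num / denom)

    # e.g. after sc.pp.neighbors(adata):
    #   W = adata.obsp['connectivities']
    #   score = morans_i(W, adata[:, 'GENE'].X.toarray().ravel())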

    Enhancement ✨ 
    opened by Hrovatin 28
  • Leiden restrict_to parameter

    Added a restrict_to parameter to leiden, using the louvain code as a template. Tests are not yet provided.

    A simple example of execution and checks:

    # First split on cluster 4
    sc.tl.leiden(adata, restrict_to=('leiden_res0.4', ['4']), resolution=0.6,
        key_added='leiden_res0.4_4_sub')
    
    # Additional split
    sc.tl.leiden(adata, restrict_to=('leiden_res0.4_4_sub', ['1', '2', '3', '4,4']),
        resolution=0.6, key_added='leiden_res0.4_4_add_sub')
    
    # All partitions together
    sc.pl.tsne(adata, color=['leiden_res0.4', 'leiden_res0.4_4_sub',
        'leiden_res0.4_4_add_sub'])
    
    # Partition size check
    ## Original size of clusters
    adata.obs['leiden_res0.4'].value_counts()
    0    932
    1    853
    3    676
    2    676
    4    338
    5     57
    Name: leiden_res0.4, dtype: int64
    
    # Check if first split is correct (can be iterated for subsequent splits)
    ## Assignment of samples in original clusters to subsplit clusters
    adata.obs.loc[(adata.obs['leiden_res0.4'].isin(['4'])),
        'leiden_res0.4_4_sub'].value_counts()
    4,0    103
    4,1     68
    4,2     66
    4,3     57
    4,4     44
    5        0
    3        0
    2        0
    1        0
    0        0
    Name: leiden_res0.4_4_sub, dtype: int64
    ## Assignment of samples not in original clusters to subsplit clusters
    adata.obs.loc[~(adata.obs['leiden_res0.4'].isin(['4'])),
        'leiden_res0.4_4_sub'].value_counts()
    0      932
    1      853
    3      676
    2      676
    5       57
    4,4      0
    4,3      0
    4,2      0
    4,1      0
    4,0      0
    Name: leiden_res0.4_4_sub, dtype: int64
    
    ...
    


    opened by fbrundu 28
  • Clustering with leidenalg

    Hello,

    It would appear that louvain-igraph has been obsoleted in favour of leidenalg, and the author makes a persuasive case as to the superiority of the new approach. To my untrained eye, the algorithm is conceptually similar to the Louvain modification used by Seurat, but introduces an extra collapsed network refinement step.

    It should be easy to support this in Scanpy - the syntax appears to be identical to the old louvain innards, and I was able to construct a very minimal dummy function for testing by taking the key bits of sc.tl.louvain() and replacing louvain. with leidenalg.:

    import leidenalg
    import numpy as np
    import pandas as pd
    from scanpy import utils
    from natsort import natsorted
    
    def leiden(adata, use_weights=False, resolution=1, iterations=-1):
    	g = utils.get_igraph_from_adjacency(adata.uns['neighbors']['connectivities'], directed=True)
    	weights = None
    	if use_weights:
    		weights = np.array(g.es["weight"]).astype(np.float64)
    	part = leidenalg.find_partition(
    		g, leidenalg.RBConfigurationVertexPartition, 
    		resolution_parameter = resolution, weights = weights, 
    		n_iterations = iterations,
    	)
    	groups = np.array(part.membership)
    	adata.obs['louvain'] = pd.Categorical(
    		values=groups.astype('U'),
    		categories=natsorted(np.unique(groups).astype('U')),
    	)
    

    As such, replacing any louvain. with leidenalg. in sc.tl.louvain() would do most of the work. Probably the only new thing that would need support would be the n_iterations parameter in leidenalg.find_partition(). The default value is 2; positive values control how many passes of the algorithm are performed, and -1 just makes it run until it fails to improve the clustering.

    opened by ktpolanski 28
  • Where should `sc.datasets` put data?

    I'm adding that expression atlas downloader now (#489), and wondering where the files should go.

    pbmc68k_reduced and toggleswitch put the datasets relative to where scanpy is installed (via __file__). All other functions place the data relative to where the python process was started.
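
    For the downloading functions, the target location is controlled by sc.settings.datasetdir (as far as I can tell it defaults to a ./data/ directory under the working directory), so one way to avoid scattering copies is to point it at a single cache:

    import scanpy as sc
    from pathlib import Path

    # send all sc.datasets downloads to one fixed location instead of the cwd
    sc.settings.datasetdir = Path.home() / '.cache' / 'scanpy-data'

    adata = sc.datasets.pbmc3k()   # downloaded into (and re-used from) that directory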

    While I like not storing the same files all over a filesystem, I'm not sure the scanpy installation directory is the right place to be storing data.

    Thoughts?

    opened by ivirshup 26
  • Export expression matrix from h5ad

    Hi,

    I would like to extract the expression matrix (genes and counts) from an h5ad file. How can I do this? I have searched the documentation but couldn't find anything about this (maybe I missed it).
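
    One way to do this with the anndata API (the file names below are placeholders):

    import scanpy as sc

    adata = sc.read_h5ad('yourfile.h5ad')
    # adata.to_df() returns a cells-by-genes pandas DataFrame built from adata.X
    adata.to_df().to_csv('expression_matrix.csv')

    # if a raw (pre-filtering/pre-normalization) matrix was stored in .raw:
    # adata.raw.to_adata().to_df().to_csv('raw_expression_matrix.csv')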

    opened by cartal 26
• Multithreading for scanpy.tl.rank_genes_groups?

    • [x] Additional function parameters / changed functionality / changed defaults?

    Hi Scanpy team,

    I emailed @ivirshup but others should be involved I think.

    This function would be useful if we could specify the number of threads to use: https://scanpy.readthedocs.io/en/stable/generated/scanpy.tl.rank_genes_groups.html

    Based on the number of items in the "groupby" field, we could use a basic split-merge approach here: each thread takes several of these items (the calculations are entirely independent of one another), and when each is done we join and concatenate the results.
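
    A rough sketch of that split-merge idea with joblib (the groupby column 'cell_type' is hypothetical, and shipping the whole AnnData to every worker is wasteful; a real implementation inside scanpy would share the matrix instead):

    import scanpy as sc
    from joblib import Parallel, delayed

    def rank_one_group(adata, group, groupby='cell_type'):
        # run the test for a single group vs. rest on a copy, then pull the result table
        res = sc.tl.rank_genes_groups(
            adata, groupby=groupby, groups=[group], method='wilcoxon', copy=True
        )
        return group, sc.get.rank_genes_groups_df(res, group=group)

    groups = adata.obs['cell_type'].unique()
    # each group is independent, so the per-group tests run in separate workers
    # and the result tables are simply collected (joined) at the end
    results = dict(Parallel(n_jobs=4)(delayed(rank_one_group)(adata, g) for g in groups))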

    I'm happy to help write up a PR (or participate), but I'd like to hear whether this is something you'd be willing to prioritize. (It's related to a project where Fabian is the PI.)

    Best, Evan

    opened by evanbiederstedt 0
  • normalize_total affects layers

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

    Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

    Hi everyone: many users probably do not rely on pp.normalize_total for downstream analysis, but I found a strange default behavior that I think is worth mentioning. pp.normalize_total() normalized my .layers['counts'] as well. The documentation is a bit murky; I'm not sure whether that is the expected behavior when layer is unspecified, but such a default would undermine anyone who wishes to save the count information before normalization.

    Minimal code sample (that we can copy&paste without having any data)

    # Your code here
    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X
    cell = adata.obs.index[1]
    adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    print("Run 1: initial values after simple processing: ")
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum())
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    print("\nRun 2: after sc.pp.normalize_total: ")
    sc.pp.normalize_total(adata, target_sum=1e4)
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum()) # Note that this changed too
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    adata = sc.datasets.pbmc3k()
    adata.layers['counts'] = adata.X
    cell = adata.obs.index[1]
    adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
    sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
    
    print("\nRun 3: normalization, specifing argument layer=None")
    sc.pp.normalize_total(adata, target_sum=1e4, layer = None)
    print('sum of count layer in designated cell: ', adata[cell,:].layers['counts'].sum())
    print('obs[total_counts] value in cell: ', adata[cell,:].obs['total_counts'][0])
    print('.X.sum() value in cell: ', adata[cell,:].X.sum())
    print('sum of count layer of MALAT1 in cell: ', adata[cell,'MALAT1'].layers['counts'])
    print('.X value of MALAT1 in cell: ', adata[cell,'MALAT1'].X)
    
    #Output:
    Run 1: initial values after simple processing: 
    sum of count layer in designated cell:  4903.0
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  4903.0
    sum of count layer of MALAT1 in cell:    (0, 0)	142.0
    .X value of MALAT1 in cell:    (0, 0)	142.0
    
    Run 2: after sc.pp.normalize_total: 
    normalizing counts per cell
        finished (0:00:00)
    sum of count layer in designated cell:  10000.049
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  10000.049
    sum of count layer of MALAT1 in cell:    (0, 0)	289.61862
    .X value of MALAT1 in cell:    (0, 0)	289.61862
    
    Run 3: normalization, specifing argument layer=None
    normalizing counts per cell
        finished (0:00:00)
    sum of count layer in designated cell:  10000.049
    obs[total_counts] value in cell:  4903.0
    .X.sum() value in cell:  10000.049
    sum of count layer of MALAT1 in cell:    (0, 0)	289.61862
    .X value of MALAT1 in cell:    (0, 0)	289.61862
    

    Versions


    anndata 0.8.0 scanpy 1.9.1

    PIL 9.2.0 anndata2ri 1.1 annoy NA backcall 0.2.0 backports NA bbknn NA beta_ufunc NA binom_ufunc NA cffi 1.15.1 cloudpickle 2.2.0 colorama 0.4.6 cycler 0.10.0 cython_runtime NA cytoolz 0.12.0 dask 2022.02.0 dateutil 2.8.2 debugpy 1.6.3 decorator 5.1.1 defusedxml 0.7.1 deprecated 1.2.13 entrypoints 0.4 fsspec 2022.11.0 future_fstrings NA google NA h5py 3.7.0 igraph 0.9.1 ipykernel 6.14.0 ipython_genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 jinja2 3.1.2 joblib 1.2.0 kiwisolver 1.4.4 leidenalg 0.8.10 llvmlite 0.39.1 louvain 0.7.2 markupsafe 2.1.1 matplotlib 3.5.3 matplotlib_inline 0.1.6 mpl_toolkits NA natsort 8.2.0 nbinom_ufunc NA numba 0.56.3 numpy 1.21.6 packaging 21.3 pandas 1.3.5 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA prompt_toolkit 3.0.31 psutil 5.9.3 ptyprocess 0.7.0 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.8.0 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.13.0 pynndescent 0.5.7 pyparsing 3.0.9 pytz 2022.5 pytz_deprecation_shim NA rpy2 3.5.1 scib 1.0.4 scipy 1.7.3 seaborn 0.12.1 session_info 1.0.0 six 1.16.0 sklearn 1.0.2 statsmodels 0.13.2 storemagic NA texttable 1.6.4 threadpoolctl 3.1.0 tlz 0.12.0 toolz 0.12.0 tornado 6.2 tqdm 4.64.1 traitlets 5.5.0 typing_extensions NA tzlocal NA umap 0.5.3 wcwidth 0.2.5 wrapt 1.14.1 yaml 6.0 zipp NA zmq 24.0.1 zope NA

    IPython 7.33.0 jupyter_client 7.4.4 jupyter_core 4.11.1 notebook 6.5.1

    Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 06:08:21) [GCC 9.4.0] Linux-5.4.0-131-generic-x86_64-with-debian-buster-sid

    Session information updated at 2022-12-28 13:52

    opened by erieslee 1
  • `sc.external.pp.magic`: what is `n_jobs`'s default _really_?

    scanpy docs:

    n_jobs : Optional[int] (default: None) Number of threads to use in training. All cores are used by default.

    magic docs:

    n_jobs (integer, optional, default: 1) – The number of jobs to use for the computation. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging.

    scanpy code:

    https://github.com/scverse/scanpy/blob/536ed15bc73ab5d1131c0d530dd9d4f2dc9aee36/scanpy/external/pp/_magic.py#L164

    https://github.com/scverse/scanpy/blob/536ed15bc73ab5d1131c0d530dd9d4f2dc9aee36/scanpy/_settings.py#L82

    I'm guessing the scanpy docs are wrong when they say "All cores are used by default"?

    opened by alexlenail 0
  • sc.tl.dendrogram 'var_names' -parameter bug fix

    Bug: the var_names parameter of the sc.tl.dendrogram function is not used properly.

    Currently:

    • Hierarchical clustering is calculated on all genes, even when var_names is not None

    Fix:

    • The subset of genes defined by var_names is now used

    In addition:

    • When all values in some row of rep_df (or mean_df) are equal, df.T.corr() is not defined for that row, resulting in NaNs in the correlation matrix.
    • This is quite common with a subset of genes var_names, e.g. genes that are 0 in all cells (the cells have already passed quality control/filtering at this point of the downstream analysis).
    • This throws an error in distance.squareform(1-corr_matrix): ValueError: Distance matrix 'X' must be symmetric.
    • Fix: in this case, add a 'dummy' feature (rep_df["dummy"] = -1) to make sure that at least one feature in each row is distinct; see the sketch after this list.
    • Note that this addition affects (increases) the correlation between rows. However, it affects all rows equally, so the hierarchy stays as-is.
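
    A toy reproduction of the NaN problem and the dummy-feature workaround described above (made-up numbers, just to illustrate the mechanism):

    import pandas as pd

    rep_df = pd.DataFrame(
        {'geneA': [0.0, 1.0, 2.0], 'geneB': [0.0, 2.0, 1.0], 'geneC': [0.0, 3.0, 0.5]},
        index=['cluster0', 'cluster1', 'cluster2'],
    )
    # cluster0 is constant (all zeros), so its Pearson correlation is undefined
    print(rep_df.T.corr())        # row/column 'cluster0' is NaN

    rep_df['dummy'] = -1          # the workaround: no row is constant anymore
    print(rep_df.T.corr())        # fully defined, symmetric correlation matrix
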
    opened by lutrarutra 1
  • write anndata failed, pearson_residuals_df header message is too large

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [ ] (optional) I have confirmed this bug exists on the master branch of scanpy.

    Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.

    Minimal code sample (that we can copy&paste without having any data)

    Write any AnnData object with Pearson residuals in .uns:

    ad_all.write(filename='output/10x_h5/ad_all_2cello.h5ad')
    

    The pearson_residuals_df looks like this, with 38291 rows (obs) and 5000 columns (features):

    {'theta': 100,
     'clip': None,
     'computed_on': 'adata.X',
     'pearson_residuals_df': gene_name                             A2M  AADACL2-AS1      AAK1     ABCA1  \
     barcode                                                                      
     GAACGTTCACACCGAC-1-placenta_81  -1.125285    -1.159130 -3.921314 -2.533474   
     TATACCTGTTAGCTAC-1-placenta_81  -1.091364     3.267127 -1.806667 -2.109586   
     CTCAAGAGTGACTGTT-1-placenta_81  -1.074943    12.272920 -1.948798 -2.735791   
     TTCATTGTCACGAACT-1-placenta_81  -1.098699    -1.131765  3.481171  4.472371   
     TATCAGGCAGCTCATA-1-placenta_81  -1.107734    -1.141064 -0.571775 -2.813671   
     ...                                   ...          ...       ...       ...   
     CACAACATCGGCGATC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
     AGCCAGCGTGCCCAGT-1-placenta_314 -0.097424    -0.100394 -0.366482 -0.256219   
     CCGGTGAGTGTTCGAT-1-placenta_314 -0.110334    -0.113696 -0.414971 -0.290148   
     AGGTCATAGCCTGACC-1-placenta_314 -0.115585    -0.119107 -0.434686 -0.303945   
     TTTATGCCAAAGGGTC-1-placenta_314 -0.112876    -0.116316 -0.424515 -0.296827 
    
    Unable to create attribute (object header message is too large)
    
    Above error raised while writing key 'pearson_residuals_df' of <class 'h5py._hl.group.Group'> to /
    

    Versions

    scanpy==1.9.1 anndata==0.8.0 umap==0.5.2 numpy==1.21.5 scipy==1.8.0 pandas==1.4.1 scikit-learn==1.0.2 statsmodels==0.13.2 python-igraph==0.9.9 pynndescent==0.5.6

    opened by brainfo 5
  • adata.var.sort_index(inplace=True) does not sort adata.X

    • [x] I have checked that this issue has not already been reported.
    • [x] I have confirmed this bug exists on the latest version of scanpy.
    • [x] (optional) I have confirmed this bug exists on the master branch of scanpy.

    When sorting the adata object using adata.var.sort_index(inplace=True), adata.X is not sorted accordingly (see the example below).


    
    import numpy as np
    import pandas as pd
    import scanpy as sc
    import matplotlib.pyplot as plt
    sc.settings.verbosity = 3
    sc.logging.print_header()
    
    adata = sc.datasets.pbmc3k()
    adata.to_df()['MPO'].sum()            # MPO counts before sorting .var
    adata.var.sort_index(inplace=True)
    adata.to_df()['MPO'].sum()            # different value: adata.X was not reordered along with .var
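
    For reference, a reordering that keeps adata.X aligned can be done by indexing the AnnData object itself instead of sorting .var in place (a sketch, not necessarily the fix the maintainers would recommend):

    # indexing the AnnData reorders X, layers and var together
    adata = adata[:, adata.var_names.sort_values()].copy()
    adata.to_df()['MPO'].sum()            # now matches the value from before reordering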
    
    

    Versions


    anndata 0.8.0 scanpy 1.9.1

    PIL 9.2.0 asttokens NA backcall 0.2.0 beta_ufunc NA binom_ufunc NA cffi 1.15.1 cycler 0.10.0 cython_runtime NA dateutil 2.8.2 debugpy 1.5.1 decorator 5.1.1 defusedxml 0.7.1 entrypoints 0.4 executing 0.8.3 google NA h5py 3.7.0 hypergeom_ufunc NA igraph 0.9.11 ipykernel 6.17.1 ipython_genutils 0.2.0 ipywidgets 8.0.2 jedi 0.18.1 joblib 1.2.0 kiwisolver 1.4.4 leidenalg 0.8.10 llvmlite 0.39.1 loompy 3.0.7 matplotlib 3.6.0 matplotlib_inline NA mpl_toolkits NA natsort 8.2.0 nbinom_ufunc NA ncf_ufunc NA numba 0.56.2 numpy 1.23.3 numpy_groupies 0.9.19 packaging 21.3 pandas 1.4.4 parso 0.8.3 pexpect 4.8.0 pickleshare 0.7.5 pkg_resources NA platformdirs 2.6.0 prompt_toolkit 3.0.20 psutil 5.9.4 ptyprocess 0.7.0 pure_eval 0.2.2 pycparser 2.21 pydev_ipython NA pydevconsole NA pydevd 2.6.0 pydevd_concurrency_analyser NA pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pygments 2.11.2 pynndescent 0.5.7 pyparsing 3.0.9 pytz 2022.2.1 scipy 1.9.1 scvelo 0.2.4 session_info 1.0.0 setuptools_scm NA six 1.16.0 sklearn 1.1.2 stack_data 0.2.0 statsmodels 0.13.2 texttable 1.6.4 threadpoolctl 3.1.0 tornado 6.2 tqdm 4.64.1 traitlets 5.6.0 typing_extensions NA umap 0.5.3 wcwidth 0.2.5 yaml 5.4.1 zipp NA zmq 24.0.1

    IPython 8.4.0 jupyter_client 7.4.8 jupyter_core 5.1.0 notebook 6.5.2

    Python 3.9.13 (main, Aug 25 2022, 23:26:10) [GCC 11.2.0] Linux-5.4.0-1092-aws-x86_64-with-glibc2.27

    opened by sandrav-CGEN 0
Owner: Theis Lab, Institute of Computational Biology