A scanpy extension to analyse single-cell TCR and BCR data.

Overview

Scirpy: A Scanpy extension for analyzing single-cell immune-cell receptor sequencing data


Scirpy is a scalable Python toolkit to analyze T cell receptor (TCR) or B cell receptor (BCR) repertoires from single-cell RNA sequencing (scRNA-seq) data. It seamlessly integrates with the popular Scanpy library and provides various modules for data import, analysis, and visualization.

[Figure: the Scirpy workflow]

Getting started

Please refer to the documentation, in particular the tutorials and the API documentation.

In the documentation, you can also learn more about our immune-cell receptor model.

Case study

The case study from our preprint is available here.

Installation

You need to have Python 3.7 or newer installed on your system. If you don't have Python installed, we recommend installing Miniconda.

There are several alternative options to install scirpy:

  1. Install the latest release of scirpy from PyPI:

     pip install scirpy

  2. Get it from Bioconda:

     conda install -c conda-forge -c bioconda scirpy

  3. Install the latest development version:

     pip install git+https://github.com/icbi-lab/scirpy.git@master

  4. Run it in a container using Docker or Podman:

     docker pull quay.io/biocontainers/scirpy:<tag>

where <tag> is one of these tags.
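
After installation, a minimal session might look like the following sketch (the calls follow the documented scirpy API; the example dataset name is taken from the scirpy documentation and should be treated as an assumption):

    import scirpy as ir

    # load a small example dataset that ships with scirpy
    adata = ir.datasets.wu2020_3k()

    # quality control: annotate receptor types and flag orphan/multichain cells
    ir.tl.chain_qc(adata)

    # compute sequence distances, then group cells into clonotypes
    ir.pp.ir_dist(adata)
    ir.tl.define_clonotypes(adata)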

Support

We are happy to assist with problems when using scirpy. Please report any bugs, feature requests, or help requests using the issue tracker. We try to respond within two working days; however, fixing bugs or implementing new features can take substantially longer, depending on the availability of our developers.

Release notes

See the release section.

Contact

Please use the issue tracker.

Citation

Sturm, G., Tamas, GS., ..., Finotello, F. (2020). Scirpy: A Scanpy extension for analyzing single-cell T-cell receptor sequencing data. Bioinformatics. doi:10.1093/bioinformatics/btaa611
Comments
  • Vdj plot - [merged]

    In GitLab by @szabogtamas on Mar 25, 2020, 19:57

    Merges vdj_plot -> master

    Added a much faster version of #24
    Fixes #24.

    Test cases still need to be added.

    opened by grst 70
  • Issues with installing SCIRPY

    I am trying to install Scirpy using Anaconda/Jupyter on my Windows desktop.

    When I try this: conda install -c conda-forge -c bioconda scirpy

    I got the following error message:

    Collecting package metadata (current_repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working... failed with repodata from current_repodata.json, will retry with next repodata source.
    Collecting package metadata (repodata.json): ...working... done
    Solving environment: ...working... failed with initial frozen solve. Retrying with flexible solve.
    Solving environment: ...working... 
    Found conflicts! Looking for incompatible packages.
    This can take several minutes.  Press CTRL-C to abort.
    failed
    
    Note: you may need to restart the kernel to use updated packages.
    
    
    Building graph of deps:   0%|          | 0/2 [00:00<?, ?it/s]
    Examining python=3.8:   0%|          | 0/2 [00:00<?, ?it/s]  
    Examining scirpy:  50%|#####     | 1/2 [00:00<00:00,  2.94it/s]
    Examining scirpy: 100%|##########| 2/2 [00:00<00:00,  5.88it/s]
                                                                   
    
    Determining conflicts:   0%|          | 0/2 [00:00<?, ?it/s]
    Examining conflict for python scirpy:   0%|          | 0/2 [00:00<?, ?it/s]
                                                                               
    
    UnsatisfiableError: The following specifications were found to be incompatible with each other:
    
    Output in format: Requested package -> Available versions
    

    Then I tried this: pip install scirpy, and got another error message:

    ERROR: Command errored out with exit status 1:
       command: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\tpeng\AppData\Local\Temp\pip-wheel-3n51vnw0'
           cwd: C:\Users\tpeng\AppData\Local\Temp\pip-install-jpyjqkeg\python-levenshtein\
      Complete output (27 lines):
      running bdist_wheel
      running build
      running build_py
      creating build
      creating build\lib.win-amd64-3.8
      creating build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\__init__.py -> build\lib.win-amd64-3.8\Levenshtein
      running egg_info
      writing python_Levenshtein.egg-info\PKG-INFO
      writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
      writing entry points to python_Levenshtein.egg-info\entry_points.txt
      writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
      writing requirements to python_Levenshtein.egg-info\requires.txt
      writing top-level names to python_Levenshtein.egg-info\top_level.txt
      reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
      reading manifest template 'MANIFEST.in'
      warning: no previously-included files matching '*pyc' found anywhere in distribution
      warning: no previously-included files matching '*so' found anywhere in distribution
      warning: no previously-included files matching '.project' found anywhere in distribution
      warning: no previously-included files matching '.pydevproject' found anywhere in distribution
      writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
      copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.8\Levenshtein
      copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.8\Levenshtein
      running build_ext
      building 'Levenshtein._levenshtein' extension
      error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
      ----------------------------------------
      ERROR: Failed building wheel for python-levenshtein
        ERROR: Command errored out with exit status 1:
         command: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\tpeng\AppData\Local\Temp\pip-record-spj3ycwj\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\tpeng\Anaconda3\Include\python-levenshtein'
             cwd: C:\Users\tpeng\AppData\Local\Temp\pip-install-jpyjqkeg\python-levenshtein\
        Complete output (27 lines):
        running install
        running build
        running build_py
        creating build
        creating build\lib.win-amd64-3.8
        creating build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\StringMatcher.py -> build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\__init__.py -> build\lib.win-amd64-3.8\Levenshtein
        running egg_info
        writing python_Levenshtein.egg-info\PKG-INFO
        writing dependency_links to python_Levenshtein.egg-info\dependency_links.txt
        writing entry points to python_Levenshtein.egg-info\entry_points.txt
        writing namespace_packages to python_Levenshtein.egg-info\namespace_packages.txt
        writing requirements to python_Levenshtein.egg-info\requires.txt
        writing top-level names to python_Levenshtein.egg-info\top_level.txt
        reading manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        reading manifest template 'MANIFEST.in'
        warning: no previously-included files matching '*pyc' found anywhere in distribution
        warning: no previously-included files matching '*so' found anywhere in distribution
        warning: no previously-included files matching '.project' found anywhere in distribution
        warning: no previously-included files matching '.pydevproject' found anywhere in distribution
        writing manifest file 'python_Levenshtein.egg-info\SOURCES.txt'
        copying Levenshtein\_levenshtein.c -> build\lib.win-amd64-3.8\Levenshtein
        copying Levenshtein\_levenshtein.h -> build\lib.win-amd64-3.8\Levenshtein
        running build_ext
        building 'Levenshtein._levenshtein' extension
        error: Microsoft Visual C++ 14.0 is required. Get it with "Build Tools for Visual Studio": https://visualstudio.microsoft.com/downloads/
        ----------------------------------------
    ERROR: Command errored out with exit status 1: 'C:\Users\tpeng\Anaconda3\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"'; __file__='"'"'C:\\Users\\tpeng\\AppData\\Local\\Temp\\pip-install-jpyjqkeg\\python-levenshtein\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record 'C:\Users\tpeng\AppData\Local\Temp\pip-record-spj3ycwj\install-record.txt' --single-version-externally-managed --compile --install-headers 'C:\Users\tpeng\Anaconda3\Include\python-levenshtein' Check the logs for full command output.
    
      Stored in directory: c:\users\tpeng\appdata\local\pip\cache\wheels\8c\51\cb\423184c62cc06c302d2f54f9853e5acee6c15a3b04d49a5eb3
      Building wheel for python-levenshtein (setup.py): started
      Building wheel for python-levenshtein (setup.py): finished with status 'error'
      Running setup.py clean for python-levenshtein
      Building wheel for yamlordereddictloader (setup.py): started
      Building wheel for yamlordereddictloader (setup.py): finished with status 'done'
      Created wheel for yamlordereddictloader: filename=yamlordereddictloader-0.4.0-py3-none-any.whl size=4058 sha256=05980b7e37960621917874dd26c59adf1eb9d304ca1f21d7c81d38adc8bc2674
      Stored in directory: c:\users\tpeng\appdata\local\pip\cache\wheels\50\9a\6f\9cb3312fd9cd01ea93c3fdc1dbee95f5fa0133125d4c7cb09a
    Successfully built airr yamlordereddictloader
    Failed to build python-levenshtein
    Installing collected packages: yamlordereddictloader, airr, python-levenshtein, squarify, parasail, pytoml, scirpy
        Running setup.py install for python-levenshtein: started
        Running setup.py install for python-levenshtein: finished with status 'error'
    
    question 
    opened by taopeng1100 40
  • List of plots [REPLACEMENT ISSUE]

    The original issue

    Id: 9
    Title: List of plots
    

    could not be created. This is a dummy issue replacing the original one. It contains everything but the original issue description. If the GitLab repository still exists, visit the following link to view the original issue:

    TODO

    opened by grst 38
  • Memory usage ir_neighbors

    Dear authors,

    Currently, the ir_neighbors algorithm consumes over 100 GB of memory on data from 55k cells, causing out-of-memory failures on the server. Is there any way to limit memory consumption? Is this normal behavior, and which parameters can I adapt to control memory usage?
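
    One way to reduce the memory footprint, sketched against the pre-v0.7 pp.ir_neighbors API (parameter names are taken from the v0.7 release notes further below and are an assumption here):

        import scirpy as ir

        # the identity metric avoids expensive alignment distances, and
        # considering only the primary chain pair shrinks the neighborhood graph
        ir.pp.ir_neighbors(
            adata,
            metric="identity",
            receptor_arms="all",
            dual_ir="primary_only",
        )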

    Kind Regards,

    opened by vladie0 27
  • Support for BCR

    BCR support meta issue.

    Initial PR #183 addresses:

    • [x] change data structure
    • Instead of TRA_1, TRA_2, TRB_1 and TRB_2 have arm1_primary, arm1_secondary, arm2_primary and arm2_secondary
    • have additional 4 columns: arm1_primary_type, ...; These can accept values such as TRA, TRB, TRG, TRD, IGH, IGL (see AIRR locus names)
    • [x] adapt chain_categories to identify bona fide vs other pairings (e.g. flag a cell with TRA + IGH).
    • [x] clonotype network: separate by receptor_type (by default, no connections between BCR and TCR)
    • [x] update glossary and documentation. Make clear that now there are VJ and VDJ chains. (mostly done)
    • [x] rename tcr_ to cdr3_ or receptor_ or vdj_ or ir_

    To be resolved before next release (v0.5 "adding experimental BCR support")

    • [x] #194 vdj_usage is broken with new data structure
    • [x] #195 improve BCR-related documentation
    • [x] read_tracer() with gamma/delta
    • [x] read_bracer()
    • [x] #198 Add BCR example dataset

    BCR-related issues that can be resolved at a later point

    • [ ] #197 function to infer antibody class
    • [ ] #196 function to infer somatic hypermutation status
    • [ ] #199 BCR-tutorial
    • [ ] add support for CDR1 and 2 (#185)
    opened by grst 27
  • Find better name

    In GitLab by @grst on Mar 20, 2020, 18:16

    The current one, sctcrpy, is hard to pronounce and remember.

    Also it would be nice if the name left the option to expand to BCRs later on.

    imm, sc, py, receptor, cr, ... ??

    opened by grst 21
  • Cannot convert output from Scirpy to dandelion

    Description of the bug

    I am trying to convert the AnnData object, after clonal assignment by Scirpy, into the dandelion format. The result showed "field productive has invalid bool T + T". I would like to convert it in order to update the germline sequence of each BCR sequence using dandelion, because I did not find this function in Scirpy. However, if you can suggest other ways, feel free to let me know.

    Minimal reproducible example

    import scirpy as ir
    
    ABC_irdata_exclude_orphan_dandelion = ir.io.to_dandelion(ABC_irdata_exclude_orphan)
    ABC_irdata_exclude_orphan_dandelion
    

    The error message produced by the code above

    ~/.conda/envs/dandelion/lib/python3.8/site-packages/airr/schema.py in validate_row(self, row)
        276                 if spec == 'number':  self.to_float(row[f], validate=True)
        277             except ValidationError as e:
    --> 278                 raise ValidationError('field %s has %s' %(f, e))
        279 
        280         return True
    
    ValidationError: field productive has invalid bool T + T
    

    Version information

    
    
    bug 
    opened by sbenjamaporn 19
  • Ranking genes between specific clusters in clonotype network

    Dear ICBI lab,

    I have some questions regarding the Scirpy package, I would really appreciate your help. For my analyses I first merged TCR and transcriptomics data from two different samples (organ 1 and organ 2), subsequently I then merged these two files.

    1. a) If, when using the Scirpy clonotype_network tool, 'sequence' is set to 'nt', are the clusters in the clonotype network (where each node represents a cell) based on identical nucleotide sequences, or on similarity, meaning that one cluster could consist of cells with slightly different nucleotide sequences?

    As I understand it, each node represents a cell, and edges connect cells belonging to the same clonotype. The function makes visualization of the clonotype network possible, analogous to the construction of a neighborhood graph from transcriptomics data with the Scanpy package; based on the above, it computes a neighborhood graph of CDR3 nucleotide sequences with "scirpy.pp.tcr_neighbors()". However, I couldn't find the answer to my question and was hoping you could help me out.

    b) Do the lines that connect the cells within a cluster have any meaning?

    [Screenshot: clonotype network plot]

    c) In the plot above, is it correct that the closer the different clusters are together, the more similar their nucleotide sequences are? Meaning that the sequences of the clonotypes consisting of only 2 cells in this case are most different from the clonotype clusters consisting of >5 cells (as they are further apart)?

    2. a) The package allows for specifying what organs the clusters consist of:

    [Screenshot: clonotype network colored by organ]

    My question is: how can I select specific clusters in the clonotype network graph? I would only like to include clonotypes in the network that are identical between the two samples (organ 1 and organ 2), meaning that in the plot above I would like to filter for the clusters that display both blue and orange nodes.

    b) How can I assign numbers to my clusters based on identical nucleotide sequences shared between two samples?

    c) How can I add a legend to my plot, and how can I change the name 'batch' into 'sample'?

    3. Using Scirpy, how can one best identify differentially expressed genes (based on the transcriptomics data) between the different clusters defined by shared nucleotide sequences between blood and fat, for example cluster 1 vs. the rest of the clusters, or clusters 1 and 2 vs. clusters 3 and 4? I have tried Scanpy's gene-ranking tool with the Wilcoxon test, but unfortunately I can't make it work.

    sc.tl.rank_genes_groups(adata, 'clonotype', groups=['1','2'], reference=['3','4'], method='wilcoxon')
    sc.pl.rank_genes_groups(adata, groups=['1', '2'], n_genes=20)

    • The key of the observations grouping to be considered would be: "clonotype clusters based on shared identical nucleotide sequences between organ 1 and organ 2".
    • Subset of groups to which the comparison would be restricted: "clonotype 1 and clonotype 2"
    • Comparison: compare with respect to a specific group
    • Group identifier with respect to which to compare: "clonotype 3 and clonotype 4"
    • "The number of genes that appear in the returned tables": 100
    • "Method": Wilcoxon rank-sum
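
    For question 2a, a possible approach is sketched below (assuming clonotype assignments are stored in adata.obs["clone_id"] and the sample label in adata.obs["batch"]; both column names are assumptions):

        # keep only clonotypes that are observed in more than one sample
        n_samples = adata.obs.groupby("clone_id")["batch"].nunique()
        shared = n_samples[n_samples > 1].index
        adata_shared = adata[adata.obs["clone_id"].isin(shared)].copy()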

    Thanks in advance,

    Josine

    question 
    opened by josinejansen 19
  • BD Rhapsody data import

    I have a question about importing BD Rhapsody CDR3 VDJ data. Following the scirpy introduction, the imported data object is very similar to BD VDJ data.

    If you are interested in BD VDJ data, please check the possibility of importing BD VDJ data into scirpy. I can share my data with you.

    opened by wajm 18
  • Plot overhaul - [merged]

    In GitLab by @grst on Mar 2, 2020, 14:21

    Merges feat/plot-overhaul -> master

    This PR aims at addressing the issues in the %"plot overhaul" milestone.

    • [x] Simplify and restructure tools (Don't add to uns when inexpensive, Fixes #25)
    • [x] Rudimentary support for figure themes (atm, only a default theme is supported, but can be extended easily, Fixes #18)
    • [ ] Simplify and restructure plots

    Plot checklist: Make sure every plotting function

    • [x] accepts
      • styling kwargs
      • ax object
      • (this is implicitly given by kwargs forwarding. Maybe requires better documentation)
    • [x] returns ax object
    • [x] has sensible defaults for ax labelling and title
    opened by grst 17
  • List of Tools

    In GitLab by @grst on Jan 30, 2020, 12:59

    Tools are functions that work with the data parsed from 10x/tracer and add either

    • new columns to obs
    • new matrices to obsm (e.g. distance matrices)
    • other summary data to uns.
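
    For illustration, a minimal sketch of this pattern (a hypothetical tool, not part of scirpy; the clone_id column name is an assumption):

        from anndata import AnnData

        def clonal_expansion_flag(adata: AnnData, clonotype_key: str = "clone_id") -> None:
            """Add a boolean column to adata.obs marking clonotypes with more than one cell."""
            sizes = adata.obs[clonotype_key].value_counts()
            # map each cell's clonotype to its size; cells without a clonotype become False
            adata.obs["is_expanded"] = adata.obs[clonotype_key].map(sizes) > 1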

    They are usually required as an additional processing step before running certain plotting functions. Here's a list of tools we want to implement.

    @szabogtamas, feel free to add to/edit the list.

    List of tools

    • [x] st.tl.define_clonotypes(adata) assigns clonotypes to cells based on their CDR3 sequences
    • [x] st.tl.tcr_dist(adata, chains=["TRA_1", "TRB_1"], combination=np.min) adds TCR dist to obsm (#11)
    • [x] st.tl.kidera_dist adds Kidera distances to obsm
    • [x] st.tl.chain_convergence(adata, groupby) adds a column to obs that contains the number of nucleotide versions for each CDR3 AA sequence
    • [x] st.tl.alpha_diversity(adata, groupby, diversityforgroup) So far we were only thinking about calculating the diversity of clonotypes in different groups, but the diversity of any group could just as well be calculated.
    • [ ] st.tl.sequence_logos(adata, ?forgroup?) Precompute MSAs and sequence logos for plotting with st.pl.sequence_logos.
    • [ ] st.tl.dendrogram(adata, groupby) Compute a dendrogram on an arbitrary distance matrix (e.g. from tcr_dist).

    Needs discussion

    • [ ] st.tl.create_group(group_membership={"Group1": ["barcode1", "barcode2"]}) adds a group membership to each cell by adding a column to obsm and the name of the grouping to a list in uns (by default, groups based on samples, V gene usage and even clonotypes could be created at the initial run); might call chain_convergence and alpha_diversity functions to calculate these measures right when creating a group

    Ideas, might be implemented at later stage

    • [ ] Shared Kmers
    • [ ] GLIPH
    • [ ] Chains recognizing the same epitopes based on McPAS-TCR
    • [ ] epitope reactivity -> query external database
    • [ ] tcellmatch (Fischer, Theis et al.)
    opened by grst 17
  • antigen specificity prediction

    Description of feature

    The first reasonable methods to predict antigen specificity are emerging, e.g.

    • ERGO-II (https://www.frontiersin.org/articles/10.3389/fimmu.2021.664514/full, https://github.com/IdoSpringer/ERGO-II)

    These are conceptually different from querying databases through sequence distance metrics or autoencoders, as they do not simply model the sequence similarity, but explicitly model the specificity.

    It would be nice to call them directly from scirpy.

    @FFinotello, potentially another good student task.

    opened by grst 0
  • Scalability to >1M cells

    Description of feature

    I have been playing with Omniscope's COVID dataset, which provides 8M TCR receptors. In doing so, I identified several bottlenecks that make working with >1M cells in scirpy painful or impossible.

    This meta issue gives an overview of the progress on improving scirpy's scalability.

    graph TB
        subgraph legend
             legend1(could be faster -- minutes)
             OK(OK -- seconds)
             legend2(prohibitively slow -- hours)
             legend3(not profiled yet)
             style legend1 stroke:#ff7f00
             style OK stroke:#4daf4a
             style legend2 stroke:#e41a1c
        end
    
    graph TB
        subgraph preprocessing
          IO --> QC
          QC --> dist_id[ir_dist identity]
          QC --> dist_levenshtein[ir_dist levenshtein]
          QC --> dist_alignment[ir_dist alignment]
          dist_id --> define_clonotypes
          dist_levenshtein --> define_clonotypes
          dist_alignment --> define_clonotypes
          define_clonotypes --> clonotypes
          QC -.-> autoencoder
          autoencoder -.-> clonotypes
          autoencoder -.-> define_clonotypes
    
          clonotypes[(CLONOTYPES)]
          
          style IO stroke:#ff7f00
          style QC stroke:#4daf4a
          style dist_id stroke:#4daf4a
          style define_clonotypes stroke:#e41a1c
          style dist_levenshtein stroke:#e41a1c
          style dist_alignment stroke:#e41a1c
          style clonotypes stroke:white
       end
       
       subgraph downstream
          clonotypes --> clonotype_network
          clonotypes --> other[other tools]
       end
    

    Action items

    1. data structure (#356). The foundation for other changes. Might also speed up saving the anndata object.
    2. reading data (#367). User experience can be improved, but not a top priority atm.
    3. ir_dist (#304). Needs more scalable methods for computing sequence distances.
    4. define_clonotypes (#368). At the very least needs a better parallelization. Maybe there's room for some jax/numba.
    5. autoencoder-based embedding (#369). Possible alternative to ir_dist. Maybe it even makes sense to combine ir_dist and define_clonotypes into a single step.
    opened by grst 0
  • Autoencoder-based sequence embedding

    Description of feature

    IMO autoencoder-based sequence embedding has a huge potential for finding similar immune receptors, potentially improving both the speed and the accuracy compared to alignment-based metrics. In particular, finding similar sequences is important in two scirpy functions:

    • defining clonotypes
    • querying immune receptor databases.

    For the database query, an online-update algorithm similar to scArches for gene expression would be nice: The autoencoder could be trained on the database (which might have millions of unique receptors) once. A new dataset (which might only have 10k-100k unique receptors), could be projected into the same latent space as the database, significantly improving query time.

    An extension to this idea is to embed gene expression and TCR/BCR data into the same latent space.

    Existing tools

    • Trex by @ncborcherding. Based on keras.
    • mvTCR by @b-schubert's lab. Combines receptor/Gex data. Based on pytorch.
    • TESSA. Combines receptor/Gex data. I'm not even sure it's an autoencoder and still need to check in detail, but it seems to use some clever sequence embeddings.
    • There are likely more...

    @drEast mentioned a few months ago that he is working on something like this. Would you be willing to share a few details, and would you be interested in integrating it with scirpy? @adamgayoso, any chance there's AirrVI soon? :stuck_out_tongue_winking_eye:

    opened by grst 7
  • speed up define_clonotypes

    Description of feature

    The define_clonotypes function scales badly. There are two problems with it:

    • it could be faster (while it relies heavily on numpy, there are parts implemented in Python)
    • parallelization doesn't work properly with large data. Due to how multiprocessing is implemented in Python, parallelization involves a lot of copying. If parallelization worked properly, the speed would still be bearable if one throws enough cores at the problem.

    Where's the bottleneck of the function?

    INPUT:

    • 2 distance matrices, one for unique VJ sequences, one for unique VDJ sequences

    OUTPUT:

    • a clonotype id for each cell

    CURRENT IMPLEMENTATION:

    1. compute unique receptor configurations (i.e. combining cells with the same sequences into a single entry) (fast)
    2. build a lookup table from which the neighbors of each cell can be retrieved (fast enough)
    3. loop through all unique receptor configurations and find neighbors (SLOW)
    4. build a distance matrix (fast)
    5. graph partition using igraph (fast)

    ALTERNATIVE IMPLEMENTATIONS I considered but discarded

    • reindexing sequence distance matrices such that they match the table of unique receptor configurations
    • Then perform matrix operations to combine primary/secondary and TRA/TRB matrices.
    • The problem with this approach is that large dense blocks in the sparse matrices can arise if many unique receptors have the same sequence (e.g. same TRA but different TRB).

    Possible solutions

    • fix parallelization (shared memory)
    • reimplement using jax/numba (this may also solve the parallelization and provide GPU support)
    • Combine steps 2-4 into a single step (maybe possible with sequence embedding; see #369). Note that this would be an alternative route and wouldn't replace ir_dist/define_clonotypes completely.
    • Special-casing: In the case of omniscope data (which only has TRB chains), the problem simplifies to reindexing a sparse matrix. If using only one pair of sequences per cell, the problem is likely also simpler.
    opened by grst 0
  • Speed up read_airr

    Description of feature

    Loading AIRR data with 1.5M rows takes ~10 minutes. This is not too bad, but it could be made less annoying:

    • Make validation optional (I expect the validation of the airr implementation takes a good chunk of that time)
    • Parallelization (read different parts of the file in parallel - or first read into pandas, parse several chunks of the dataframe in parallel)
    • Progress bar (this at least shows that this will eventually finish)
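
    The chunking and progress-bar ideas could look roughly like this (a hedged sketch that bypasses AIRR validation entirely; this is not how read_airr is currently implemented):

        import pandas as pd
        from tqdm import tqdm

        def read_airr_chunked(path: str, chunksize: int = 100_000) -> pd.DataFrame:
            """Read an AIRR rearrangement TSV in chunks, showing progress."""
            reader = pd.read_csv(path, sep="\t", chunksize=chunksize)
            chunks = [chunk for chunk in tqdm(reader, desc="reading AIRR")]
            return pd.concat(chunks, ignore_index=True)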
    opened by grst 0
Releases (v0.11.2)
  • v0.11.2(Nov 20, 2022)

  • v0.11.1(Aug 18, 2022)

    Fixes

    • Solve incompatibility with scipy v1.9.0 (#360)

    Internal changes

    • do not autodeploy docs via CI (currently broken)
    • updated patched version of scikit-learn
  • v0.11.0(Jul 5, 2022)

    Additions

    • Add data loader for BD Rhapsody single-cell immune-cell receptor data (io.read_bd_rhapsody) (#351)

    Fixes

    • Fix type conversions in from_dandelion (#349).
    • Update minimal dandelion version

    Documentation

    • Rebranding to scverse (#324, #326)
    • Add issue templates
    • Fix IMGT typos (#344 by @emjbishop)

    Internal changes

    • Bump default CI python version to 3.9
    • Use patched version of scikit-bio in CI until https://github.com/biocore/scikit-bio/pull/1813 gets merged
  • v0.10.1(Nov 22, 2021)

  • v0.10.0(Nov 15, 2021)

    Additions

    This release adds a new feature to query reference databases (#298), comprising:

    • an extension of pp.ir_dist to compute distances to a reference dataset,
    • tl.ir_query, to match immune receptors to a reference database based on the distances computed with ir_dist,
    • tl.ir_query_annotate and tl.ir_query_annotate_df to annotate cells based on the result of tl.ir_query, and
    • datasets.vdjdb which conveniently downloads and processes the latest version of VDJDB.

    Fixes

    • Bump minimal dependencies for networkx and tqdm (#300)
    • Fix issue with repertoire_overlap (Fix #302 via #305)
    • Fix issue with define_clonotype_clusters (Fix #303 via #305)
    • Suppress FutureWarnings from pandas in tutorials (#307)

    Internal changes

    • Update sphinx to >= 4.1 (#306)
    • Update black version
    • Update the internal folder structure: tl, pp etc. are now real packages instead of aliases
  • v0.9.1(Sep 24, 2021)

    Fixes

    • Scirpy can now import additional columns from Cellranger 6 (#279 by @naity)
    • Fix minor issue with include_fields in AirrCell (#297)

    Documentation

    • Fix broken link in README (#296)
    • Add developer documentation (#294)
  • v0.9.0(Sep 7, 2021)

    Additions

    • Add the new "clonotype modularity" tool, which ranks clonotypes by how strongly connected their gene expression neighborhood graph is (#282).

    The below example shows three clonotypes (164, 1363, 942), two of which consist of cells that are transcriptionally related.

    [Figures: example clonotypes; clonotype modularity vs. FDR]

    Deprecations

    • tl.clonotype_imbalance is now deprecated in favor of the new clonotype modularity tool.

    Fixes

    • Fix calling locus from gene name in some cases (#288)
    • Compatibility with networkx>=2.6 (#292)

    Minor updates

    • Fix some links in README (#284)
    • Fix old instances of clonotype in docs (should be clone_id) (#287)
  • v0.8.0(Jul 22, 2021)

    Additions

    • tl.alpha_diversity now supports all metrics from scikit-bio, the D50 metric and custom callback functions (#277 by @naity)
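
    For instance (a sketch based on the note above; the groupby column name is an assumption):

        import scirpy as ir

        # D50: the fraction of clonotypes accounting for 50% of the cells in each group
        ir.tl.alpha_diversity(adata, groupby="sample", metric="D50")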

    Fixes

    • Handle input data with "productive" chains which don't have a junction_aa sequence annotated (#281)
    • Fix issue with serialized "extra chains" not being imported correctly (#283 by @zktuong)

    Minor changes

    • The CI can now build documentation from pull-requests from forks. PR docs are not deployed to github-pages anymore, but can be downloaded as artifact from the CI run.
  • v0.7.1(Jul 2, 2021)

    Fixes

    • Ensure compatibility with the latest version of dandelion (e78701c)
    • Add links to older versions of the documentation (#275)
    • Fix an issue where clonotype analysis couldn't be continued after saving and reloading an h5ad object (#274)
    • Allow "None" values to be present as cell-level attributes during merge_airr_chains (#273)

    Minor changes

    • Require anndata >= 0.7.6 in conda tests (#266)
  • v0.7.0(Apr 28, 2021)

    This update features

    • a change of Scirpy's data structure to improve interoperability with the AIRR standard, and
    • a complete re-write of the clonotype definition module for improved performance.

    This required several backwards-incompatible changes. Please read the release notes below and the updated tutorials.

    Backwards-incompatible changes

    Improve Interoperability by fully supporting the AIRR standard (#241)

    Scirpy stores receptor information in adata.obs. In this release, we updated the column names to match the AIRR Rearrangement standard. Our data model is now much more flexible, allowing import of arbitrary immune-receptor (IR) chain-related information. Use scirpy.io.upgrade_schema() to update existing AnnData objects to the latest format.
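
    Usage is a one-liner (the file name below is hypothetical):

        import scanpy as sc
        import scirpy as ir

        adata = sc.read_h5ad("scirpy_v0.6_object.h5ad")  # hypothetical path
        ir.io.upgrade_schema(adata)  # renames the columns in adata.obs in place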

    Closed issues #240, #253, #258, #255, #242, #215.

    This update includes the following changes:

    • IrCell is now replaced by AirrCell which has additional functionality
    • IrChain has been removed. Use a plain dictionary instead.
    • CDR3 information is now read from the junction and junction_aa columns instead of cdr3_nt and cdr3, respectively.
    • Clonotype assignments are now stored by default in the clone_id column.
    • expr and expr_raw are now duplicate_count and consensus_count.
    • {v,d,j,c}_gene is now {v,d,j,c}_call.
    • There's now an extra_chains column containing all IR-chains that don't fit into our receptor model. These chains are not used by scirpy, but can be re-exported to different formats.
    • merge_with_ir is now split up into merge_with_ir (to merge IR data with transcriptomics data) and merge_airr_chains (to merge several adatas with IR information, e.g. BCR and TCR data).
    • Tutorial and documentation updates, to reflect these changes
    • Sequences are not converted to upper case on import. Scirpy tools that consume the sequences convert them to upper case on-the-fly.
    • {to,from}_ir_objs has been renamed to {to,from}_airr_cells.

    Refactor CDR3 network creation (#230)

    Previously, pp.ir_neighbors constructed a cell x cell network based on clonotype similarity. This led to performance issues with highly expanded clonotypes (i.e., thousands of cells with exactly the same receptor configuration). Such cells would form dense blocks in the sparse adjacency matrix (see issue #217). Another downside was that expensive alignment distances had to be recomputed every time the parameters of ir_neighbors were changed.

    The new implementation computes distances between all unique receptor configurations, only considering one instance of highly expanded clonotypes.
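
    In practice, the new two-step workflow looks roughly like this (a sketch assembled from the function names in these notes):

        import scirpy as ir

        # step 1: distances between unique receptor sequences (expensive, computed once)
        ir.pp.ir_dist(adata, sequence="aa", metric="levenshtein", cutoff=2)

        # step 2: cheap to re-run with different receptor_arms/dual_ir settings
        ir.tl.define_clonotype_clusters(
            adata, sequence="aa", metric="levenshtein", receptor_arms="all", dual_ir="any"
        )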

    Closed issues #243, #217, #191, #192, #164.

    This update includes the following changes:

    • pp.ir_neighbors has been replaced by pp.ir_dist.
    • The options receptor_arms and dual_ir have been moved from pp.ir_neighbors to tl.define_clonotypes and tl.define_clonotype_clusters.
    • The default key for clonotype clusters is now cc_{distance}_{metric} instead of ct_cluster_{distance}_{metric}.
    • same_v_gene now fully respects the options dual_ir and receptor_arms
    • v-genes and receptor types were previously simply appended to clonotype ids (when same_v_gene=True). Now clonotypes with different v-genes get assigned a different numeric id.
    • Distance metric classes have been moved from ir_dist to ir_dist.metrics.
    • Distances matrices generated by ir_dist are now square and symmetric instead of triangular.
    • The default value for dual_ir is now any instead of primary_only (Closes #164).
    • The API of clonotype_network has changed.
    • Clonotype network now visualizes cells with identical receptor configurations. The number of cells with identical receptor configurations is shown as point size (and optionally, as color). Clonotype network does not support plotting multiple colors at the same time any more.

    | Clonotype network (previous implementation) | Clonotype network (now) |
    | --- | --- |
    | Each dot represents a cell. Cells with identical receptors form a fully connected subnetwork | Each dot represents cells with identical receptors. The dot size refers to the number of cells |
    | image | image |

    Drop Support for Python 3.6

    • Support Python 3.9, drop support for Python 3.6, following the numpy guidelines. (#229)

    Fixes

    • tl.clonal_expansion and tl.clonotype_convergence now respect cells with missing receptors and return nan for those cells. (#252)

    Additions

    • util.graph.igraph_from_sparse_matrix allows converting a sparse connectivity or distance matrix to an igraph object.
    • ir_dist.sequence_dist now also works with sequence arrays that contain duplicate entries (#192)
    • from_dandelion and to_dandelion facilitate interaction with the Dandelion package (#240)
    • write_airr allows writing scirpy's adata.obs back to the AIRR Rearrangement format.
    • read_airr now tries to infer the locus from gene names if no locus column is present.
    • ir.io.upgrade_schema allows upgrading an existing scirpy anndata object to be compatible with the latest version of scirpy
    • define_clonotypes and define_clonotype_clusters now print a logging message indicating where the results have been stored (#215)

    Minor changes

    • tqdm now uses IPython widgets to display progress bars, if available
    • process_map from tqdm is now used to display progress bars for parallel computations instead of the custom implementation used previously (f307c2b)
    • matplotlib's grid lines are now suppressed by default in all plots.
    • Docs from the master branch are now deployed to icbi-lab.github.io/scirpy/develop instead of the main documentation website. The main website only gets updated on releases.
    • Refactored the _is_na function that checks if a string evaluates to None.
    • Fixed outdated documentation of the receptor_arms parameter (#264)
  • v0.6.1(Jan 30, 2021)

    Fixes

    • Fix an issue where define_clonotype failed when the clonotype network had no edges (#236).
    • Require pandas >= 1.0 and fix a pandas incompatibility in merge_with_ir (#238).
    • Ensure consistent order of the spectratype dataframe (#238).

    Minor changes

    • Fix missing bibtex_bibfiles option in sphinx configuration
    • Work around https://github.com/takluyver/flit/issues/383.
  • v0.6.0(Dec 10, 2020)

    Backwards-incompatible changes:

    • Set more sensible defaults for the cutoff parameter in ir_neighbors. The default is now 2 for the hamming and levenshtein distance metrics and 10 for the alignment distance metric.

    Additions:

    • Add Hamming-distance as additional distance metric for ir_neighbors (#216 by @ktpolanski)

    Minor changes:

    • Fix MacOS CI (#221)
    • Use mamba instead of conda in CI (#216)
  • v0.5.0(Oct 20, 2020)

    Add support for BCRs and gamma-delta TCRs

    Backwards-incompatible changes:

    • The data structure has changed. Columns have been renamed from TRA_xxx and TRB_xxx to IR_VJ_xxx and IR_VDJ_xxx. Additionally, a locus column has been added for each chain.
    • All occurrences of tcr in the function and class names have been replaced with ir. Aliases for the old names have been created and emit a FutureWarning.

    Additions:

    • There's now a mixed TCR/BCR example dataset (maynard2020) available (#211)
    • BCR-related amendments to the documentation (#206)
    • tl.chain_qc, which supersedes chain_pairing. It additionally provides information about the receptor type.
    • io.read_tracer now supports gamma-delta T cells (#207)
    • io.to_ir_objs allows converting adata to a list of IrCells (#210)
    • io.read_bracer allows reading in BraCeR BCR data (#208)
    • The pp.merge_with_ir function can now handle the case where both the left and the right AnnData object contain immune receptor information. This is useful when integrating both TCR and BCR data into the same dataset. (#210)

    Fixes:

    • Fix a bug in vdj_usage that was triggered by the new data structure (#203)

    Minor changes:

    • Removed the tqdm monkey patch, as the issue has been resolved upstream (#200)
    • Add AIRR badge, as scirpy is now certified to comply with the AIRR software standard v1. (#202)
    • Require pycairo >1.20, which provides a Windows wheel, eliminating the CI problems.
  • d0.1.0(Oct 20, 2020)

  • v0.4.2(Oct 1, 2020)

  • v0.4.1(Sep 30, 2020)

    • Fix pythonpublish CI action
    • Update black version (and code style, accordingly)
    • Changes for AIRR-compliance:
      • Add support level to README
      • Add Biocontainer instructions to README
      • Add a minimal test suite to be run on conda CI
  • v0.4(Aug 26, 2020)

    • Adapt tcr_dist to support a second array of sequences (#166). This enables comparing CDR3 sequences against a list of reference sequences.
    • Add tl.clonotype_convergence, which helps to find evidence of convergent evolution (#168)
    • Optimize parallel sequence distance calculation (#171). There is now less communication overhead with the worker processes.
    • Fixed an error when running pp.tcr_neighbors (#177)
    • Improve packaging. Use setuptools_scm instead of get_version. Remove redundant metadata. (#180). More tests for conda (#180).
  • v0.3(Jun 5, 2020)

    • More extensive CI tests (now also testing on Windows, MacOS and testing the conda recipe) (#136, #138)
    • Add example images to API documentation (#140)
    • Refactor IO to expose TcrCell and TcrChain (#139)
    • Create data loading tutorial (#139)
    • Add a progressbar to TCR neighbors (#143)
    • Move clonotype_network_igraph to tools (#144)
    • Add read_airr to support the AIRR rearrangement format (#147)
    • Add option to take v-gene into account during clonotype definition (#148)
    • Store colors in AnnData to ensure consistent coloring across plots (#151)
    • Divide define_clonotypes into define_clonotypes and define_clonotype_clusters (#152). Now, the user has to explicitly specify sequence and metric for tl.tcr_neighbors, tl.define_clonotype_clusters and tl.clonotype_network. This makes it more straightforward to have multiple, different versions of the clonotype network at the same time. The default parameters changed to sequence="nt" and metric="identity" to comply with the traditional definition of clonotypes. The changes are also reflected in the glossary and the tutorial.
    • Update the workflow figure (#154)
    • Fix a bug that caused labels in the repertoire_overlap heatmap to be mixed up. (#157)
    • Add a label to the heatmap annotation in repertoire_overlap (#158).
  • v0.2(May 22, 2020)

    • Documentation overhaul. A lot of docstrings were corrected and improved, and the formatting of the documentation now matches Scanpy's.
    • Experimental function to assess bias in clonotype abundance between conditions (#92)
    • Scirpy now has a logo (#123)
    • Update default parameters for clonotype_network:
      • Edges are now only automatically displayed if plotting < 1000 nodes
      • If plotting variables with many categories, the legend is hidden.
    • Update default parameters for alignment-based tcr_neighbors
      • The gap extend penalty now equals the gap open penalty (11).
  • v0.1.2(Apr 15, 2020)

    • Make 10x csv and json import consistent (#109)
    • Fix version requirements (#112)
    • Fix compatibility issues with pandas > 1 (#112)
    • Updates to tutorial and README
  • v0.1.1(Apr 10, 2020)
