User-friendly bulk RNAseq deconvolution using simulated annealing

Welcome to `cellanneal` - The user-friendly application for deconvolving omics data sets.

cellanneal is an application for deconvolving biological mixture data into its constituent cell types. It comes both as a python package, which includes a command line interface (CLI), and as graphical software (graphical user interface, GUI) with the entire application bundled into a single executable. The python package with CLI can be downloaded from this repository; the graphical version is available for Microsoft Windows and macOS and can be downloaded from Zenodo.

Download `cellanneal` graphical software for Windows

Download `cellanneal` graphical software for macOS

IMPORTANT: The graphical software has a startup time of up to one minute.

IMPORTANT: On macOS, when opening the graphical software for the first time, you must do so via right-click --> "Open" and then choose "Open" in the emerging dialogue, see below for more info.

1. How does cellanneal work?
2. Installation
    a. python package and command line interface
    b. graphical software
3. Requirements for input data files
4. Parameters
5. Using cellanneal
    a. python package
    b. command line interface
    c. graphical software
6. cellanneal output
    a. deconvolution results
    b. figures
    c. genewise comparison
7. FAQs

1. How does `cellanneal` work?

Given a gene-expression vector of a cellular mixture (for example derived from bulk RNA sequencing, the "mixture data") and gene-expression vectors characterising individual cell types (for example derived from clustered single-cell RNA sequencing data, the "signature data"), cellanneal provides an estimate of what fraction of each cell type is present in the bulk sample.

During the deconvolution process, a computational mixture sample is constructed from a set of cell type fractions and the signature data. The resulting synthetic gene expression vector is compared to the gene expression vector of the real mixture by calculating Spearman's correlation coefficient between the two. The cell type fractions are then varied by simulated annealing, as implemented in scipy's dual_annealing, until this correlation is maximised. The cell type fractions associated with the highest Spearman correlation between the experimental mixture (bulk sample gene expression) and the computational mixture are the cellanneal estimate of the mixture composition in terms of the cell types supplied in the signature file.
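The procedure described above can be illustrated with a minimal sketch. It is not the code used inside cellanneal: the toy data, the function name neg_spearman, the bounds and the reduced iteration number are all assumptions made for illustration. Only scipy's dual_annealing and spearmanr are real library calls.

```python
import numpy as np
from scipy.optimize import dual_annealing
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 genes, 3 cell types (columns of `signature`).
n_genes, n_types = 200, 3
signature = rng.gamma(shape=0.5, scale=100.0, size=(n_genes, n_types))
true_fractions = np.array([0.6, 0.3, 0.1])
mixture = signature @ true_fractions  # noise-free synthetic "bulk" sample


def neg_spearman(weights, signature, mixture):
    """Negative Spearman correlation between computational and real mixture."""
    fractions = weights / weights.sum()   # rescale weights so they sum to 1
    synthetic = signature @ fractions     # computational mixture
    rho, _ = spearmanr(synthetic, mixture)
    return -rho                           # dual_annealing minimises


# Assumed search space: one weight per cell type; the small lower bound
# avoids an all-zero weight vector.
bounds = [(1e-6, 1.0)] * n_types
result = dual_annealing(neg_spearman, bounds, args=(signature, mixture),
                        maxiter=200, seed=0)

estimated = result.x / result.x.sum()
print("estimated fractions:", np.round(estimated, 2))
print("Spearman correlation:", round(-result.fun, 3))
```

Because the objective is rank based, several fraction vectors can reach the same correlation on such a small toy problem, so the estimate is not expected to match the true fractions exactly.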

2. Installation

The python package comes with a set of functions which can be included in python workflows, scripts and notebooks, as well as a command-line entry point to cellanneal. The required code can be downloaded from this repository. The graphical software is available for Microsoft Windows and macOS operating systems and can be downloaded from Zenodo.

2a. Installing the python package and CLI

Clone this code repository or download the zipped version and unpack it into a location of choice.

Installing cellanneal into a virtual environment, for example via anaconda, is recommended. cellanneal has been tested with python 3.7 and python 3.8.

It is recommended to install cellanneal's dependencies first; if using conda:

conda install numpy scipy matplotlib pandas seaborn xlrd openpyxl

If using pip:

pip install numpy scipy matplotlib pandas seaborn xlrd openpyxl

Next, navigate into the cellanneal directory (cellanneal-master, the directory containing the file setup.py) on the command line. There, execute the command

pip install .

That's it. Now you should be able to use cellanneal in your python projects via

import cellanneal

and via the command line as

cellanneal mixture_data.csv signature_data.csv output_folder

For more details on how to use it, see Using cellanneal.

2b. Installing the GUI

Installing the graphical software is as simple as downloading the correct version for your operating system from Zenodo and unzipping the content. The archive contains an executable file (the cellanneal application), an example mixture data file and an example signature data file. Please note that the GUI has an initial start-up time of up to one minute.

macOS: Under most security settings, macOS does not allow software from unidentified developers to be opened via double-click. To circumvent this, right-click (secondary click) on the cellanneal executable and choose "Open" at the top of the context menu that appears. This opens a dialogue in which you can then press "Open". Subsequently, the software will also be accessible via double-click.
Windows: Antivirus software may prevent the software from launching - it may be necessary to set an exception or click "Allow" when asked whether to proceed.

For more information on how to use the GUI, see Using the graphical software.

3. Requirements for input data

cellanneal accepts text files (*.csv and *.txt) as well as Excel files (*.xlsx and *.xls) as inputs for both mixture and signature data, provided that they are formatted correctly. Specifically, gene names need to appear in the first column of both mixture and signature data files, and sample names (for the mixture data file) or cell type names (for the signature data file) need to appear in the first row. Example data files can be found in this repository in the example directory.

[Figure: top of an example mixture.csv file]

[Figure: top of an example signature.xlsx file]
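As a rough illustration of this layout, the sketch below builds a tiny mixture table in memory and reads it with pandas. The gene and sample names are made up, and loading via pandas.read_csv with index_col=0 is an assumption about a convenient way to obtain a data frame, not a statement about how cellanneal imports files internally.

```python
import io

import pandas as pd

# Hypothetical miniature mixture file: gene names in the first column,
# sample names in the first row.
mixture_csv = io.StringIO(
    "gene_name,sample1,sample2\n"
    "GeneA,10,0\n"
    "GeneB,5,7\n"
    "GeneC,0,12\n"
)

# index_col=0 turns the gene-name column into the row index.
mixture_df = pd.read_csv(mixture_csv, index_col=0)

print(mixture_df.shape)           # (3, 2): three genes, two samples
print(list(mixture_df.columns))   # ['sample1', 'sample2']
```

A signature file follows the same pattern with cell type names in place of sample names; for Excel files, pandas.read_excel accepts the same index_col argument.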

Further important points regarding the input data:

It is not required that mixture and signature data sets contain exactly the same genes, or that these genes appear in the same order (or in alphabetical order).
Please do not logarithmise the input data before passing it into cellanneal.
Normalisation of mixture data: it is not required that the individual sample columns are normalised to a specific sum value; the normalisation will not affect the outcome.
Normalisation of signature data: normalising the individual cell type columns to the same count sum or not will lead to different results, and whether you wish to normalise may depend on your biological question and available data. Specifically, if you do normalise all cell types to the same count sum, the output of cellanneal will tell you which fraction of the overall RNA was contributed by each cell type. This will not take into account size differences between cell types. In a toy example, if you analyse a mixture of one cell of type A and one cell of type B, where cell A originally contained ten times more RNA than cell B, after normalisation you will obtain the result that 10/11=91% of the RNA stems from type A. If all your reference data stems from the same data set, and you think that the average count sum of cells of a given type is a good proxy for their size, you may instead choose not to normalise the values (in this case, the sum of all counts for cell type A would be 10 times higher than for B). Then, cellanneal's output can be interpreted as cell fractions, i.e. the above example would return 50% type A and 50% type B. Following this concept, you can also think about normalising your cell types to different sum values based on known or estimated size factors.
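The arithmetic of the toy example can be written out as follows. This is only an illustration of the reasoning, with made-up variable names and the size factors (10 and 1) taken from the example; it does not call cellanneal.

```python
# Toy example from the text: one cell of type A and one of type B,
# where a type-A cell carries ten times more RNA than a type-B cell.
rna_per_cell = {"A": 10.0, "B": 1.0}   # relative size factors (illustrative)
cell_counts = {"A": 1, "B": 1}

# Fraction of the total RNA contributed by each cell type
# (what is obtained when signature columns are normalised to the same sum).
total_rna = sum(rna_per_cell[t] * cell_counts[t] for t in cell_counts)
rna_fractions = {t: rna_per_cell[t] * cell_counts[t] / total_rna
                 for t in cell_counts}

# Dividing RNA fractions by the size factors and renormalising
# converts them into cell fractions.
unscaled = {t: rna_fractions[t] / rna_per_cell[t] for t in rna_fractions}
norm = sum(unscaled.values())
cell_fractions = {t: value / norm for t, value in unscaled.items()}

print({t: round(f, 3) for t, f in rna_fractions.items()})   # {'A': 0.909, 'B': 0.091}
print({t: round(f, 3) for t, f in cell_fractions.items()})  # {'A': 0.5, 'B': 0.5}
```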

4. Parameters

cellanneal allows the user to set four parameters. The first three govern the set of genes underlying the deconvolution process for each sample; the fourth parameter (iteration number) specifies for how long to run the optimisation process. Each parameter is discussed below.

Minimum expression in mixture (bulk_min) - minimum required expression level in the mixture sample (where total expression is normalised to sum up to 1) for a gene to be considered, default=1e-5. Allowed values are in the range [0, 1) but must be smaller than the maximum allowed expression. This parameter makes it possible to exclude lowly expressed and potentially noisy genes.
Maximum expression in mixture (bulk_max) - maximum allowed expression level in the mixture sample (where total expression is normalised to sum up to 1) for a gene to be considered, default=0.01. Allowed values are in the range (0, 1] but must be larger than the minimum allowed expression. This parameter makes it possible to exclude very highly expressed genes, which are potential contaminants.
Minimum scaled dispersion (disp_min) - minimum scaled dispersion (variance/mean) over cell types for a gene to be considered, default=0.5. The value indicates the number of standard deviations by which the dispersion of a specific gene lies above or below the mean when compared to genes of similar expression. All numerical values are allowed, but reasonable values for most cases lie between 0 and 1, as this parameter is used to select genes which vary across cell types in the signature file while still keeping a broad gene base for robust deconvolution.
Maximum number of iterations (maxiter) - the maximum number of iterations through the logical chain of the underlying optimisation algorithm, scipy's dual annealing, default=1000. Typical problems have converged after this number of iterations; problems with a very high number of cell types may require more.
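To make the two expression thresholds concrete, here is a small sketch of how such a filter could look. The function genes_within_thresholds and the toy counts are hypothetical, and whether cellanneal treats the bounds as inclusive is an assumption; the dispersion-based selection is not reproduced here.

```python
import pandas as pd


def genes_within_thresholds(mixture_df, sample, bulk_min=1e-5, bulk_max=0.01):
    """Return genes whose normalised expression in `sample` lies within the bounds."""
    column = mixture_df[sample]
    normalised = column / column.sum()   # total expression sums to 1
    # Assumption: both bounds are treated as inclusive.
    keep = (normalised >= bulk_min) & (normalised <= bulk_max)
    return list(normalised.index[keep])


# Hypothetical counts summing to 1,000,000 for easy mental arithmetic.
toy_mixture = pd.DataFrame(
    {"sample1": [1, 2000, 3000, 994999]},
    index=["Low", "Mid1", "Mid2", "High"],
)

print(genes_within_thresholds(toy_mixture, "sample1"))   # ['Mid1', 'Mid2']
```

Here "Low" (1e-6 of total expression) falls below bulk_min and "High" (0.995) exceeds bulk_max, so only the two intermediate genes remain.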

5. Using `cellanneal`

cellanneal can be used as part of a python workflow or individually via the command line or the graphical software. All three use cases are explained below.

5a. Using the python package

The python package provides functions for the three main steps of a deconvolution analysis with cellanneal: identification of a gene set for deconvolution, deconvolution using simulated annealing, and plotting the results. A quick start workflow is available in the examples folder.

In order to use cellanneal in your python workflow, you need to import it:

import cellanneal

As a first step, a gene set on which to base deconvolution has to be identified for each mixture sample. This step uses the parameters bulk_min, bulk_max and disp_min, which are explained in the section [Parameters](#4-parameters). The function make_gene_dictionary takes these inputs and produces a dictionary holding a gene list for each mixture sample:

gene_dict = cellanneal.make_gene_dictionary(
                    signature_df,
                    mixture_df,
                    disp_min=0.5,
                    bulk_min=1e-5,
                    bulk_max=0.01)

Next, deconvolution is run and a pandas.DataFrame holding the results is returned:

all_mix_df = cellanneal.deconvolve(
                signature_df,
                mixture_df,
                maxiter=1000,
                gene_dict=gene_dict)

Finally, four plotting options for deconvolution results are provided with cellanneal - pie charts, two heatmaps, and a scatter plot showing correlations between computational and real mixture samples.

cellanneal.plot_pies(all_mix_df)
cellanneal.plot_mix_heatmap(all_mix_df)
cellanneal.plot_mix_heatmap_log(all_mix_df)
cellanneal.plot_scatter(all_mix_df, mixture_df, signature_df, gene_dict)

5b. Using the command line interface

After installing the python package, a single command line command, cellanneal, becomes available. Note that if you are using conda environments, this command will only be available inside the environment into which you installed it and you need to activate this environment via conda activate my_env before you can make calls to cellanneal.

cellanneal requires three arguments,

the path to the mixture data file (*.csv, *.txt, *.xlsx, or *.xls)
the path to the signature data file (*.csv, *.txt, *.xlsx, or *.xls)
the path to the folder in which the results are to be stored

and allows the user to set four parameters,

bulk_min, the minimum required gene expression in the mixture
bulk_max, the maximum allowed gene expression in the mixture
disp_min, the minimum required scaled dispersion across cell types
maxiter, the maximum iteration number for scipy's dual_annealing

resulting in the following call signature:

cellanneal [-h] [--bulk_min BULK_MIN] [--bulk_max BULK_MAX]
                [--disp_min DISP_MIN] [--maxiter MAXITER]
                bulk_data_path celltype_data_path output_path

Further information about each parameter can be found in section Parameters.
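To show how the optional flags and positional arguments of this call signature combine, the following sketch rebuilds an equivalent parser with Python's argparse. It is a reconstruction from the documented signature and defaults only, not the parser shipped with cellanneal; the function name build_parser and the file names are illustrative.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented cellanneal call signature."""
    parser = argparse.ArgumentParser(prog="cellanneal")
    parser.add_argument("--bulk_min", type=float, default=1e-5)
    parser.add_argument("--bulk_max", type=float, default=0.01)
    parser.add_argument("--disp_min", type=float, default=0.5)
    parser.add_argument("--maxiter", type=int, default=1000)
    parser.add_argument("bulk_data_path")
    parser.add_argument("celltype_data_path")
    parser.add_argument("output_path")
    return parser


# Equivalent to typing:
#   cellanneal --bulk_min 1e-4 --maxiter 2000 mixture_data.csv signature_data.csv output_folder
args = build_parser().parse_args(
    ["--bulk_min", "1e-4", "--maxiter", "2000",
     "mixture_data.csv", "signature_data.csv", "output_folder"]
)
print(args.bulk_min, args.disp_min, args.maxiter)   # 0.0001 0.5 2000
```

Flags that are not given on the command line keep their defaults, as disp_min does in this call.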

5c. Using the graphical software

After download, the graphical user interface can be opened by double-clicking the executable. A console opens immediately; the graphical user interface follows with a delay of up to one minute as the bundled python packages have to be unpacked into a temporary directory. Please be patient and do not close the console. Once started, the interface looks like this:

[Figure: screenshot of the cellanneal graphical user interface]

The user can now select mixture data, signature data and an output folder from the file system using the three Browse file system buttons in the upper half of the interface. Optionally, the four parameters (see also section Parameters) can be changed via the Change parameters button which opens a separate window for entering parameter values. Parameter value defaults can be restored by clicking on the Reset to default values button.

Finally, a deconvolution run is started by pressing the button run cellanneal at the bottom of the interface. Once running, the interface becomes unresponsive until the process finishes. While cellanneal is running, detailed progress updates are printed into the accompanying console. When the run has finished, all results can be found in a directory labelled with the name of the mixture file and a timestamp inside the user-defined output folder. For further information on the output created by cellanneal, see section Output.

In order to shut down the application, the console window needs to be closed.

6. `cellanneal` output

cellanneal runs which were started from either the command line or the graphical user interface create a timestamped directory inside the user-specified output folder. This directory contains three folders with tabular results and figures; their contents are discussed below. Additionally, a text file containing the names of mixture and signature files and the parameters of the run is stored at the top level of the results folder.

6a. Folder "deconvolution results"

This folder contains a CSV file with the main result of cellanneal: the fractional composition of each mixture in terms of cell types. Cell type names are shown in the first row; mixture sample names in the first column. Each numerical value in the table indicates the fraction the corresponding cell type occupies in the corresponding sample.

6b. Folder "figures"

If the input mixture file contains up to 100 samples, a standard cellanneal run produces four figures:

A figure with one pie chart per sample with each part of the pie representing the size of a cell type fraction.
A heatmap in which mixture samples run across the horizontal axis and cell types along the vertical one, each coloured square indicating the corresponding cell type fraction.
A second heatmap, similar to the first one, but showing log10(cell type fractions) instead in order to display small cell populations more clearly.
A figure with one scatter plot per sample. In each scatter plot, each dot represents one gene and a dot's location is determined by its expression in the real mixture (x-axis) and its expression in the optimal computational mixture (i.e. the cellanneal result, y-axis). This figure helps judge how well cellanneal was able to approximate the real mixture sample by producing a computational mixture of the supplied cell types.

6c. Folder "genewise comparison"

This folder contains one CSV file per mixture sample in the input data. Based on the deconvolution gene set for each sample, the file shows the normalised gene-wise expression in the experimental mixture (user input) in the first column and the corresponding expression in the optimal computational mixture in the second. The third column gives the ratio between the two (experimental/computational); the fourth gives the logarithm of this fold change. The purpose of this file is to make it possible to search for genes with particularly high discrepancies between experimental and computational mixtures. Such genes may be of biological or medical interest: as an example, if the signature data stemmed from healthy people, but the mixture file from a pathology, genes with a high fold change between experiment and deconvolution result may have implications in the disease.
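The following sketch shows how such a table could be used to rank genes by discrepancy. The column names, the toy values and the choice of a base-2 logarithm are assumptions for illustration; the actual column headers and log base in the cellanneal output files may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical genewise comparison table for one sample.
comparison = pd.DataFrame(
    {"experimental": [0.004, 0.002, 0.001],
     "computational": [0.001, 0.002, 0.002]},
    index=["GeneA", "GeneB", "GeneC"],
)

# Ratio experimental/computational and its logarithm (base 2 assumed here).
comparison["ratio"] = comparison["experimental"] / comparison["computational"]
comparison["log2_ratio"] = np.log2(comparison["ratio"])

# Rank genes by absolute log fold change, largest discrepancy first.
order = comparison["log2_ratio"].abs().sort_values(ascending=False).index
ranked = comparison.loc[order]
print(ranked.index.tolist())   # ['GeneA', 'GeneC', 'GeneB']
```

Sorting by the absolute value treats over- and under-estimated genes alike, which suits a first scan for candidates.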

7. Frequently Asked Questions

How small/large should my gene sets be? cellanneal draws its robustness and ability to identify small cell populations from its permissive gene selection strategy. Ideally, we want to include every gene which we believe to carry information (i.e. which is expressed above the noise level and has at least some meaningful variability between cell types). Very large gene sets (~5000 genes and above) may lead to long runtimes, but if the data quality allows, the user should aim to have gene set sizes upwards of 2000 or even 3000 genes.
Why can my data not be imported? Please make sure that your data is formatted as described in the section Requirements for input data. Common pitfalls include:
- Excel files downloaded from publications contain a title row (e.g. "Supplementary Table 2")
- Excel may have converted some of your gene names to dates (e.g. "MAR1", "SEPT9"...)
- CSV files have fewer columns in the first row (the row with the sample or cell type names) than in subsequent rows because the first row reads sample1, sample2, sample3 instead of gene_name, sample1, sample2, sample3 (in the subsequent data rows, the first column contains the gene name).
What happens to mitochondrial genes? I noticed they are not part of my output. Mitochondrial genes (gene names starting with "MT-", "Mt-", "mt-") are removed from the gene list on which deconvolution is based in the cellanneal workflow. This happens after genes are selected based on minimum and maximum expression thresholds.
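Two of the points above lend themselves to a quick check before running a deconvolution. The sketch below uses only the Python standard library; both helper functions are hypothetical and are not part of cellanneal. The first detects the CSV header pitfall, the second applies the mitochondrial prefix rule quoted above to a list of gene names.

```python
import csv
import io

MITO_PREFIXES = ("MT-", "Mt-", "mt-")


def header_is_missing_gene_column(csv_text):
    """True if the header row has fewer fields than the first data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, first_data_row = rows[0], rows[1]
    return len(header) < len(first_data_row)


def drop_mitochondrial(genes):
    """Remove gene names that start with one of the mitochondrial prefixes."""
    return [gene for gene in genes if not gene.startswith(MITO_PREFIXES)]


bad_csv = "sample1,sample2\nGeneA,1,2\n"
good_csv = "gene_name,sample1,sample2\nGeneA,1,2\n"

print(header_is_missing_gene_column(bad_csv))               # True
print(header_is_missing_gene_column(good_csv))              # False
print(drop_mitochondrial(["MT-CO1", "ACTB", "mt-Nd1"]))     # ['ACTB']
```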



"Error: Sample could not be deconvolved"

I got the following error message, for example: "Error: Sample GTEX-11DXY-0526-SM-5EGGQ could not be deconvolved. Possibly the gene set for this sample is too small. See online documentation for more info."

This happened when running GTEx bulk data with a signature derived from Tabula Sapiens.

Running the cellanneal app with the same input is successful.

opened by ronsend 1
Uncaught exception if there is only one hv gene.

Probably not very relevant in real data scenarios, but this came up in a mini example which yielded only 1 highly variable gene within bounds for one of the samples. Output:

    +++ Welcome to cellanneal! +++

    +++ Constructing gene sets ... +++
    10 highly variable genes identified in cell type reference.
    4 of these are within thresholds for sample 1_nmc_H5
    3 of these are within thresholds for sample 2_nmb_H5
    1 of these are within thresholds for sample 3_mb_H5
    2 of these are within thresholds for sample 4_mc_H5
    2 of these are within thresholds for sample 5_mcol_H5

    +++ Running cellanneal ... +++
    Deconvolving sample 1_nmc_H5 ...
    Deconvolving sample 2_nmb_H5 ...
    Deconvolving sample 3_mb_H5 ...
    Exception in Tkinter callback
    Traceback (most recent call last):
      File "/Users/lisa/opt/anaconda3/envs/cellanneal/lib/python3.7/tkinter/__init__.py", line 1705, in __call__
        return self.func(*args)
      File "cellgui.py", line 224, in <lambda>
        command=lambda: self.cellanneal(),
      File "cellgui.py", line 533, in cellanneal
        Path(self.output_path.get())) # path object!
      File "/Users/lisa/X/lisabu/cellanneal_gui_dev/cellanneal/cellanneal/pipelines.py", line 43, in cellanneal_pipe
        no_local_search=False)
      File "/Users/lisa/X/lisabu/cellanneal_gui_dev/cellanneal/cellanneal/general.py", line 316, in deconvolve
        maxiter=maxiter,
      File "/Users/lisa/X/lisabu/cellanneal_gui_dev/cellanneal/cellanneal/dual_annealing.py", line 610, in dual_annealing
        energy_state.reset(func_wrapper, rand_state, x0)
      File "/Users/lisa/X/lisabu/cellanneal_gui_dev/cellanneal/cellanneal/dual_annealing.py", line 189, in reset
        raise ValueError(message)
    ValueError: Stopping algorithm because function create NaN or (+/-) infinity values even with trying new random parameters

bug

opened by LiBuchauer 1
Uncaught exception if all genes in a mixture have equal counts - np.corrcoef breaks down because std=0

Problem identified by Shalev on Bella's data:

I got the following error message:

    Deconvolving sample 47W_CT ...
    Exception in Tkinter callback
    Traceback (most recent call last):
      File "tkinter\__init__.py", line 1892, in __call__
      File "cellgui.py", line 238, in <lambda>
      File "cellgui.py", line 593, in cellanneal
      File "cellanneal\pipelines.py", line 38, in cellanneal_pipe
      File "cellanneal\general.py", line 313, in deconvolve
      File "cellanneal\dual_annealing.py", line 610, in dual_annealing
      File "cellanneal\dual_annealing.py", line 189, in reset
    ValueError: Stopping algorithm because function create NaN or (+/-) infinity values even with trying new random parameters

When running on the following files:

Mixture: X:/Shalevi/cell_anneal_test_gui/Bella_files/Bella_all_mixtures2.xlsx

Signatures: X:/Shalevi/cell_anneal_test_gui/Bella_files/human_colon_signature_pool_cleaned.xlsx

Parameters were 1e-4 min expression and 0.6 dispersion.

In the attached file there were only 54 genes within these thresholds (a few hundred genes in other samples). It might be good to see how failed samples could be aborted gracefully.

This happens because all 54 genes have only one UMI count, so all are ranked as 27.5, and np.corrcoef then returns NaN when trying to calculate the correlation between this vector of 54*[27.5] and anything else.
bug
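
The failure mode described in this report can be reproduced in isolation. The snippet below is a minimal demonstration with made-up data (54 genes with one count each, compared against an arbitrary second vector); it does not call cellanneal, and using scipy's rankdata for the ranking step is an assumption.

```python
import warnings

import numpy as np
from scipy.stats import rankdata

counts = np.ones(54)           # 54 genes, each with a single UMI count
ranks = rankdata(counts)       # ties share the average rank (1 + 54) / 2
print(ranks[0])                # 27.5

other = np.arange(54, dtype=float)   # any non-constant vector
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the invalid-value warning
    rho = np.corrcoef(ranks, other)[0, 1]
print(np.isnan(rho))           # True
```

Since the rank vector has zero standard deviation, the correlation is undefined; checking for a constant vector before optimisation would be one possible way to skip such samples.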

opened by LiBuchauer 0
Introduce "Stop" button; allow several processes in parallel?

A Stop button for aborting running deconvolution processes from within the GUI has been requested. This is not possible with the current simple architecture, which does not use a separate subprocess for the deconvolution. The same goes for the request to allow several deconvolution processes in parallel.

If the architecture is ever changed to rely on deconvolution subprocesses again, both could be implemented.
enhancement

opened by LiBuchauer 1
