MarcoPolo is a clustering-free approach to the exploration of bimodally expressed genes along with group information in single-cell RNA-seq data

Chanwoo Kim

Last update: Dec 18, 2022

Overview

MarcoPolo is a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering

MarcoPolo is a novel clustering-independent approach to identifying differentially expressed genes (DEGs) in scRNA-seq data. MarcoPolo identifies informative DEGs without depending on prior clustering, and is therefore robust to uncertainties from clustering or cell type assignment. Since DEGs are identified independently of clustering, one can use them to detect subtypes of a cell population that standard clustering does not detect, or to augment highly variable gene (HVG) methods to improve clustering. An advantage of our method is that, by fitting a bimodal distribution to each gene, it automatically learns in which cells the gene is expressed and in which it is not. Additionally, our framework provides analysis results in the form of an HTML file so that researchers can conveniently visualize and interpret the results.

Datasets	URL
Human liver cells (MacParland et al.)	https://chanwkimlab.github.io/MarcoPolo/HumanLiver/
Human embryonic stem cells (Koh et al.)	https://chanwkimlab.github.io/MarcoPolo/hESC/
Peripheral blood mononuclear cells (Zheng et al.)	https://chanwkimlab.github.io/MarcoPolo/Zhengmix8eq/

Installation

MarcoPolo has so far been tested only on Linux machines. Dependencies are as follows:

python (3.7)
- numpy (1.19.5)
- pandas (1.2.1)
- scipy (1.6.0)
- scikit-learn (0.24.1)
- pytorch (1.4.0)
- rpy2 (3.4.2)
- jinja2 (2.11.2)
R (4.0.3)
- Seurat (3.2.1)
- scran (1.18.3)
- Matrix (1.3.2)
- SingleCellExperiment (1.12.0)

Download MarcoPolo using git clone:

git clone https://github.com/chanwkimlab/MarcoPolo.git

We recommend using the following pipeline to install the dependencies.

Install Anaconda; please refer to https://docs.anaconda.com/anaconda/install/linux/ for instructions. Then create a conda environment and activate it:

conda create -n MarcoPolo python=3.7
conda activate MarcoPolo

Install Python packages

pip install numpy==1.19.5 pandas==1.2.1 scipy==1.6.0 scikit-learn==0.24.1 jinja2==2.11.2 rpy2==3.4.2

Also, please install PyTorch from https://pytorch.org/. If you want a CUDA-enabled build of PyTorch, install CUDA in advance.

Install R and required packages

conda install -c conda-forge r-base=4.0.3

In R, run the following commands to install packages.

install.packages("devtools")
devtools::install_version(package = 'Seurat', version = package_version('3.2.1'))
install.packages("Matrix")
install.packages("BiocManager")
BiocManager::install("scran")
BiocManager::install("SingleCellExperiment")

Getting started

Converting your scRNA-seq dataset to a Python-compatible file format

If you have a Seurat object seurat_object, you can save it to a Python-readable file format using the following R code. An example output of the function is in the example directory with the prefix sample_data. The data has 1,000 cells and 1,500 genes in it.

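# Packages that provide the functions used below
library(Seurat)                # as.SingleCellExperiment
library(scran)                 # calculateSumFactors
library(Matrix)                # Matrix, writeMM
library(SingleCellExperiment)  # assay, reducedDim, colData, rowData, sizeFactors

save_sce <- function(sce,path,lowdim='TSNE'){

    # Estimate one size factor per cell
    sizeFactors(sce) <- calculateSumFactors(sce)

    # Write the count matrix (genes x cells) in sparse Matrix Market format,
    # together with its row (gene) and column (cell) names
    save_data <- Matrix(as.matrix(assay(sce,'counts')),sparse=TRUE)

    writeMM(save_data,sprintf("%s.data.counts.mm",path))
    write.table(as.matrix(rownames(save_data)),sprintf('%s.data.row',path),row.names=FALSE, col.names=FALSE)
    write.table(as.matrix(colnames(save_data)),sprintf('%s.data.col',path),row.names=FALSE, col.names=FALSE)

    # Write cell metadata together with the 2D embedding, then gene metadata
    tsne_data <- reducedDim(sce, lowdim)
    colnames(tsne_data) <- c(sprintf('%s_1',lowdim),sprintf('%s_2',lowdim))
    print(head(cbind(as.matrix(colData(sce)),tsne_data)))
    write.table(cbind(as.matrix(colData(sce)),tsne_data),sprintf('%s.metadatacol.tsv',path),row.names=TRUE, col.names=TRUE,sep='\t')
    write.table(cbind(as.matrix(rowData(sce))),sprintf('%s.metadatarow.tsv',path),row.names=TRUE, col.names=TRUE,sep='\t')

    # Write the size factors, one value per line
    write.table(sizeFactors(sce),file=sprintf('%s.size_factor.tsv',path),sep='\t',row.names=FALSE, col.names=FALSE)

}

sce_object <- as.SingleCellExperiment(seurat_object)
save_sce(sce_object, 'example/sample_data')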
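
Before moving on to Python, it can help to confirm that the exported files agree with each other. The following is a minimal sketch using only the Python standard library; the helper names (read_mm_shape, read_names, check_export) are hypothetical and not part of MarcoPolo. It assumes the file suffixes written by save_sce above and that the count matrix is stored as genes by cells.

```python
import csv


def read_mm_shape(mm_path):
    """Return (n_rows, n_cols, n_nonzero) from a Matrix Market coordinate file."""
    with open(mm_path) as handle:
        for line in handle:
            # Lines starting with '%' are the banner or comments; skip them.
            if line.startswith("%") or not line.strip():
                continue
            n_rows, n_cols, n_nonzero = (int(x) for x in line.split())
            return n_rows, n_cols, n_nonzero
    raise ValueError(f"no size line found in {mm_path}")


def read_names(path):
    """Read the first column of a tab-separated file (values may be quoted)."""
    with open(path, newline="") as handle:
        return [row[0] for row in csv.reader(handle, delimiter="\t") if row]


def check_export(prefix):
    """Return a list of inconsistencies among the files written by save_sce."""
    # Assumption: rows of the count matrix are genes, columns are cells.
    n_genes, n_cells, _ = read_mm_shape(f"{prefix}.data.counts.mm")
    genes = read_names(f"{prefix}.data.row")
    cells = read_names(f"{prefix}.data.col")
    size_factors = read_names(f"{prefix}.size_factor.tsv")

    problems = []
    if len(genes) != n_genes:
        problems.append(f"{len(genes)} gene names for {n_genes} matrix rows")
    if len(cells) != n_cells:
        problems.append(f"{len(cells)} cell names for {n_cells} matrix columns")
    if len(size_factors) != n_cells:
        problems.append(f"{len(size_factors)} size factors for {n_cells} cells")
    return problems
```

Calling check_export('example/sample_data') should return an empty list when the files are mutually consistent; any returned message points at the file that needs to be re-exported.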

Running MarcoPolo

Please use the same path argument you used when running the save_sce function above. You can incorporate a covariate (denoted as β in the paper) in modeling the read counts by setting the Covar parameter.

import MarcoPolo.QQscore as QQ
import MarcoPolo.summarizer as summarizer

path='scRNAdata'
QQ.save_QQscore(path=path,device='cuda:0')
allscore=summarizer.save_MarcoPolo(input_path=path,
                                   output_path=path)
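
The call above hard-codes device='cuda:0', which requires a CUDA-enabled PyTorch build. The sketch below shows one way to fall back to the CPU when CUDA is unavailable. The helper name pick_device is hypothetical, and it is an assumption (not confirmed by this README) that save_QQscore accepts 'cpu' or any other PyTorch device string for its device argument.

```python
def pick_device(preferred="cuda:0"):
    """Return `preferred` if PyTorch can use CUDA, otherwise 'cpu'."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; MarcoPolo itself would not run in that
        # case, but the helper still returns a well-defined value.
        return "cpu"
    return preferred if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"selected device: {device}")
```

The returned string could then be passed as QQ.save_QQscore(path=path, device=device), under the assumption stated above.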

Generating MarcoPolo HTML report

import MarcoPolo.report as report
report.generate_report(input_path="scRNAdata",output_path="report/hESC",top_num_table=1000,top_num_figure=1000)

Note
- Users can specify the number of genes to include in the report file by setting the top_num_table and top_num_figure parameters.
- If two genes have the same MarcoPolo score, the gene with the larger fold change value is prioritized.

The function outputs two files:

report/hESC/index.html (MarcoPolo HTML report)
report/hESC/voting.html (For each gene, this file shows the top 10 genes whose on/off information is most similar to that gene.)
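
The tie-breaking rule in the note above can be illustrated with a small sort. This is only a sketch with made-up values, not MarcoPolo's implementation: the rows, the function name rank_genes, and the assumption that a lower score ranks first are all hypothetical, so flip the sign of the first key if a higher score should rank first in your setting.

```python
# Hypothetical rows: (gene, MarcoPolo score, fold change).
rows = [
    ("GeneA", 3.0, 1.2),
    ("GeneB", 5.0, 0.8),
    ("GeneC", 3.0, 2.5),
]


def rank_genes(rows, top_num=None):
    """Order genes by score; break ties by larger fold change first."""
    # Assumption: a lower score means a better rank.
    ordered = sorted(rows, key=lambda r: (r[1], -r[2]))
    # top_num=None keeps every gene, mirroring a "top N" cutoff otherwise.
    return ordered[:top_num]


for gene, score, fold_change in rank_genes(rows, top_num=2):
    print(gene, score, fold_change)
```

Here GeneA and GeneC share the same score, so GeneC is listed first because its fold change is larger.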

To-dos

Support AnnData objects, which scanpy uses by default.
Build a Colab running environment.

Citation

If you use any part of this code or our data, please cite our paper.

@article{kim2022marcopolo,
  title={MarcoPolo: a method to discover differentially expressed genes in single-cell RNA-seq data without depending on prior clustering},
  author={Kim, Chanwoo and Lee, Hanbin and Jeong, Juhee and Jung, Keehoon and Han, Buhm},
  journal={Nucleic Acids Research},
  year={2022}
}

Contact

If you have any inquiries, please feel free to contact

Chanwoo Kim (Paul G. Allen School of Computer Science & Engineering @ the University of Washington)

Comments

KeyError: 'phenoid' (from generate_report function)

Hi, I greatly thank you for the wonderful software.

I was trying MarcoPolo with my scRNA-seq data. I found that the save_QQscore and save_MarcoPolo functions work well, but the "generate_report" function keeps giving me this error:

------Drawing figures------
Traceback (most recent call last):
  File "/home/data/.conda/envs/MarcoPolo/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 4554, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 4562, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'phenoid'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/data/MarcoPolo/MarcoPolo_install/MarcoPolo/MarcoPolo/report.py", line 306, in generate_report
    plot_value=exp_data_meta_transformed['phenoid']
  File "/home/data/.conda/envs/MarcoPolo/lib/python3.7/site-packages/pandas/core/frame.py", line 3024, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/home/data/.conda/envs/MarcoPolo/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3082, in get_loc
    raise KeyError(key) from err
KeyError: 'phenoid'

When I look at the outputs, the files "index.html" and "voting.html" and the directory "assets" seem to be properly generated, but the directory "plot_image" is empty. Can you help me resolve this error?

Thank you,

opened by hongsilv 2

A question regarding augmenting HVG with marcopolo

First of all, thank you for the wonderful tool. I have been working on multiple samples from different developmental time points and was hoping that MarcoPolo could cluster out some of the developing cells with more accuracy.

I am not super familiar with the scanpy pipeline and am relatively new to the field, so please kindly bear with my rudimentary questions.

I was wondering what specific steps are needed to augment HVG from the standard pipeline. Am I supposed to simply replace .var.highly_variable from the scanpy pipeline with marker_results from MarcoPolo? Also, if I am working with multiple samples, can I still adopt this process (augmented HVG) for the integrated data set (concatenated anndata)?

Thank you

opened by revolvefire 1

Run MarcoPolo in local machine with Jupyter Notebook

Hello @chanwkimlab ,

I'm a beginner in the Python world. I would like to test your tool on my dataset from my local machine with Jupyter Notebook. Can you help me with your code? When I try your vignette, I get this error:

AssertionError                            Traceback (most recent call last)
C:\Users\AppData\Local\Temp/ipykernel_2416/4231211302.py in <module>
      5     adata.obs["size_factor"] = norm_factor/norm_factor.mean()
      6     print("size factor was calculated")
----> 7 regression_result = MarcoPolo.run_regression(adata=adata, size_factor_key="size_factor",
      8                          num_threads=8)
      9 # If you use a local machine, you can set `num_threads` to higher than 1 (maybe upto 4), which will speed up the regression a lot. For some reason, num_threads>1 does not seem to work well on colab (maybe due to the the limited RAM).
.
.
.
AssertionError: Torch not compiled with CUDA enabled

Do you have any idea to resolve this issue?

Thanks a lot in advance.

Regards, Sha

opened by SHADJIA 3

Resolve sample batch

Dear authors, please accept my sincere thanks for providing such a useful tool. How should batch effects between samples in the input counts be handled, and can I use normalized data for the calculation? Best, Miller

opened by millersan 1

Running without R dependencies

Hi, could you please elaborate on what should be in the directory given as the path argument when running MarcoPolo? If I have an h5ad file produced with Scanpy, could I use it as input for the algorithm and therefore skip the R dependencies?

opened by afrendeiro 5

