CHERRY is a python library for predicting the interactions between viral and prokaryotic genomes

Related tags

Deep Learning CHERRY
Overview

CHERRY CHERRY is a python library for predicting the interactions between viral and prokaryotic genomes. CHERRY is based on a deep learning model, which consists of a graph convolutional encoder and a link prediction decoder.

Overview

There are two kind of tasks that CHERRY can work:

  1. Host prediction for virus
  2. Identifying viruses that infect pathogenic bacteria

Users can choose one of the task when running CHERRY. If you have any trouble installing or using CHERRY, please let us know by opening an issue on GitHub or emailing us ([email protected]).

Required Dependencies

  • Python 3.x
  • Numpy
  • Pytorch>1.8.0
  • Networkx
  • Pandas
  • Diamond
  • BLAST
  • MCL
  • Prodigal

All these packages can be installed using Anaconda.

If you want to use the gpu to accelerate the program:

  • cuda
  • Pytorch-gpu

An easiler way to install

We recommend you to install all the package with Anaconda

After cloning this respository, you can use anaconda to install the CHERRY.yaml. This will install all packages you need with gpu mode (make sure you have installed cuda on your system to use the gpu version. Othervise, it will run with cpu version). The command is: conda env create -f CHERRY.yaml

  • For cpu version pytorch: conda install pytorch torchvision torchaudio cpuonly -c pytorch
  • For gpu version pytorch: Search pytorch to find the correct cuda version according to your computer Note: we suggest you to install all the package using conda (both miniconda and anaconda are ok). We supply a

Prepare the database

Due to the limited size of the GitHub, we zip the database. Before using CHEERY, you need to unpack them using the following commands.

cd CHEERY/dataset
bzip2 -d protein.fasta.bz2
bzip2 -d nucl.fasta.bz2
cd ../prokaryote
gunzip *
cd ..

Usage

1 Predicting host for viruses

If you want to predict hosts for viruses, the input should be a fasta file containing the virual sequences. We support an example file named "test_contigs.fa" in the Github folder. Then, the only command that you need to run is

python run_Speed_up.py [--contigs INPUT_FA] [--len MINIMUM_LEN] [--model MODEL] [--topk TOPK_PRED]

Options

  --contigs INPUT_FA
                        input fasta file
  --len MINIMUM_LEN
                        predict only for sequence >= len bp (default 8000)
  --model MODEL (pretrain or retrain)
                        predicting host with pretrained parameters or retrained paramters (default pretrain)
  --topk TOPK_PRED
                        The host prediction with topk score (default 1)

Example

Prediction on species level with pretrained paramters:

python run_Speed_up.py --contigs test_contigs.fa --len 8000 --model pretrain --topk 3

Note: Commonly, you do not need to retrain the model, especially when you do not have gpu unit.

OUTPUT

The format of the output file is a csv file ("final_prediction.csv") which contain the prediction of each virus. Column contig_name is the accession from the input.

Since the topk method is given, we cannot give the how taxaonmic tree for each prediction. However, we will supply a script for you to convert the prediction into a complte taxonmoy tree. Use the following command to generate taxonomy tree:

python run_Taxonomy_tree.py [--k TOPK_PRED]

Because there are k prediction in the "final_prediction.csv" file, you need to specify the k to generate the tree. The output of program is 'Top_k_prediction_taxonomy.csv'.

2 Predicting virus infecting prokaryote

If you want to predict hosts for viruses, you need to supply two kinds of inputs:

  1. Place your prokaryotic genomes in new_prokaryote/ folder.
  2. A fasta file containing the virus squences. Then, the program will output which virus in your fasta file will infect the prkaryotes in the new_prokaryote/ folder.

The command is simlar to the previous one but two more paramter is need:

python run_Speed_up.py [--mode MODE] [--t THRESHOLD]

Example

python run_Speed_up.py --contigs test_contigs.fa --mode prokaryote --t 0.98

Options

  --mode MODE (prokaryote or virus)
                        Switch mode for predicting virus or predicting host
  --t THRESHOLD
                        The confident threshold for predicting virus, the higier the threshold the higher the precision. (default 0.98)

OUTPUT

The format of the output file is a csv file which contain the prediction of each virus. Column prokaryote is the accession of your given prokaryotic genomes. Column virus is the list of viruses that might infect these genomes.

Extension of the parokaryotic genomes database

Due to the limitation of storage on GitHub, we only provided the parokaryote with known interactions (Date up to 2020) in prokaryote folder. If you want to predict interactions with more species, please place your parokaryotic genomes into prokaryote/ folder and add an entry of taxonomy information into dataset/prokaryote.csv. We also recommand you only add the prokaryotes of interest to save the computation resourse and time. This is because all the genomes in prokaryote folder will be used to generate the multimodal graph, which is a O(n^2) algorithm.

Example

If you have a metagenomic data and you know that only E. coli, Butyrivibrio fibrisolvens, and Faecalibacterium prausnitzii exist in the metagenomic data. Then you can placed the genomes of these three species into the prokaryote/ and add the entry in dataset/prokaryote.csv. An example of the entry is look like:

GCF_000007445,Bacteria,Proteobacteria,Gammaproteobacteria,Enterobacterales,Enterobacteriaceae,Escherichia,Escherichia coli

The corresponding header of the entry is: Accession,Superkingdom,Phylum,Class,Order,Family,Genus,Species. If you do not know the whole taxonomy tree, you can directly use a specific name for all columns. Because CHERRY is a link prediction tool, it will directly use the given name for prediction.

Noted: Since our program will use the accession for searching and constructing the knowledge graph, the name of the fasta file of your genomes should be the same as the given accession. For example, if your accession is GCF_000007445, your file name should be GCF_000007445.fa. Otherwise, the program cannot find the entry.

Extension of the virus-prokaryote interactions database

If you know more virus-prokaryote interactions than our pre-trained model (given in Interactiondata), you can add them to train a custom model. Several steps you need to do to train your model:

  1. Add your viral genomes into the nucl.fasta file and run the python refresh.py to generate new protein.fasta and database_gene_to_genome.csv files. They will replace the old one in the dataset/ folder automatically.
  2. Add the entrys of host taxonomy information into dataset/virus.csv. The corresponding header of the entry is: Accession (of the virus), Superkingdom, Phylum, Class, Order, Family, Genus, Species. The required field is Species. You can left it blank if you do not know other fields. Also, the accession of the virus shall be the same as your fasta entry.
  3. Place your prokaryotic genomes into the the prokaryote/ folder and add an entry in dataset/prokaryote.csv. The guideline is the same as the previous section.
  4. Use retrain as the parameter for --mode option to run the program.

References

The paper is submitted to the Briefings in Bioinformatics.

The arXiv version can be found via: CHERRY: a Computational metHod for accuratE pRediction of virus-pRokarYotic interactions using a graph encoder-decoder model

Contact

If you have any questions, please email us: [email protected]

Notes

  1. if the program output an error (which is caused by your machine): Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library. You can type in the command export MKL_SERVICE_FORCE_INTEL=1 before runing run_Speed_up.py
You might also like...
Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch
Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch

Enformer - Pytorch (wip) Implementation of Enformer, Deepmind's attention network for predicting gene expression, in Pytorch. The original tensorflow

Predicting path with preference based on user demonstration using Maximum Entropy Deep Inverse Reinforcement Learning in a continuous environment
Predicting path with preference based on user demonstration using Maximum Entropy Deep Inverse Reinforcement Learning in a continuous environment

Preference-Planning-Deep-IRL Introduction Check my portfolio post Dependencies Gym stable-baselines3 PyTorch Usage Take Demonstration python3 record.

 Predicting Student Attentiveness using OpenCV
Predicting Student Attentiveness using OpenCV

Predicting-Student-Attentiveness-using-OpenCV The model will predict if a student is attentive or not through facial parameter received through the st

[CIKM 2019] Code and dataset for "Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Prediction"

FiGNN for CTR prediction The code and data for our paper in CIKM2019: Fi-GNN: Modeling Feature Interactions via Graph Neural Networks for CTR Predicti

Learning Intents behind Interactions with Knowledge Graph for Recommendation, WWW2021

Learning Intents behind Interactions with Knowledge Graph for Recommendation This is our PyTorch implementation for the paper: Xiang Wang, Tinglin Hua

git《Self-Attention Attribution: Interpreting Information Interactions Inside Transformer》(AAAI 2021) GitHub:

Self-Attention Attribution This repository contains the implementation for AAAI-2021 paper Self-Attention Attribution: Interpreting Information Intera

Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time
Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time

Semi Hand-Object Semi-Supervised 3D Hand-Object Poses Estimation with Interactions in Time (CVPR 2021).

Official code for
Official code for "Focal Self-attention for Local-Global Interactions in Vision Transformers"

Focal Transformer This is the official implementation of our Focal Transformer -- "Focal Self-attention for Local-Global Interactions in Vision Transf

A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

A graph neural network (GNN) model to predict protein-protein interactions (PPI) with no sample features

Comments
  • A error in prediction of viral hosts with my own bacterial genomes

    A error in prediction of viral hosts with my own bacterial genomes

    Hi, Kennth I tried to use my own genomes to predict the viral host, but a error occured as follows: Command: python run_Speed_up.py --contigs all_viral_combined_MMseq_out2_rep_seq.fasta --mode prokaryote --t 0.98 --len 1500

    Building a new DB, current time: 06/29/2022 18:44:58 New DB name: /home/jyzhang/softwares/CHERRY/new_blast_db/bin214 New DB title: new_prokaryote/bin214.fa Sequence type: Nucleotide Keep MBits: T Maximum file size: 1000000000B Adding sequences from FASTA; added 60 sequences in 0.14666 seconds. Running blastn... Traceback (most recent call last): File "edge_virus_prokaryote.py", line 151, in with open(blast_tab_out+file) as file_in: FileNotFoundError: [Errno 2] No such file or directory: 'blast_tab/bin117.tab' phage_host Error for file contig_0

    I have already put my genomes in the "new_prokaryote/" folder, and added corresponding taxonomies in the dataset/prokaryote.csv file. When I used my own viral contigs to predict hosts, the above-mentioned error ocurred. I have tried to sovle this problem. I modified the line 151 of edge_virus_prokaryote.py, that is I changed "with open(blast_tab_out+file)" to "with open(new_blast_tab_out+file)". Then it worked. I wonder whether the modification is right or not. Besides, I did not find information about Crispr spacers of my own genomes in new_prokaryote/ folder in the result folder. Thus, I also wonder whether Cherry identify Crispr spacers of my own bacterial genomes in new_prokaryote/ folder, and whether Cherry will predict viral hosts according to the Crispr spacers of my own genomes. Look forward to your reply.

    Jiayu Zhang

    opened by Rainingfeeling 6
Owner
Kenneth Shang
Kenneth Shang
Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Selene is a Python library and command line interface for training deep neural networks from biological sequence data such as genomes.

Troyanskaya Laboratory 323 Jan 1, 2023
Python KNN model: Predicting a probability of getting a work visa. Tableau: Non-immigrant visas over the years.

The value of international students to the United States. Probability of getting a non-immigrant visa. Project timeline: Jan 2021 - April 2021 Project

Zinaida Dvoskina 2 Nov 21, 2021
A DNN inference latency prediction toolkit for accurately modeling and predicting the latency on diverse edge devices.

Note: This is an alpha (preview) version which is still under refining. nn-Meter is a novel and efficient system to accurately predict the inference l

Microsoft 244 Jan 6, 2023
Predicting Tweet Sentiment Maching Learning and streamlit

Predicting-Tweet-Sentiment-Maching-Learning-and-streamlit (I prefere using Visual Studio Code ) Open the folder in VS Code Run the first cell in requi

null 1 Nov 20, 2021
This repository contains the code used for Predicting Patient Outcomes with Graph Representation Learning (https://arxiv.org/abs/2101.03940).

Predicting Patient Outcomes with Graph Representation Learning This repository contains the code used for Predicting Patient Outcomes with Graph Repre

Emma Rocheteau 76 Dec 22, 2022
PAWS 🐾 Predicting View-Assignments with Support Samples

This repo provides a PyTorch implementation of PAWS (predicting view assignments with support samples), as described in the paper Semi-Supervised Learning of Visual Features by Non-Parametrically Predicting View Assignments with Support Samples.

Facebook Research 437 Dec 23, 2022
Predicting Semantic Map Representations from Images with Pyramid Occupancy Networks

This is the code associated with the paper Predicting Semantic Map Representations from Images with Pyramid Occupancy Networks, published at CVPR 2020.

Thomas Roddick 219 Dec 20, 2022
We are More than Our JOints: Predicting How 3D Bodies Move

We are More than Our JOints: Predicting How 3D Bodies Move Citation This repo contains the official implementation of our paper MOJO: @inproceedings{Z

null 72 Oct 20, 2022
A geometric deep learning pipeline for predicting protein interface contacts.

A geometric deep learning pipeline for predicting protein interface contacts.

null 44 Dec 30, 2022
An end-to-end regression problem of predicting the price of properties in Bangalore.

Bangalore-House-Price-Prediction An end-to-end regression problem of predicting the price of properties in Bangalore. Deployed in Heroku using Flask.

Shruti Balan 1 Nov 25, 2022