PhaMer is a Python library for identifying bacteriophages from metagenomic data. PhaMer is based on a Transformer model and relies on a protein-based vocabulary to convert DNA sequences into sentences.
Overview
The main function of PhaMer is to identify phage-like contigs in metagenomic data. The program takes FASTA files as input and writes its predictions to a CSV file. Because PhaMer is a deep learning model, we recommend using a GPU if your machine has one; it will save you time.
If you have any trouble installing or using PhaMer, please let us know by opening an issue on GitHub or emailing us ([email protected]).
Required Dependencies
If you want to use the GPU to accelerate the program:
- CUDA
- PyTorch (GPU version)
- For the CPU version of PyTorch:
conda install pytorch torchvision torchaudio cpuonly -c pytorch
- For the GPU version of PyTorch: search for PyTorch to find the build that matches the CUDA version on your computer
An easier way to install
Note: we suggest installing all the packages with conda (both Miniconda and Anaconda are fine).
After cloning this repository, you can use conda to create the environment from PhaMer.yaml. This installs all the packages you need in GPU mode (make sure CUDA is installed on your system to use the GPU version; otherwise, PhaMer will run on the CPU). The command is: conda env create -f PhaMer.yaml -n phamer
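To see which mode the environment will actually use, a quick check along the following lines may help. This is a minimal sketch, not part of PhaMer: the helper name `detect_device` is hypothetical, and the only PyTorch call it relies on is `torch.cuda.is_available()`.

```python
import importlib.util


def detect_device() -> str:
    """Report which compute device PyTorch would see in this environment.

    Returns "missing" if PyTorch is not installed, "cuda" if a GPU is
    visible to PyTorch, and "cpu" otherwise.
    """
    # Hypothetical helper for illustration; PhaMer does not ship this function.
    if importlib.util.find_spec("torch") is None:
        return "missing"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


print("PyTorch device:", detect_device())
```

If this prints `cpu` on a machine that has a GPU, the installed PyTorch build or the CUDA setup is the likely cause.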
Prepare the database and environment
Because of GitHub's file size limits, the database is compressed. Before using PhaMer, you need to unpack it with the following commands.
- When you use PhaMer for the first time
cd PhaMer/
conda env create -f PhaMer.yaml -n phamer
conda activate phamer
cd database/
bzip2 -d database.fa.bz2
git lfs install
rm transformer.pth
git checkout .
cd ..
Note: Because the parameter file (transformer.pth) is larger than 100 MB, please make sure git-lfs is installed so that the file can be downloaded from GitHub.
- If the example runs without any bugs, you only need to activate your 'phamer' environment before using PhaMer.
conda activate phamer
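If git-lfs was not installed when the repository was cloned, transformer.pth may be a small text pointer instead of the real parameter file. The sketch below checks for that case. It relies on the fact that Git LFS pointer files begin with the line `version https://git-lfs.github.com/spec/v1`; the helper name `is_lfs_pointer` and the temporary demo files are illustrative only and are not part of PhaMer.

```python
import tempfile
from pathlib import Path

# First line of every Git LFS pointer file.
LFS_SIGNATURE = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path) -> bool:
    """Return True if the file at `path` looks like a Git LFS pointer."""
    # Hypothetical helper for illustration; not part of PhaMer.
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_SIGNATURE))
    return head == LFS_SIGNATURE


# Demo on two throwaway files: a fake pointer and a fake binary file.
with tempfile.TemporaryDirectory() as tmp:
    pointer = Path(tmp) / "pointer.pth"
    pointer.write_bytes(LFS_SIGNATURE + b"\nsize 123\n")
    binary = Path(tmp) / "weights.pth"
    binary.write_bytes(b"\x80\x02demo-bytes")
    pointer_result = is_lfs_pointer(pointer)
    binary_result = is_lfs_pointer(binary)

print(pointer_result, binary_result)
```

Calling `is_lfs_pointer("database/transformer.pth")` from the PhaMer directory should return False once the real file has been fetched.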
Usage
python preprocessing.py [--contigs INPUT_FA] [--len MINIMUM_LEN]
python PhaMer.py [--out OUTPUT_CSV] [--reject THRESHOLD]
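The two steps can also be chained from a Python script. The following is a minimal sketch that only assembles and runs the two command lines shown above; the function names `build_commands` and `run_pipeline` are hypothetical, and the defaults mirror the documented defaults for --len and --reject.

```python
import subprocess
import sys


def build_commands(contigs, out_csv, min_len=3000, reject=0.3):
    """Build the argument lists for the preprocessing and prediction steps."""
    # Hypothetical wrapper; it uses only the flags documented in this README.
    preprocess = [sys.executable, "preprocessing.py",
                  "--contigs", contigs, "--len", str(min_len)]
    predict = [sys.executable, "PhaMer.py",
               "--out", out_csv, "--reject", str(reject)]
    return [preprocess, predict]


def run_pipeline(contigs, out_csv, min_len=3000, reject=0.3, cwd="."):
    """Run both steps in order, stopping at the first failure."""
    for command in build_commands(contigs, out_csv, min_len, reject):
        subprocess.run(command, check=True, cwd=cwd)


# Show the commands without running them.
for command in build_commands("test_contigs.fa", "example_prediction.csv"):
    print(" ".join(command[1:]))
```

Calling `run_pipeline("test_contigs.fa", "example_prediction.csv", cwd="PhaMer")` with the 'phamer' environment active would execute both scripts inside the cloned directory.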
Options
--contigs INPUT_FA
input fasta file
--len MINIMUM_LEN
predict only for sequences >= len bp (default 3000)
--out OUTPUT_CSV
The output csv file (prediction)
--reject THRESHOLD
Threshold to reject prophages. The higher the value, the more prophages will be rejected (default 0.3)
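To preview which contigs would pass the --len cutoff before running the pipeline, a length filter can be sketched in plain Python. This illustrates the idea only and is not PhaMer's own preprocessing code; `read_fasta` and `filter_by_length` are hypothetical helpers, and the demo records are made up.

```python
def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep the first word of the header as the contig name.
            name, chunks = (line[1:].split() or [""])[0], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records, min_len=3000):
    """Keep records whose sequence is at least `min_len` bp long."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


# Made-up demo input: one 4 bp contig and one 3000 bp contig split over two lines.
demo = [">short", "ACGT", ">long", "A" * 1500, "C" * 1500]
kept = filter_by_length(read_fasta(demo))
print([name for name, _ in kept])
```

For a real file, pass an open file handle to `read_fasta`, for example `read_fasta(open("test_contigs.fa"))`.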
Example
Prediction on the example file:
python preprocessing.py --contigs test_contigs.fa
python PhaMer.py --out example_prediction.csv
The prediction will be written to example_prediction.csv. The CSV file has three columns: contig names, prediction, and prediction score.
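The output can be summarized with the standard csv module. The sketch below counts how many contigs received each prediction label. It reads columns by position (name, prediction, score) and assumes the file starts with a header row; the header names and the labels in the demo data are invented for illustration and may differ from what PhaMer writes.

```python
import csv
import io
from collections import Counter


def summarize_predictions(handle, has_header=True):
    """Count rows per prediction label (second column) in a PhaMer-style CSV."""
    # Assumption: three columns in the order name, prediction, score.
    reader = csv.reader(handle)
    if has_header:
        next(reader, None)  # skip the header row if there is one
    counts = Counter()
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or malformed rows
        counts[row[1]] += 1
    return counts


# Invented demo data; real labels and header names may differ.
demo = io.StringIO(
    "name,prediction,score\n"
    "contig_1,phage,0.98\n"
    "contig_2,non-phage,0.12\n"
    "contig_3,phage,0.71\n"
)
summary = summarize_predictions(demo)
print(dict(summary))
```

For the example run, open the file with `open("example_prediction.csv", newline="")` and pass the handle to `summarize_predictions`.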
References
The paper has been submitted to ISMB 2022.
The arXiv version can be found via: Accurate identification of bacteriophages from metagenomic data using Transformer
Contact
If you have any questions, please email us: [email protected]