Description
A geometric deep learning pipeline for predicting protein interface contacts.
Citing this work
If you use the code or data associated with this package, please cite:
@article{morehead2021deepinteract,
title = {Geometric Transformers for Protein Interface Contact Prediction},
author = {Alex Morehead and Chen Chen and Jianlin Cheng},
year = {2021},
eprint = {N/A},
archivePrefix = {arXiv},
primaryClass = {cs.LG}
}
First time setup
The following step is required in order to run DeepInteract:
Genetic databases
This step requires aria2c to be installed on your machine.
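Before starting a long download, it can help to confirm that the required command-line tools are on the PATH. The following is a minimal Python sketch rather than part of DeepInteract: the helper name `missing_tools` is hypothetical, and the tool list simply mirrors the commands used in the download scripts of this section.

```python
import shutil
from typing import Iterable, List


def missing_tools(tools: Iterable[str]) -> List[str]:
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# aria2c, tar, and gunzip are the commands the download scripts call.
absent = missing_tools(["aria2c", "tar", "gunzip"])
if absent:
    print("Please install: " + ", ".join(absent))
else:
    print("All download prerequisites were found.")
```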
DeepInteract needs only one of the following genetic (sequence) databases compatible with HH-suite3 to run:
- BFD (requires ~1.7TB of space once extracted)
- Small BFD (requires ~17GB of space once extracted)
- Uniclust30 (requires ~86GB of space once extracted)
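Since the three options differ in size by two orders of magnitude, a quick free-space check before downloading may save time. The sketch below is illustrative only: the function names and the `REQUIRED_GB` table are hypothetical, and the figures are the approximate ones quoted in the list above.

```python
import shutil
from pathlib import Path

# Approximate space requirements quoted in the list above, in gigabytes.
REQUIRED_GB = {"bfd": 1700, "small_bfd": 17, "uniclust30": 86}


def free_space_gb(target_dir: str) -> float:
    """Free space (in GB) on the filesystem that would hold target_dir."""
    path = Path(target_dir).expanduser().absolute()
    # The directory may not exist yet, so fall back to its nearest existing ancestor.
    while not path.exists():
        path = path.parent
    return shutil.disk_usage(path).free / 1e9


def has_room_for(database: str, target_dir: str) -> bool:
    """True if target_dir's filesystem can hold the named database."""
    return free_space_gb(target_dir) >= REQUIRED_GB[database]


print(has_room_for("small_bfd", "~/Data/Databases"))
```

Note that the download scripts delete the compressed archive only after extraction, so the archive temporarily needs space on top of these figures.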
Install the BFD for HH-suite3
# Following script originally from AlphaFold2 (https://github.com/deepmind/alphafold):
# Use $HOME here: a "~" inside double quotes is not expanded by the shell.
DOWNLOAD_DIR="$HOME/Data/Databases"
ROOT_DIR="${DOWNLOAD_DIR}/bfd"
# Mirror of:
# https://bfd.mmseqs.com/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz.
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
# The CLI argument --hhsuite_db for lit_model_predict.py
# should then become '~/Data/Databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt'
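The value passed to --hhsuite_db is a path prefix rather than a single file: an HH-suite3 database consists of several files sharing that prefix (for example, `<prefix>_a3m.ffdata` and `<prefix>_hhm.ffindex`). A minimal sketch for checking that extraction completed, assuming the six-file layout HH-suite3 databases normally use; the helper name is hypothetical and not part of DeepInteract:

```python
from pathlib import Path
from typing import List

# File suffixes an extracted HH-suite3 database is assumed to provide.
HHSUITE_SUFFIXES = (
    "_a3m.ffdata", "_a3m.ffindex",
    "_cs219.ffdata", "_cs219.ffindex",
    "_hhm.ffdata", "_hhm.ffindex",
)


def missing_hhsuite_files(db_prefix: str) -> List[str]:
    """Return the expected database files that do not exist for db_prefix."""
    prefix = str(Path(db_prefix).expanduser())
    return [prefix + suffix for suffix in HHSUITE_SUFFIXES
            if not Path(prefix + suffix).is_file()]


missing = missing_hhsuite_files(
    "~/Data/Databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt"
)
if missing:
    print("Missing files:\n" + "\n".join(missing))
else:
    print("Database looks complete.")
```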
(Smaller Alternative) Install the Small BFD for HH-suite3
# Following script originally from AlphaFold2 (https://github.com/deepmind/alphafold):
# Use $HOME here: a "~" inside double quotes is not expanded by the shell.
DOWNLOAD_DIR="$HOME/Data/Databases"
ROOT_DIR="${DOWNLOAD_DIR}/small_bfd"
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/reduced_dbs/bfd-first_non_consensus_sequences.fasta.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
pushd "${ROOT_DIR}"
gunzip "${ROOT_DIR}/${BASENAME}"
popd
# The CLI argument --hhsuite_db for lit_model_predict.py
# should then become '~/Data/Databases/small_bfd/bfd-first_non_consensus_sequences.fasta'
(Smaller Alternative) Install Uniclust30 for HH-suite3
# Following script originally from AlphaFold2 (https://github.com/deepmind/alphafold):
# Use $HOME here: a "~" inside double quotes is not expanded by the shell.
DOWNLOAD_DIR="$HOME/Data/Databases"
ROOT_DIR="${DOWNLOAD_DIR}/uniclust30"
# Mirror of:
# http://wwwuser.gwdg.de/~compbiol/uniclust/2018_08/uniclust30_2018_08_hhsuite.tar.gz
SOURCE_URL="https://storage.googleapis.com/alphafold-databases/casp14_versions/uniclust30_2018_08_hhsuite.tar.gz"
BASENAME=$(basename "${SOURCE_URL}")
mkdir --parents "${ROOT_DIR}"
aria2c "${SOURCE_URL}" --dir="${ROOT_DIR}"
tar --extract --verbose --file="${ROOT_DIR}/${BASENAME}" \
--directory="${ROOT_DIR}"
rm "${ROOT_DIR}/${BASENAME}"
# The CLI argument --hhsuite_db for lit_model_predict.py
# should then become '~/Data/Databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08'
Repository Directory Structure
DeepInteract
│
└───docker
│
└───img
│
└───project
│
└───checkpoints
│
└───datasets
│ │
│ └───builder
│ │
│ └───CASP_CAPRI
│ │ │
│ │ └───final
│ │ │ │
│ │ │ └───processed
│ │ │ │
│ │ │ └───raw
│ │ │
│ │ casp_capri_dgl_data_module.py
│ │ casp_capri_dgl_dataset.py
│ │
│ └───DIPS
│ │ │
│ │ └───final
│ │ │ │
│ │ │ └───processed
│ │ │ │
│ │ │ └───raw
│ │ │
│ │ dips_dgl_data_module.py
│ │ dips_dgl_dataset.py
│ │
│ └───Input
│ │ │
│ │ └───final
│ │ │ │
│ │ │ └───processed
│ │ │ │
│ │ │ └───raw
│ │ │
│ │ └───interim
│ │ │ │
│ │ │ └───complexes
│ │ │ │
│ │ │ └───external_feats
│ │ │ │ │
│ │ │ │ └───PSAIA
│ │ │ │ │
│ │ │ │ └───INPUT
│ │ │ │
│ │ │ └───pairs
│ │ │ │
│ │ │ └───parsed
│ │ │
│ │ └───raw
│ │
│ └───PICP
│ picp_dgl_data_module.py
│
└───test_data
│
└───utils
│ deepinteract_constants.py
│ deepinteract_modules.py
│ deepinteract_utils.py
│ dips_plus_utils.py
│ graph_utils.py
│ protein_feature_utils.py
│ vision_modules.py
│
lit_model_predict.py
lit_model_predict_docker.py
lit_model_train.py
.gitignore
CONTRIBUTING.md
environment.yml
LICENSE
README.md
requirements.txt
setup.cfg
setup.py
Running DeepInteract via Docker
The simplest way to run DeepInteract is using the provided Docker script.
The following steps are required in order to ensure Docker is installed and working correctly:
- Install Docker.
- Install NVIDIA Container Toolkit for GPU support.
- Set up running Docker as a non-root user.
- Check that DeepInteract will be able to use a GPU by running:
docker run --rm --gpus all nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu20.04 nvidia-smi
The output of this command should show a list of your GPUs. If it doesn't, check if you followed all steps correctly when setting up the NVIDIA Container Toolkit or take a look at the following NVIDIA Docker issue.
Now that we know Docker is functioning properly, we can begin building our Docker image for DeepInteract:
- Clone this repository and cd into it.
git clone https://github.com/BioinfoMachineLearning/DeepInteract
cd DeepInteract/
DI_DIR=$(pwd)
- Download the trained model checkpoint.
mkdir -p project/checkpoints
wget -P project/checkpoints https://zenodo.org/record/5546775/files/LitGINI-GeoTran-DilResNet.ckpt
- Build the Docker image (Warning: Requires ~13GB of Space):
docker build -f docker/Dockerfile -t deepinteract .
- Install the run_docker.py dependencies. Note: You may optionally wish to create a Python Virtual Environment to prevent conflicts with your system's Python environment.
pip3 install -r docker/requirements.txt
- Create a directory in which to generate input features and outputs:
mkdir -p project/datasets/Input
- Run run_docker.py pointing to two input PDB files containing the first and second chains of a complex for which you wish to predict the contact probability map. For example, for the DIPS-Plus test target with the PDB ID 4HEQ:
python3 docker/run_docker.py --left_pdb_filepath "$DI_DIR"/project/test_data/4heq_l_u.pdb --right_pdb_filepath "$DI_DIR"/project/test_data/4heq_r_u.pdb --input_dataset_dir "$DI_DIR"/project/datasets/Input --ckpt_name "$DI_DIR"/project/checkpoints/LitGINI-GeoTran-DilResNet.ckpt --hhsuite_db ~/Data/Databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --num_gpus 0
This script will generate the predicted interface contact map, as well as the Geometric Transformer's learned node and edge representations for both chain graphs, and save them to the given input directory as NumPy array files (e.g., test_data/4heq_contact_prob_map.npy).
- Note that by using the default --num_gpus 0 flag when executing run_docker.py, the Docker container will only make use of the system's available CPU(s) for prediction. However, by specifying --num_gpus 1 when executing run_docker.py, the Docker container will then employ the first available GPU for prediction.
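When predictions are launched from another script (for instance, to loop over many complexes), the invocation above can be assembled as an argument list. The sketch below is only an illustration: `build_run_docker_command` is a hypothetical helper, the paths are placeholders, and the flags are the ones shown in the example command.

```python
import shlex
from pathlib import Path
from typing import List


def build_run_docker_command(di_dir: str, left_pdb: str, right_pdb: str,
                             hhsuite_db: str, num_gpus: int = 0) -> List[str]:
    """Assemble the run_docker.py invocation as an argument list."""
    root = Path(di_dir)
    return [
        "python3", "docker/run_docker.py",
        "--left_pdb_filepath", left_pdb,
        "--right_pdb_filepath", right_pdb,
        "--input_dataset_dir", str(root / "project" / "datasets" / "Input"),
        "--ckpt_name", str(root / "project" / "checkpoints" / "LitGINI-GeoTran-DilResNet.ckpt"),
        "--hhsuite_db", hhsuite_db,
        "--num_gpus", str(num_gpus),  # 0 = CPU only, 1 = first available GPU
    ]


di_dir = "/path/to/DeepInteract"  # placeholder for $DI_DIR
cmd = build_run_docker_command(
    di_dir,
    left_pdb=di_dir + "/project/test_data/4heq_l_u.pdb",
    right_pdb=di_dir + "/project/test_data/4heq_r_u.pdb",
    # Placeholder database prefix; substitute your own download location.
    hhsuite_db="/path/to/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt",
    num_gpus=1,
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True, cwd=di_dir)
```

Because an argument list bypasses the shell, a leading `~` in a path is not expanded; pass absolute paths or call `os.path.expanduser` first.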
Running DeepInteract via a Traditional Installation (for Linux-Based Operating Systems)
First, install and configure the Conda environment:
# Clone this repository:
git clone https://github.com/BioinfoMachineLearning/DeepInteract
# Change to project directory:
cd DeepInteract
DI_DIR=$(pwd)
# Create a Conda environment named DeepInteract from environment.yml:
conda env create --name DeepInteract -f environment.yml
# Activate the Conda environment:
conda activate DeepInteract
# (Optional) Perform a full install of the pip dependencies described in 'requirements.txt':
pip3 install -r requirements.txt
# (Optional) To remove the long Conda environment prefix in your shell prompt, modify the env_prompt setting in your .condarc file with:
conda config --set env_prompt '({name})'
Installing PSAIA
Install GCC 10 for PSAIA:
# Install GCC 10 for Ubuntu 20.04
sudo apt install software-properties-common
sudo add-apt-repository ppa:ubuntu-toolchain-r/ppa
sudo apt update
sudo apt install gcc-10 g++-10
# Or install GCC 10 for Arch Linux/Manjaro
yay -S gcc10
Install QT4 for PSAIA:
# Install QT4 for Ubuntu 20.04:
sudo add-apt-repository ppa:rock-core/qt4
sudo apt update
sudo apt install libqt4* libqtcore4 libqtgui4 libqtwebkit4 qt4* libxext-dev
# Or install QT4 for Arch Linux/Manjaro
yay -S qt4
Compile PSAIA from source:
# Select the location to install the software:
MY_LOCAL=~/Programs
# Download and extract PSAIA's source code:
mkdir -p "$MY_LOCAL"
cd "$MY_LOCAL"
wget http://complex.zesoi.fer.hr/data/PSAIA-1.0-source.tar.gz
tar -xvzf PSAIA-1.0-source.tar.gz
# Compile PSAIA (i.e., a GUI for PSA):
cd PSAIA_1.0_source/make/linux/psaia/
qmake-qt4 psaia.pro
make
# Compile PSA (i.e., the protein structure analysis (PSA) program):
cd ../psa/
qmake-qt4 psa.pro
make
# Compile PIA (i.e., the protein interaction analysis (PIA) program):
cd ../pia/
qmake-qt4 pia.pro
make
# Test run any of the above-compiled programs:
cd "$MY_LOCAL"/PSAIA_1.0_source/bin/linux
# Test run PSA inside a GUI:
./psaia/psaia
# Test run PIA through a terminal:
./pia/pia
# Test run PSA through a terminal:
./psa/psa
Finally, substitute the absolute filepath of your local DeepInteract repository (i.e., where on your local storage device you downloaded it) anywhere the repository is referenced in project/datasets/builder/psaia_config_file_input.txt.
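The substitution can be done in any text editor. As an alternative, here is a minimal Python sketch; `replace_repo_path` is a hypothetical helper, and the old path must be read from the config file itself, since this README does not state which path the file ships with.

```python
from pathlib import Path


def replace_repo_path(config_path: str, old_root: str, new_root: str) -> int:
    """Replace every occurrence of old_root with new_root in the config file.

    Returns the number of occurrences that were replaced.
    """
    config = Path(config_path)
    text = config.read_text()
    count = text.count(old_root)
    config.write_text(text.replace(old_root, new_root))
    return count


# Example (both paths are placeholders; inspect the file to find the old one):
# replace_repo_path(
#     "project/datasets/builder/psaia_config_file_input.txt",
#     old_root="/old/path/to/DeepInteract",
#     new_root="/your/path/to/DeepInteract",
# )
```

A return value of 0 means the old path was not found, in which case the file is left unchanged.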
Training
Download training and cross-validation DGLGraphs
To train, retrain, or cross-validate DeepInteract models using DIPS-Plus and/or CASP-CAPRI targets, we first need to download the preprocessed DGLGraphs from Zenodo:
# Download and extract preprocessed DGLGraphs for DIPS-Plus and CASP-CAPRI
# Requires ~55GB of free space
mkdir -p project/datasets/DIPS/final
cd project/datasets/DIPS/final
# Download DIPS-Plus
wget https://zenodo.org/record/5546775/files/final_raw_dips.tar.gz
wget https://zenodo.org/record/5546775/files/final_processed_dips.tar.gz.partaa
wget https://zenodo.org/record/5546775/files/final_processed_dips.tar.gz.partab
# First, reassemble all processed DGLGraphs
# We split the (tar.gz) archive into two separate parts with
# 'split -b 4096M final_processed_dips.tar.gz "final_processed_dips.tar.gz.part"'
# to upload it to Zenodo, so to recover the original archive:
cat final_processed_dips.tar.gz.parta* >final_processed_dips.tar.gz
# Extract DIPS-Plus
tar -xzf final_raw_dips.tar.gz
tar -xzf final_processed_dips.tar.gz
rm final_processed_dips.tar.gz.parta* final_raw_dips.tar.gz final_processed_dips.tar.gz
# Download CASP-CAPRI
mkdir -p ../../CASP_CAPRI/final
cd ../../CASP_CAPRI/final
wget https://zenodo.org/record/5546775/files/final_raw_casp_capri.tar.gz
wget https://zenodo.org/record/5546775/files/final_processed_casp_capri.tar.gz
# Extract CASP-CAPRI
tar -xzf final_raw_casp_capri.tar.gz
tar -xzf final_processed_casp_capri.tar.gz
rm final_raw_casp_capri.tar.gz final_processed_casp_capri.tar.gz
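The reassembly step works because `split` cuts the archive into consecutive byte ranges, so concatenating the parts in name order restores the original file exactly. On systems without `cat`, the same step can be sketched in Python as follows (the helper name `join_parts` is hypothetical):

```python
import shutil
from typing import Iterable


def join_parts(part_paths: Iterable[str], output_path: str) -> None:
    """Concatenate the parts in name order, like `cat archive.parta* > archive`."""
    with open(output_path, "wb") as out:
        for part in sorted(part_paths):
            with open(part, "rb") as src:
                # Stream in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)


# Example, run from project/datasets/DIPS/final:
# join_parts(
#     ["final_processed_dips.tar.gz.partaa", "final_processed_dips.tar.gz.partab"],
#     "final_processed_dips.tar.gz",
# )
```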
Navigate to the project directory and run the training script with the desired parameters:
# Hint: Run `python3 lit_model_train.py --help` to see all available CLI arguments
cd project
python3 lit_model_train.py --lr 1e-3 --weight_decay 1e-2
cd ..
Inference
Download trained model checkpoint
# Return to root directory of DeepInteract repository
cd "$DI_DIR"
# Download the trained model checkpoint
mkdir -p project/checkpoints
wget -P project/checkpoints https://zenodo.org/record/5546775/files/LitGINI-GeoTran-DilResNet.ckpt
Predict interface contact probability maps
Navigate to the project directory and run the prediction script with the filenames of the left and right PDB chains.
# Hint: Run `python3 lit_model_predict.py --help` to see all available CLI arguments
cd project
python3 lit_model_predict.py --left_pdb_filepath "$DI_DIR"/project/test_data/4heq_l_u.pdb --right_pdb_filepath "$DI_DIR"/project/test_data/4heq_r_u.pdb --ckpt_dir "$DI_DIR"/project/checkpoints --ckpt_name LitGINI-GeoTran-DilResNet.ckpt --hhsuite_db ~/Data/Databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt
cd ..
This script will generate the predicted interface contact map, as well as the Geometric Transformer's learned node and edge representations for both chain graphs, and save them to the given input directory as NumPy array files (e.g., test_data/4heq_contact_prob_map.npy).
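Once a prediction has been saved, the contact map can be inspected with NumPy. The sketch below ranks the most probable inter-chain contacts. It assumes the saved map is a 2-D array whose rows index residues of the left chain and whose columns index residues of the right chain; that layout is an assumption, so check `prob_map.shape` on your own output first. The helper name `top_contacts` is hypothetical.

```python
import numpy as np


def top_contacts(prob_map, k=10):
    """Return the k highest-scoring (left_idx, right_idx, probability) triples.

    Assumes prob_map is 2-D: rows = left-chain residues, columns = right-chain residues.
    """
    # Sort all entries by probability, highest first, and keep the first k.
    flat_order = np.argsort(prob_map, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat_order, prob_map.shape)
    return [(int(r), int(c), float(prob_map[r, c])) for r, c in zip(rows, cols)]


# In practice: prob_map = np.load("test_data/4heq_contact_prob_map.npy")
# A small synthetic map stands in for the real file here.
demo_map = np.array([[0.1, 0.9, 0.2],
                     [0.05, 0.3, 0.7]])
print(top_contacts(demo_map, k=2))  # [(0, 1, 0.9), (1, 2, 0.7)]
```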
Acknowledgements
DeepInteract communicates with and/or references the following separate libraries and packages:
We thank all their contributors and maintainers!
License and Disclaimer
Copyright 2021 University of Missouri-Columbia Bioinformatics & Machine Learning (BML) Lab.
DeepInteract Code License
Licensed under the GNU General Public License, Version 3.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at https://www.gnu.org/licenses/gpl-3.0.en.html.
Third-party software
Use of the third-party software, libraries or code referred to in the Acknowledgements section above may be governed by separate terms and conditions or license provisions. Your use of the third-party software, libraries or code is subject to any such terms and you should check that you can comply with any applicable restrictions or terms and conditions before use.