EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Hannes Stärk

Last update: Jan 3, 2023

Related tags

Deep Learning geometry proteins protein-structure drug-discovery molecules graph-neural-networks equivariance

Overview

Paper on arXiv

EquiBind is an SE(3)-equivariant geometric deep learning model that performs direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand’s bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. If you have questions, don't hesitate to open an issue, or to contact me via [email protected] or social media, or Octavian Ganea via [email protected]. We are happy to hear from you!

Dataset

Our preprocessed data (see the dataset section in the paper's appendix) is available from Zenodo.
The files in data contain the names for the time-based data split.

If you want to train one of our models with the data:

download it from Zenodo
unzip the directory and place it into data such that you have the path data/PDBBind

Use provided model weights to predict binding structure of your own protein-ligand pairs:

Step 1: What you need as input

Ligand files in .mol2, .sdf, .pdbqt, or .pdb format.
Receptor files in .pdb format.
For each complex you want to predict, you need a directory containing the ligand and receptor file, like this:

my_data_folder
└───name1
    │   name1_protein.pdb
    │   name1_ligand.sdf
└───name2
    │   name2_protein.pdb
    │   name2_ligand.sdf
...
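
Before running inference, it can help to verify that the input folder matches the layout above. The following is a minimal sketch, not part of the repository: the helper names are hypothetical, and it assumes the `<name>_protein.pdb` / `<name>_ligand.<ext>` naming shown in the example tree (the inference script itself may be more lenient about file names).

```python
from pathlib import Path

# Ligand extensions listed in Step 1; the receptor must be a .pdb file.
LIGAND_EXTENSIONS = (".mol2", ".sdf", ".pdbqt", ".pdb")


def check_complex_dir(complex_dir):
    """Return a list of problems found in one complex directory (empty if OK)."""
    complex_dir = Path(complex_dir)
    name = complex_dir.name
    problems = []
    # Assumption: the receptor is named <name>_protein.pdb as in the example tree.
    if not (complex_dir / f"{name}_protein.pdb").is_file():
        problems.append(f"{name}: missing {name}_protein.pdb")
    # Assumption: the ligand is named <name>_ligand.<ext> with a supported extension.
    ligands = [complex_dir / f"{name}_ligand{ext}" for ext in LIGAND_EXTENSIONS]
    if not any(p.is_file() for p in ligands):
        problems.append(f"{name}: missing {name}_ligand.<mol2|sdf|pdbqt|pdb>")
    return problems


def check_data_folder(data_folder):
    """Check every subdirectory of data_folder and collect all problems."""
    problems = []
    for sub in sorted(Path(data_folder).iterdir()):
        if sub.is_dir():
            problems.extend(check_complex_dir(sub))
    return problems
```

Calling `check_data_folder("path_to/my_data_folder")` returns an empty list when every complex directory contains both files.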

Step 2: Setup Environment

We will set up the environment using Anaconda. Clone the current repo

git clone https://github.com/HannesStark/EquiBind

Create a new environment with all required packages using environment.yml (this can take a while). While in the project directory run:

conda env create

Activate the environment

conda activate equibind

Here are the requirements themselves if you want to install them manually instead of using the environment.yml:

python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard

Step 3: Predict Binding Structures!

In the config file configs_clean/inference.yml set the path to your input data folder inference_path: path_to/my_data_folder.
Then run:

python inference.py --config=configs_clean/inference.yml

Done! 🎉
Your results are saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output' and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt!
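
To get a quick overview of the predicted ligands, the .sdf files can be summarized with the standard library alone. This is a rough sketch under assumptions: it expects V2000-style SDF files (fixed-width counts line and atom block), reads only the first molecule of each file, and assumes the .sdf files sit directly in the output directory. The function names are illustrative and not part of the repository; for real analysis a toolkit such as RDKit is the safer choice.

```python
from pathlib import Path


def read_sdf_coords(sdf_text):
    """Parse atom coordinates of the first molecule in a V2000 SDF string."""
    lines = sdf_text.splitlines()
    n_atoms = int(lines[3][:3])  # counts line: first 3 characters hold the atom count
    coords = []
    for line in lines[4:4 + n_atoms]:
        # Atom block: x, y, z are stored in three 10-character columns.
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
    return coords


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def summarize_outputs(output_directory):
    """Map each .sdf file name in output_directory to (atom count, centroid)."""
    summary = {}
    for path in sorted(Path(output_directory).glob("*.sdf")):
        coords = read_sdf_coords(path.read_text())
        summary[path.name] = (len(coords), centroid(coords))
    return summary
```

For example, `summarize_outputs("data/results/output")` lists the atom count and centroid of every predicted ligand, which can be compared against the receptor coordinates.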

Reproducing paper numbers

Download the data and place it as described in the "Dataset" section above.

Using the provided model weights

To predict binding structures using the provided model weights run:

python inference.py --config=configs_clean/inference_file_for_reproduce.yml

This will give you the results of EquiBind-U and then those of EquiBind after running the fast ligand point cloud fitting corrections.
The numbers are a bit better than what is reported in the paper. We will put the improved numbers into the next update of the paper.

Training a model yourself and using those weights

To train the model yourself, run:

python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml

The model weights are saved in the runs directory.
You can also start a TensorBoard server with tensorboard --logdir=runs and watch the model train.
To evaluate the model on the test set, change the run_dirs: entry of the config file inference_file_for_reproduce.yml to point to the directory produced in runs. Then you can run python inference.py --config=configs_clean/inference_file_for_reproduce.yml as above!

Reference

📃 Paper on arXiv

@misc{stark2022equibind,
      title={EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction}, 
      author={Hannes Stärk and Octavian-Eugen Ganea and Lagnajit Pattanaik and Regina Barzilay and Tommi Jaakkola},
      year={2022}
}

Comments

joblib Parallel issue with specific complex '3m1s' in the PDBBind data

It could be a problem with RDKit. My version is "rdkit 2021.09.4".

import os
import pickle

# Assumption: read_molecule is the helper from the EquiBind repo (commons/process_mols.py)
from commons.process_mols import read_molecule

pdbbind_dir = "PDBBind_processed/"
name = '3m1s'
lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.sdf'), sanitize=True,
                    remove_hs=True)
if lig is None:  # read mol2 file if sdf file cannot be sanitized
    lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.mol2'), sanitize=True,
                        remove_hs=True)
# lig = Chem.MolFromSmiles('O=C[Ru+9]12345(C6=C1C2C3=C64)n1c2ccc(O)cc2c2c3c(c4ccc[n+]5c4c21)C(=O)NC3=O')
with open("test.pkl", "wb") as f:
    pickle.dump(lig, f)
with open("test.pkl", "rb") as f:
    pickle.load(f)

RuntimeError: invalid value in pickle

opened by luwei0917 13

Not able to run with Python 2. DGL-cuda without CUDA GPU.

python inference.py --config=configs_clean/inference.yml
  File "inference.py", line 119
    sys.stdout = Logger(logpath=os.path.join(os.path.dirname(args.checkpoint), f'inference.log'), syspart=sys.stdout)
    ^
SyntaxError: invalid syntax

opened by GsGithub17 12

Do the current models support multiple suggested outputs?

As far as I know, the current models return only one output for each ligand-receptor pair. Can the current model be extended to output multiple suggested binding sites with a ranking?

opened by PhungVanDuy 6

Scale off for results

The program runs without issue, but the scale of the SDF molecule does not match the input scale/protein. Have you seen this before? Is there any recourse here?

opened by jadolfbr 6

FileNotFoundError: train_arguments.yaml

Hi, I am attempting to run this software on an Ubuntu virtual machine. Setting up the environment went smoothly. However, when I try to run the inference.py script, I get the following error:

Traceback (most recent call last):
  File "/home/tony/Documents/EquiBind-main/inference.py", line 460, in <module>
    with open(os.path.join(os.path.dirname(args.checkpoint), 'train_arguments.yaml'), 'r') as arg_file:
FileNotFoundError: [Errno 2] No such file or directory: 'runs/flexible_self_docking/train_arguments.yaml'

It seems like the path concatenation for train_arguments.yaml is not working correctly. Hopefully this is quite an easy fix?

Thanks in advance for your help.

opened by Tonylac77 6
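
According to the traceback in this report, inference.py looks for train_arguments.yaml in the same directory as the checkpoint file. A small pre-flight check along these lines can show which file is missing before the script is started. This is only a sketch: the helper names are hypothetical, and the checkpoint file name used in the usage note is an assumption, not taken from the repository.

```python
import os


def expected_train_args_path(checkpoint_path):
    """Path where, per the traceback, train_arguments.yaml is expected."""
    return os.path.join(os.path.dirname(checkpoint_path), "train_arguments.yaml")


def missing_run_files(checkpoint_path):
    """Return the files expected in the run directory that do not exist."""
    candidates = [checkpoint_path, expected_train_args_path(checkpoint_path)]
    return [path for path in candidates if not os.path.isfile(path)]
```

For example, `missing_run_files("runs/flexible_self_docking/best_checkpoint.pt")` (hypothetical checkpoint name) returns an empty list only if both the checkpoint and train_arguments.yaml are present, relative to the current working directory.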

Error when run get_receptor_inference

First of all, your research has had a huge impact on AI-based drug discovery. Thank you so much!!!

I got the error below. My inputs are a protein PDB file from https://www.rcsb.org/ and a 3D ligand conformer from PubChem.

[2022-04-06 19:21:16.672575] [ Using Seed :  1  ]

Processing SOS1: complex 1 of 1
Trying to load data/my_data_folder/SOS1/SOS1_ligand.sdf
Docking the receptor data/my_data_folder/SOS1/SOS1_protein.pdb
To the ligand data/my_data_folder/SOS1/SOS1_ligand.sdf
Traceback (most recent call last):
  File "inference.py", line 473, in <module>
    inference_from_files(args)
  File "inference.py", line 340, in inference_from_files
    rec, rec_coords, c_alpha_coords, n_coords, c_coords = get_receptor_inference(rec_path)
  File "/home/sejeong/codes/EquiBind/commons/process_mols.py", line 421, in get_receptor_inference
    c_alpha_coords = np.concatenate(valid_c_alpha_coords, axis=0)  # [n_residues, 3]
  File "<__array_function__ internals>", line 6, in concatenate
ValueError: need at least one array to concatenate

Is there any problem in my input setting? I would be very grateful if you could give me a solution.

Thank you for your work again. :)

opened by SejeongPark8354 6
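
The variable name valid_c_alpha_coords in the traceback suggests that no residue passed the receptor filter. One possible first diagnostic, sketched here under that assumption, is to count how many residues in the PDB file have ATOM records for the backbone atoms N, CA and C. The function below is a hypothetical helper and does not reproduce the repository's own filtering logic; it relies only on the fixed-column layout of PDB ATOM records.

```python
def residues_with_backbone(pdb_text):
    """Count residues in ATOM records that contain N, CA and C atoms.

    Returns a tuple (complete_residues, total_residues).
    """
    atoms_per_residue = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, REMARK and other records
        atom_name = line[12:16].strip()        # columns 13-16: atom name
        residue_key = (line[21], line[22:27])  # chain ID, residue number + insertion code
        atoms_per_residue.setdefault(residue_key, set()).add(atom_name)
    complete = sum(1 for atoms in atoms_per_residue.values()
                   if {"N", "CA", "C"} <= atoms)
    return complete, len(atoms_per_residue)
```

If the first number is 0 for a file, the protein chain is probably stored in an unexpected way (for example only as HETATM records), which would be worth checking before rerunning inference.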

Duplicate of Issue #13 + The `model_type` parameter in .yml config files

Dear authors, I have just started working with your tool. I installed the CPU version (conda env create -f environment_cpuonly.yml) and ran it successfully. I used the 5tgz PDB structure as a target and docked 1743 known inhibitors. Versions:

rdkit 2021.09.5
openbabel 3.1.1

I ran it with this command:

python docking/EquiBind/inference.py --config=docking/equibind_run/inference.yml

inference.yml:

run_dirs:
  - flexible_self_docking # the resulting coordinates will be saved here as tensors in a .pt file (but also as .sdf files if you specify an "output_directory" below)
inference_path: 'docking/equibind_run' # this should be your input file path as described in the main readme
test_names: timesplit_test
output_directory: 'docking/equibind_run/output' # the predicted ligands will be saved as .sdf file here
run_corrections: True
use_rdkit_coords: False # generates the coordinates of the ligand with rdkit instead of using the provided conformer. If you already have a 3D structure that you want to use as initial conformer, then lea$
save_trajectories: False
num_confs: 1 # usually this should be 1
seed: 120
device: cpu

Initial conformers were obtained with RDKit (params = AllChem.ETKDGv3(); AllChem.EmbedMolecule(mol, params)). All missing hydrogens were added to the ligands (with RDKit) and to the protein structure (with Chimera). As input I used SDF files for the ligands and a PDB file for the protein (each ligand and the protein were put into separate directories). Example of the input files: CHEMBL1088245_protein.pdb.txt CHEMBL1088245_ligand.sdf.txt

The problem is that the resulting binding poses are incorrect: ligand atoms overlap with protein atoms.
lig_equibind_corrected.sdf.txt Maybe I didn't set some special parameters? Could you help me please? Thank you!

duplicate
opened by avnikonenko 6
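
The overlap described in this report can be quantified with a simple distance check between ligand and protein atoms. The sketch below counts ligand atoms that come closer than a cutoff to any protein atom. It is illustrative only: the function name is hypothetical, the 2.0 Å default cutoff is an arbitrary assumption, and the brute-force loop is meant for small inputs rather than as a replacement for a proper clash score.

```python
def count_clashing_atoms(ligand_coords, protein_coords, cutoff=2.0):
    """Count ligand atoms closer than `cutoff` (in Å) to any protein atom.

    Both inputs are sequences of (x, y, z) tuples. The cutoff is an
    assumed threshold, not a value taken from the EquiBind paper.
    """
    cutoff_sq = cutoff * cutoff
    clashing = 0
    for lx, ly, lz in ligand_coords:
        for px, py, pz in protein_coords:
            dist_sq = (lx - px) ** 2 + (ly - py) ** 2 + (lz - pz) ** 2
            if dist_sq < cutoff_sq:
                clashing += 1
                break  # one close protein atom is enough for this ligand atom
    return clashing
```

Comparing this count for the initial conformer placement and for the predicted pose gives a rough indication of whether the prediction itself introduces the overlaps.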

Torch not compiled with CUDA enabled

Hello, when I run multiligand_inference.py, I get this error:

python multiligand_inference.py -o ./my_data_folder/result/ -r ./my_data_folder/multiligand-test/5v4q_protein.pdb -l ./my_data_folder/multiligand-test/ligand.sdf

Namespace(batch_size=8, checkpoint=None, config=None, device='cpu', lazy_dataload=None, lig_slice=None, ligands_sdf='./my_data_folder/multiligand-test/ligand.sdf', n_workers_data_load=0, num_confs=1, output_directory='./my_data_folder/result/', rec_pdb='./my_data_folder/multiligand-test/5v4q_protein.pdb', run_corrections=True, seed=1, skip_in_output=True, train_args=None, use_rdkit_coords=False)
[2022-07-08 10:34:33.719185] [ Using Seed : 1 ]
Found 0 previously calculated ligands
device = cpu
Entering batch ending in index 5/5
Traceback (most recent call last):
  File "multiligand_inference.py", line 278, in <module>
    main()
  File "multiligand_inference.py", line 275, in main
    write_while_inferring(lig_loader, model, args)
  File "multiligand_inference.py", line 217, in write_while_inferring
    lig_graphs = lig_graphs.to(args.device)
  File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/heterograph.py", line 5448, in to
    ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
  File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/utils/internal.py", line 533, in to_dgl_context
    device_id = F.device_id(ctx)
  File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 90, in device_id
    return 0 if ctx.type == 'cpu' else th.cuda.current_device()
  File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
    _lazy_init()
  File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled

How can I solve this error?

opened by struggle007 4
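
The traceback in this report reaches th.cuda.current_device(), which means the device passed to DGL was not a CPU device even though device='cpu' was requested. One defensive pattern, sketched here as an assumption rather than as the repository's fix, is to resolve the device string once and fall back to the CPU when CUDA is unavailable. The helper name is hypothetical.

```python
def resolve_device(requested, cuda_available):
    """Return the device string to use, falling back to 'cpu' without CUDA.

    `requested` is a device string such as 'cpu', 'cuda' or 'cuda:0'.
    `cuda_available` is a bool; in a PyTorch script it could come from
    torch.cuda.is_available().
    """
    if requested.startswith("cuda") and not cuda_available:
        return "cpu"
    return requested
```

The resolved string would then be used for every `.to(...)` call, so that model, receptor graph and ligand graphs all end up on the same device.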

Questions about SE(3)-equivariance

Hi,

I'm still a newbie in ligand-receptor binding. I ran an experiment on translation equivariance with the 5ol3 complex from the PDBBind dataset: I shifted both the ligand and the receptor by 1 Å along the y-axis, and separately shifted only the ligand. However, the predicted binding poses differed between no shift (the control group), shifting only the ligand, and shifting both ligand and receptor. The figure below shows the conformations predicted by the model. It does not look like the model guarantees SE(3)-equivariance. I would like to understand why these three conformations are not similar.

Many Thanks!

opened by Surviveagainsttheodds 4
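
The experiment in this question can be phrased as two numerical checks: shifting ligand and receptor together should shift the prediction by the same vector, and shifting only the ligand should leave the prediction unchanged if the method is independent of the initial ligand placement. The sketch below demonstrates both checks on a toy stand-in (`toy_dock`, which simply moves the ligand centroid onto the receptor centroid). It is not EquiBind; with the real model, the call to `toy_dock` would be replaced by the inference pipeline, which is an assumption about how one would wire it up.

```python
def centroid(points):
    """Arithmetic mean of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def translate(points, shift):
    """Translate every point by the vector `shift`."""
    return [tuple(c + s for c, s in zip(p, shift)) for p in points]


def toy_dock(ligand, receptor):
    """Toy stand-in for a docking model: align ligand and receptor centroids."""
    lig_c, rec_c = centroid(ligand), centroid(receptor)
    offset = tuple(r - l for r, l in zip(rec_c, lig_c))
    return translate(ligand, offset)


def max_deviation(a, b):
    """Largest absolute coordinate difference between two point lists."""
    return max(abs(x - y) for p, q in zip(a, b) for x, y in zip(p, q))


ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
receptor = [(10.0, 2.0, 1.0), (12.0, 4.0, 3.0), (14.0, 0.0, 2.0)]
shift = (0.0, 1.0, 0.0)  # 1 Å along the y-axis, as in the question

base = toy_dock(ligand, receptor)
both = toy_dock(translate(ligand, shift), translate(receptor, shift))
lig_only = toy_dock(translate(ligand, shift), receptor)

# Equivariance: shifting both inputs shifts the output by the same vector.
dev_both = max_deviation(both, translate(base, shift))
# Independence of initial ligand placement: output is unchanged.
dev_lig_only = max_deviation(lig_only, base)
```

Both deviations are numerically zero for the toy function. Running the same comparison on real predictions turns "the poses look different" into a number, which makes it easier to separate small numerical differences from a genuine violation.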

Support for inference on multiple ligands in sdf and smi formats

The main parts of these suggested changes are datasets/multiple_ligands.py and multiligand_inference.py.

datasets/multiple_ligands.py implements a pytorch dataset to load ligands from a given .sdf or .smi file, which when combined with a dataloader is able to batch the data for better GPU utilization.

multiligand_inference.py utilizes this dataloader to perform inference on a given .sdf or .smi file, writing results as the inference is being run, as a safeguard against losing work if the process crashes or is interrupted.

Suggested usage is

python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf

This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The output is 3 files in output_directory with the following names and contents:

failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
output.sdf - contains the conformers produced by EquiBind in .sdf format.

Along with these, a number of options are provided. A few of interest are:

--no_skip: By default, the script looks for failed.txt and success.txt in output_directory, skips all ligands with the same index as the ones listed in those files, considering them to be previously calculated work, and appends any further work to the files already present. --no_skip turns this behavior off and overwrites the 3 files in output_directory if they were already present.

--batch_size: Controls the batch size for sending the receptor and ligand graphs to the GPU. Be aware that due to how batching of graphs works, a large batch size will take up a lot of space.

--n_workers_data_load: Controls the number of workers spawned by the PyTorch DataLoader. These are responsible for the preprocessing of each batch, namely generating the ligand graph for each ligand.
opened by amfaber 4
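
The skipping behavior described for --no_skip can be pictured roughly as follows. This is a sketch of the idea, not the code from the pull request: the function names are hypothetical, and the line format of failed.txt and success.txt is assumed to be `<index> <name>`, based only on the description above.

```python
from pathlib import Path


def previously_processed_indices(output_directory):
    """Collect ligand indices already listed in failed.txt or success.txt."""
    done = set()
    for file_name in ("failed.txt", "success.txt"):
        path = Path(output_directory) / file_name
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            fields = line.split()
            # Assumed line format: "<index> <name>"; other lines are ignored.
            if fields and fields[0].isdigit():
                done.add(int(fields[0]))
    return done


def indices_to_run(n_ligands, output_directory, no_skip=False):
    """Indices still to be docked; with no_skip=True everything is rerun."""
    done = set() if no_skip else previously_processed_indices(output_directory)
    return [i for i in range(n_ligands) if i not in done]
```

With this structure an interrupted run can be restarted with the same command, and only the ligands that are not yet recorded in either file are processed again.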

Measure of inference quality

Hi,

First of all, thanks for developing this method - it is really something new.

When using the inference_from_files mode to infer various ligand conformations on several protein targets, does the tool report some measure of inference quality? Do you have some measure of how well a conformation fits the target? I can see intersection_losses_untuned being reported - can it be used?

Many thanks!

opened by hmms117 4

About the Training Efficiency

Dear authors:

Thanks for your great work and public code. I am trying to do some further research based on your work. However, I found that the training process is really slow, with less than 20% GPU utilization. The val loss on PDBBind is still decreasing after seven days of training on one V100 GPU. Do you have any advice for improving the training efficiency? I found that you left a TODO comment about running SVD in batches. Do you have any updates on this?

Many thanks!

opened by youqingxiaozhua 0

About dgllife version and usage

Hi there! In my experience, dgllife requires rdkit==2018.09.3 because of issues with some molecules. Which version of dgllife does this repo require? Or is dgllife not necessary?

opened by lichman0405 0

Fix for #46 and updates to internal ligand loading

The commit 'Fixed argument handling of "device"' should hopefully fix the bug in #46 and others. The other commit brings this repo up-to-date with my own changes to how ligand loading is done internally, which is now able to utilize the Multithreaded versions of SDMolSupplier and SmilesMolSupplier.

opened by amfaber 0

Is it possible to predict torsion angles directly from the model?

I reviewed the GeoMol paper, which is cited in section 3.2.2 of the paper. That model can predict torsion angles directly. EquiBind, however, only outputs approximate coordinates of the docked ligand and then aligns the torsion angles of an RDKit conformer to the predicted coordinates, in order to avoid an incorrect docked ligand conformer (wrong bond lengths and bond angles).

So why not predict the torsion angles directly from the EquiBind model? Is that a possible strategy?

Thanks. It is a wonderful job.

opened by lkfo415579 0

Owner

Hannes Stärk

MIT Research Intern • Geometric DL + Graphs ❤️ • M. Sc. Informatics from TU Munich
