EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction

Paper on arXiv

EquiBind is an SE(3)-equivariant geometric deep learning model that performs direct-shot prediction of both i) the receptor binding location (blind docking) and ii) the ligand’s bound pose and orientation. EquiBind achieves significant speed-ups and better quality compared to traditional and recent baselines. If you have questions, don't hesitate to open an issue, or ask me via [email protected] or social media, or ask Octavian Ganea via [email protected]. We are happy to hear from you!

Dataset

Our preprocessed data (see the dataset section in the paper's appendix) is available from Zenodo.
The files in data contain the names for the time-based data split.

If you want to train one of our models with the data then:

  1. download it from zenodo
  2. unzip the directory and place it into data such that you have the path data/PDBBind

Use the provided model weights to predict binding structures for your own protein-ligand pairs:

Step 1: What you need as input

Ligand files in .mol2, .sdf, .pdbqt, or .pdb format.
Receptor files in .pdb format.
For each complex you want to predict, you need a directory containing the ligand file and the receptor file, like this:

my_data_folder
└───name1
    │   name1_protein.pdb
    │   name1_ligand.sdf
└───name2
    │   name2_protein.pdb
    │   name2_ligand.sdf
...
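
Before running inference, it can help to check that every complex directory is complete. The following is a minimal sketch, not part of EquiBind: the function name `find_complexes` is hypothetical, and it assumes the strict `<name>_protein.pdb` / `<name>_ligand.<ext>` naming shown in the layout above (the inference script itself may accept other file names).

```python
from pathlib import Path

# Ligand extensions listed in Step 1; the receptor must be a .pdb file.
LIGAND_EXTENSIONS = (".mol2", ".sdf", ".pdbqt", ".pdb")


def find_complexes(data_folder):
    """Return (ready, incomplete) lists of complex directory names.

    Assumes the naming scheme <name>/<name>_protein.pdb and
    <name>/<name>_ligand.<ext> from the layout above.
    """
    ready, incomplete = [], []
    for sub in sorted(Path(data_folder).iterdir()):
        if not sub.is_dir():
            continue  # ignore stray files next to the complex directories
        has_protein = (sub / f"{sub.name}_protein.pdb").is_file()
        has_ligand = any(
            (sub / f"{sub.name}_ligand{ext}").is_file()
            for ext in LIGAND_EXTENSIONS
        )
        if has_protein and has_ligand:
            ready.append(sub.name)
        else:
            incomplete.append(sub.name)
    return ready, incomplete
```

Calling `find_complexes("my_data_folder")` returns the directories that contain both files and those that are missing one, so incomplete inputs can be fixed before inference starts.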

Step 2: Setup Environment

We will set up the environment using Anaconda. Clone the current repo

git clone https://github.com/HannesStark/EquiBind

Create a new environment with all required packages using environment.yml (this can take a while). While in the project directory run:

conda env create

Activate the environment

conda activate equibind

Here are the requirements themselves if you want to install them manually instead of using the environment.yml:

python=3.7
pytorch 1.10
torchvision
cudatoolkit=10.2
torchaudio
dgl-cuda10.2
rdkit
openbabel
biopython
biopandas
pot
dgllife
joblib
pyaml
icecream
matplotlib
tensorboard
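
After a manual install, a quick way to see what is still missing is to test whether each package can be located. This is a sketch only: the mapping from package names to Python import names is an assumption (for example, `biopython` is imported as `Bio`, `pot` as `ot`, and `pyaml` is checked through the `yaml` module it depends on), and `missing_packages` is a hypothetical helper, not part of the repo.

```python
import importlib.util

# Assumed import names for the packages listed above. Conda package names
# and Python import names differ for some of them.
IMPORT_NAMES = {
    "pytorch": "torch",
    "torchvision": "torchvision",
    "torchaudio": "torchaudio",
    "dgl-cuda10.2": "dgl",
    "rdkit": "rdkit",
    "openbabel": "openbabel",
    "biopython": "Bio",
    "biopandas": "biopandas",
    "pot": "ot",
    "dgllife": "dgllife",
    "joblib": "joblib",
    "pyaml": "yaml",
    "icecream": "icecream",
    "matplotlib": "matplotlib",
    "tensorboard": "tensorboard",
}


def missing_packages(import_names=None):
    """Return the sorted package names whose import name cannot be found."""
    if import_names is None:
        import_names = IMPORT_NAMES
    return sorted(
        package
        for package, module in import_names.items()
        if importlib.util.find_spec(module) is None
    )
```

`missing_packages()` returns an empty list when every listed package can be found in the active environment. It does not check versions.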

Step 3: Predict Binding Structures!

In the config file configs_clean/inference.yml set the path to your input data folder inference_path: path_to/my_data_folder.
Then run:

python inference.py --config=configs_clean/inference.yml

Done! 🎉
Your results are saved as .sdf files in the directory specified in the config file under output_directory: 'data/results/output' and as tensors at runs/flexible_self_docking/predictions_RDKitFalse.pt!
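
To take a quick look at a predicted pose without extra dependencies, the sketch below reads atom coordinates from an .sdf file. It assumes the file uses the common V2000 molfile layout with fixed-width columns and reads only the first molecule in the file. The names `read_sdf_coordinates` and `centroid` are illustrative and not part of EquiBind; RDKit, which is already in the environment, is the more robust choice for real work.

```python
def read_sdf_coordinates(path):
    """Return [(element, x, y, z), ...] for the first molecule in an SDF file.

    Assumes a V2000 molblock: three header lines, a counts line, then one
    fixed-width line per atom.
    """
    with open(path) as handle:
        lines = handle.read().splitlines()
    counts = lines[3]  # the fourth line holds the atom and bond counts
    if "V2000" not in counts:
        raise ValueError("only V2000 molblocks are handled in this sketch")
    n_atoms = int(counts[0:3])
    atoms = []
    for line in lines[4:4 + n_atoms]:
        x = float(line[0:10])
        y = float(line[10:20])
        z = float(line[20:30])
        element = line[31:34].strip()
        atoms.append((element, x, y, z))
    return atoms


def centroid(atoms):
    """Return the mean (x, y, z) position of the atoms."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in (1, 2, 3))
```

Comparing the centroid of a predicted ligand with the coordinate range of the receptor is a simple sanity check that the pose landed near the protein.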

Reproducing paper numbers

Download the data and place it as described in the "Dataset" section above.

Using the provided model weights

To predict binding structures using the provided model weights run:

python inference.py --config=configs_clean/inference_file_for_reproduce.yml

This will give you the results of EquiBind-U and then those of EquiBind after running the fast ligand point cloud fitting corrections.
The numbers are a bit better than what is reported in the paper. We will put the improved numbers into the next update of the paper.

Training a model yourself and using those weights

To train the model yourself, run:

python train.py --config=configs_clean/RDKitCoords_flexible_self_docking.yml

The model weights are saved in the runs directory.
You can also start a tensorboard server tensorboard --logdir=runs and watch the model train.
To evaluate the model on the test set, change the run_dirs: entry of the config file inference_file_for_reproduce.yml to point to the directory produced in runs. Then you can run python inference.py --config=configs_clean/inference_file_for_reproduce.yml as above!

Reference

📃 Paper on arXiv

@misc{stark2022equibind,
      title={EquiBind: Geometric Deep Learning for Drug Binding Structure Prediction}, 
      author={Hannes Stärk and Octavian-Eugen Ganea and Lagnajit Pattanaik and Regina Barzilay and Tommi Jaakkola},
      year={2022}
}
Comments
  • joblib Parallel issue with specific complex '3m1s' in the PDBBind data

    It could be a problem with RDKit. My version is "rdkit 2021.09.4".

    import os
    import pickle

    # read_molecule is assumed to come from EquiBind's commons/process_mols.py
    from commons.process_mols import read_molecule

    pdbbind_dir = "PDBBind_processed/"
    name = '3m1s'
    lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.sdf'), sanitize=True,
                        remove_hs=True)
    if lig is None:  # read mol2 file if sdf file cannot be sanitized
        lig = read_molecule(os.path.join(pdbbind_dir, name, f'{name}_ligand.mol2'), sanitize=True,
                            remove_hs=True)
    # lig = Chem.MolFromSmiles('O=C[Ru+9]12345(C6=C1C2C3=C64)n1c2ccc(O)cc2c2c3c(c4ccc[n+]5c4c21)C(=O)NC3=O')
    with open("test.pkl", "wb") as f:
        pickle.dump(lig, f)
    with open("test.pkl", "rb") as f:
        pickle.load(f)  # this round trip triggers the error reported below
    

    RuntimeError: invalid value in pickle

    opened by luwei0917 13
  • Not able to run with python 2.  DGL-cuda without CUDA GPU.

    python inference.py --config=configs_clean/inference.yml
      File "inference.py", line 119
        sys.stdout = Logger(logpath=os.path.join(os.path.dirname(args.checkpoint), f'inference.log'), syspart=sys.stdout)
        ^
    SyntaxError: invalid syntax

    opened by GsGithub17 12
  • Do the current models support multiple suggested outputs?

    As far as I know, the current models return only one output for each ligand-receptor pair. Could the current model be extended to output multiple suggested binding sites with a ranking?

    opened by PhungVanDuy 6
  • Scale off for results

    The program runs without issue, but the scale of the SDF molecule does not match the input scale/protein. Have you seen this before? Is there any recourse here?

    opened by jadolfbr 6
  • FileNotFoundError: train_arguments.yaml

    Hi, I am attempting to run this software on an Ubuntu virtual machine. Setting up the environment went smoothly. However, when I try to run the inference.py script, I get the following error:

    Traceback (most recent call last):
      File "/home/tony/Documents/EquiBind-main/inference.py", line 460, in <module>
        with open(os.path.join(os.path.dirname(args.checkpoint), 'train_arguments.yaml'), 'r') as arg_file:
    FileNotFoundError: [Errno 2] No such file or directory: 'runs/flexible_self_docking/train_arguments.yaml'
    

    It seems like the concatenation of the path for the train_arguments.yaml is not working correctly, hopefully this is quite an easy fix though?

    Thanks in advance for your help.

    opened by Tonylac77 6
  • Error when run get_receptor_inference

    First of all, your research has had a huge impact on AI-based drug discovery. Thank you so much!

    I got the error below. My inputs are a protein PDB file from https://www.rcsb.org/ and a 3D ligand conformer from PubChem.

    [2022-04-06 19:21:16.672575] [ Using Seed :  1  ]
    
    Processing SOS1: complex 1 of 1
    Trying to load data/my_data_folder/SOS1/SOS1_ligand.sdf
    Docking the receptor data/my_data_folder/SOS1/SOS1_protein.pdb
    To the ligand data/my_data_folder/SOS1/SOS1_ligand.sdf
    Traceback (most recent call last):
      File "inference.py", line 473, in <module>
        inference_from_files(args)
      File "inference.py", line 340, in inference_from_files
        rec, rec_coords, c_alpha_coords, n_coords, c_coords = get_receptor_inference(rec_path)
      File "/home/sejeong/codes/EquiBind/commons/process_mols.py", line 421, in get_receptor_inference
        c_alpha_coords = np.concatenate(valid_c_alpha_coords, axis=0)  # [n_residues, 3]
      File "<__array_function__ internals>", line 6, in concatenate
    ValueError: need at least one array to concatenate
    

    Is there any problem in my input setting? I would be very grateful if you could give me a solution.

    Thank you for your work again. :)

    opened by SejeongPark8354 6
  • Duplicate of Issue #13 + The `model_type` parameter in .yml config files

    Dear authors, I just started working with your tool. I installed the CPU version (conda env create -f environment_cpuonly.yml) and ran it successfully. I used the 5tgz PDB structure as the target and docked 1743 known inhibitors. Versions:

    rdkit                     2021.09.5
    openbabel                 3.1.1
    

    I ran it with this command:

     python docking/EquiBind/inference.py --config=docking/equibind_run/inference.yml
    

    inference.yml:

    run_dirs:
      - flexible_self_docking # the resulting coordinates will be saved here as tensors in a .pt file (but also as .sdf files if you specify an "output_directory" below)
    inference_path: 'docking/equibind_run' # this should be your input file path as described in the main readme
    
    test_names: timesplit_test
    output_directory: 'docking/equibind_run/output' # the predicted ligands will be saved as .sdf file here
    run_corrections: True
    use_rdkit_coords: False # generates the coordinates of the ligand with rdkit instead of using the provided conformer. If you already have a 3D structure that you want to use as initial conformer, then lea$
    save_trajectories: False
    
    num_confs: 1 # usually this should be 1
    seed: 120
    device: cpu
    

    Initial conformers were obtained by rdkit ( params = AllChem.ETKDGv3(); AllChem.EmbedMolecule(mol, params)). All missing hydrogens were added to the ligands (by rdkit) and to the protein structure (by chimera). As input I used sdf files of ligands and pdb file of protein (put each ligand and the protein to separate directories). Example of the input files: CHEMBL1088245_protein.pdb.txt CHEMBL1088245_ligand.sdf.txt

    The problem is that the resulting binding poses are incorrect: the ligand's atoms overlap the protein's atoms.
    lig_equibind_corrected.sdf.txt Maybe I didn't set some special parameters? Could you help me please? Thank you!

    duplicate 
    opened by avnikonenko 6
  • Torch not compiled with CUDA enabled

    Hello, when I run multiligand_inference.py, I get this error:

    python multiligand_inference.py -o ./my_data_folder/result/ -r ./my_data_folder/multiligand-test/5v4q_protein.pdb -l ./my_data_folder/multiligand-test/ligand.sdf

    Namespace(batch_size=8, checkpoint=None, config=None, device='cpu', lazy_dataload=None, lig_slice=None, ligands_sdf='./my_data_folder/multiligand-test/ligand.sdf', n_workers_data_load=0, num_confs=1, output_directory='./my_data_folder/result/', rec_pdb='./my_data_folder/multiligand-test/5v4q_protein.pdb', run_corrections=True, seed=1, skip_in_output=True, train_args=None, use_rdkit_coords=False)
    [2022-07-08 10:34:33.719185] [ Using Seed : 1 ]
    Found 0 previously calculated ligands
    device = cpu
    Entering batch ending in index 5/5
    Traceback (most recent call last):
      File "multiligand_inference.py", line 278, in <module>
        main()
      File "multiligand_inference.py", line 275, in main
        write_while_inferring(lig_loader, model, args)
      File "multiligand_inference.py", line 217, in write_while_inferring
        lig_graphs = lig_graphs.to(args.device)
      File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/heterograph.py", line 5448, in to
        ret._graph = self._graph.copy_to(utils.to_dgl_context(device))
      File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/utils/internal.py", line 533, in to_dgl_context
        device_id = F.device_id(ctx)
      File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/dgl/backend/pytorch/tensor.py", line 90, in device_id
        return 0 if ctx.type == 'cpu' else th.cuda.current_device()
      File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 479, in current_device
        _lazy_init()
      File "/data/anaconda/envs/equibind/lib/python3.7/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
        raise AssertionError("Torch not compiled with CUDA enabled")
    AssertionError: Torch not compiled with CUDA enabled

    How can I solve this error?

    opened by struggle007 4
  • Questions about SE(3)-equivariant

    Hi,

    I'm still a newbie in ligand-receptor binding. I ran an experiment on translation equivariance with the 5ol3 complex from the PDBBind dataset. I translated both the ligand and the receptor of the 5ol3 complex by 1 Å along the y-axis, and separately translated the ligand only. However, the predicted binding poses differed between no shift (the control group), shifting the ligand only, and shifting both ligand and receptor. The figure below shows the conformations inferred by the model. It does not look like the model guarantees SE(3)-equivariance. I would like to understand why these three conformations are not similar.

    Many Thanks!

    5ol3_ligand_inferenced_by_equibind
    opened by Surviveagainsttheodds 4
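
For the translation experiment described in this issue, the property under test can be stated concretely: if both inputs are shifted by a vector t, a translation-equivariant predictor should return the control prediction shifted by the same t. The sketch below only checks that definition on two sets of predicted coordinates. The function names are hypothetical, it assumes both poses list the atoms in the same order, and it says nothing about why a particular run deviates.

```python
import math


def translate(coords, shift):
    """Return coords with the vector `shift` added to every point."""
    return [tuple(c + s for c, s in zip(point, shift)) for point in coords]


def max_deviation(coords_a, coords_b):
    """Largest per-atom Euclidean distance between two equally ordered poses."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must have the same number of atoms")
    return max(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(pa, pb)))
        for pa, pb in zip(coords_a, coords_b)
    )


def is_translation_equivariant(pred_control, pred_shifted, shift, tol=1e-3):
    """True if pred_shifted equals pred_control moved by `shift`, within tol.

    `tol` is in the same unit as the coordinates (Å for PDB/SDF files).
    """
    return max_deviation(translate(pred_control, shift), pred_shifted) <= tol
```

For the 1 Å shift along the y-axis, the call would be `is_translation_equivariant(control, shifted, (0.0, 1.0, 0.0))`, with `control` and `shifted` being the predicted ligand coordinates from the two runs.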
  • Support for inference on multiple ligands in sdf and smi formats

    The main part of these suggested changes are the datasets/multiple_ligands.py and multiligand_inference.py.

    datasets/multiple_ligands.py implements a pytorch dataset to load ligands from a given .sdf or .smi file, which when combined with a dataloader is able to batch the data for better GPU utilization.

    multiligand_inference.py utilizes this dataloader to perform inference on a given .sdf or .smi file, writing results as the inference is being run, as a safeguard against losing work if the process crashes or is interrupted.

    Suggested usage is

    python multiligand_inference.py -o path/to/output_directory -r path/to/receptor.pdb -l path/to/ligands.sdf
    

    This runs EquiBind on every ligand in ligands.sdf against the protein in receptor.pdb. The output is 3 files in output_directory with the following names and contents:

    failed.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference failed in a way that was caught and handled.
    success.txt - contains the index (in the file ligands.sdf) and name of every molecule for which inference succeeded.
    output.sdf - contains the conformers produced by EquiBind in .sdf format.

    Alongside these, a number of options are provided. A few of interest are:

    --no_skip: By default, the script looks for failed.txt and success.txt in output_directory and skips all ligands with the same index as the ones listed in those files, considering them to be previously calculated work, and appends any further work to the files already present. --no_skip turns this behavior off and overwrites the 3 files in output_directory if they were already present.

    --batch_size: Controls the batch size for sending the receptor and ligand graphs to the GPU. Be aware that, due to how batching of graphs works, a large batch size will take up a lot of memory.

    --n_workers_data_load: Controls the number of workers spawned by the PyTorch DataLoader. These are responsible for the preprocessing of each batch, namely generating the ligand graph for each ligand.
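
The default skipping behavior described for --no_skip can be pictured with a short sketch. This is not the script's actual code: the helper names are hypothetical, and the file format is an assumption (one ligand per line, starting with its integer index followed by whitespace and the name).

```python
from pathlib import Path


def previously_processed(output_directory):
    """Collect ligand indices already listed in success.txt or failed.txt.

    Assumes each line starts with the integer index of the ligand.
    """
    done = set()
    for filename in ("success.txt", "failed.txt"):
        path = Path(output_directory) / filename
        if not path.is_file():
            continue  # nothing recorded yet for this file
        for line in path.read_text().splitlines():
            fields = line.split(maxsplit=1)
            if fields and fields[0].isdigit():
                done.add(int(fields[0]))
    return done


def remaining_indices(n_ligands, output_directory, no_skip=False):
    """Return the ligand indices that still need inference."""
    done = set() if no_skip else previously_processed(output_directory)
    return [index for index in range(n_ligands) if index not in done]
```

With `no_skip=True` every index is returned again, which mirrors the description of the flag: earlier results are ignored and recomputed.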

    opened by amfaber 4
  • Measure of inference quality

    Hi,

    First of all, thanks for developing this method - it is really something new.

    When using the inference_from_files mode to infer various ligand conformations on several protein targets, does the tool report a measure of inference quality? Do you have a measure of how well a conformation fits the target? I can see intersection_losses_untuned being reported - can it be used?

    Many thanks!

    opened by hmms117 4
  • About the Training Efficiency

    Dear authors:

    Thanks for your great work and public codes. I am trying to do some further research based on your work. However, I found the training process is really slow, with less than 20% GPU utilization. The val loss on PDBBind is still decreasing after seven days of training on one V100 GPU. Do you have any advice to improve the training efficiency? I found you leave a TODO comment to run SVD in batches. Do you have any updates on this?

    Many thanks!

    opened by youqingxiaozhua 0
  • about dgllife version and usage

    Hi there! In my experience, dgllife requires rdkit==2018.09.3 because of issues with some molecules. Which version of dgllife does this repo require? Or is dgllife not necessary?

    opened by lichman0405 0
  • Fix for #46 and updates to internal ligand loading

    The commit 'Fixed argument handling of "device"' should hopefully fix the bug in #46 and others. The other commit brings this repo up-to-date with my own changes to how ligand loading is done internally, which is now able to utilize the Multithreaded versions of SDMolSupplier and SmilesMolSupplier.

    opened by amfaber 0
  • Is it possible to predict torsion angles directly from the model?

    I reviewed the GeoMol paper, which is cited in Section 3.2.2 of the paper. That model can predict torsion angles directly. EquiBind, however, only outputs approximate coordinates of the docked ligand and then aligns the torsion angles of an RDKit conformer to the predicted coordinates, in order to avoid an incorrect docked ligand conformer (wrong bond lengths and bond angles).

    So why not predict the torsion angles directly from the EquiBind model? Is that a possible strategy?

    Thanks. It is a wonderful job.

    opened by lkfo415579 0