Source code for EquiDock: Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking (ICLR 2022)
Please cite "Independent SE(3)-Equivariant Models for End-to-End Rigid Protein Docking", Ganea et. al, Spotlight @ ICLR 2022
Dependencies
python==3.9.10
numpy==1.22.1
cuda==10.1
torch==1.10.2
dgl==0.7.0
biopandas==0.2.8
ot==0.7.0
rdkit==2021.09.4
dgllife==0.2.8
joblib==1.1.0
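If you want to check that the pinned versions resolved correctly, here is a minimal sanity-check script (ours, not part of the repository):

# check_env.py -- print the versions of the main dependencies (hypothetical helper, not in the repo)
import numpy, torch, dgl, biopandas, ot, joblib, rdkit

for pkg in (numpy, torch, dgl, biopandas, ot, joblib, rdkit):
    print(pkg.__name__, pkg.__version__)
print('CUDA available:', torch.cuda.is_available())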
DB5.5 data
The raw DB5.5 dataset is already placed in the data directory; it comes from the original source: https://zlab.umassmed.edu/benchmark/ (see also https://github.com/drorlab/DIPS).
The raw PDB files of the DB5.5 dataset are in the directory ./data/benchmark5.5/structures
Then preprocess the raw data as follows to prepare data for rigid body docking:
# prepare data for rigid body docking
python preprocess_raw_data.py -n_jobs 40 -data db5 -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8
By default, preprocess_raw_data.py uses 10 neighbors for each node when constructing the graph and uses only residues as nodes (with coordinates taken from the alpha carbons); see the graph-construction sketch after the listing below. After running preprocess_raw_data.py, you will get the following ready-for-training data directory:
./cache/db5_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/
with files
$ ls cache/db5_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/
label_test.pkl label_val.pkl ligand_graph_train.bin receptor_graph_test.bin receptor_graph_val.bin
label_train.pkl ligand_graph_test.bin ligand_graph_val.bin receptor_graph_train.bin
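As a rough illustration of what the flags above mean, here is a minimal sketch of this kind of residue graph construction (ours, for intuition only; preprocess_raw_data.py is the authoritative implementation, and the function name here is hypothetical):

# Sketch of a residue graph: one node per residue, located at its alpha carbon,
# connected to at most `max_neighbor` residues within `cutoff` Angstroms.
import numpy as np
import torch
import dgl

def residue_graph(alpha_c_coords, max_neighbor=10, cutoff=30.0):
    # alpha_c_coords: (N, 3) array of alpha-carbon coordinates, one row per residue
    n = alpha_c_coords.shape[0]
    dist = np.linalg.norm(alpha_c_coords[:, None, :] - alpha_c_coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # no self-edges
    src, dst = [], []
    for i in range(n):
        for j in np.argsort(dist[i])[:max_neighbor]:  # up to 10 nearest residues...
            if dist[i, j] <= cutoff:                  # ...within the 30 A cutoff
                src.append(j)
                dst.append(i)
    g = dgl.graph((torch.tensor(src), torch.tensor(dst)), num_nodes=n)
    g.ndata['x'] = torch.tensor(alpha_c_coords, dtype=torch.float32)
    return g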
DIPS data
Download the dataset (see https://github.com/drorlab/DIPS and https://github.com/amorehead/DIPS-Plus):
mkdir -p ./DIPS/raw/pdb
rsync -rlpt -v -z --delete --port=33444 \
rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ ./DIPS/raw/pdb
Then follow the first steps from https://github.com/amorehead/DIPS-Plus:
# Create data directories (if not already created):
mkdir project/datasets/DIPS/raw project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim project/datasets/DIPS/interim/external_feats project/datasets/DIPS/final project/datasets/DIPS/final/raw project/datasets/DIPS/final/processed
# Download the raw PDB files:
rsync -rlpt -v -z --delete --port=33444 --include='*.gz' --include='*.xz' --include='*/' --exclude '*' \
rsync.rcsb.org::ftp_data/biounit/coordinates/divided/ project/datasets/DIPS/raw/pdb
# Extract the raw PDB files:
python3 project/datasets/builder/extract_raw_pdb_gz_archives.py project/datasets/DIPS/raw/pdb
# Process the raw PDB data into associated pair files:
python3 project/datasets/builder/make_dataset.py project/datasets/DIPS/raw/pdb project/datasets/DIPS/interim --num_cpus 28 --source_type rcsb --bound
# Apply additional filtering criteria:
python3 project/datasets/builder/prune_pairs.py project/datasets/DIPS/interim/pairs project/datasets/DIPS/filters project/datasets/DIPS/interim/pairs-pruned --num_cpus 28
Then place the file utils/partition_dips.py in the DIPS/src/ folder and run, from the DIPS/ folder: python src/partition_dips.py data/DIPS/interim/pairs-pruned/. This creates the train/validation/test splits of the 42K filtered pairs in DIPS, using the exact splits from the pairs-postprocessed-*.txt files used in our paper (a small reader for these files is sketched after the listing below). You should now have the following directory:
$ ls ./DIPS/data/DIPS/interim/pairs-pruned
0g a6 ax bo cf d6 dx eo ff g6 gx ho if j6 jx ko lf m6 mx no of p6 pt qk rb s2 st tk ub v2 vt wk xb y2 yt zk
17 a7 ay bp cg d7 dy ep fg g7 gy hp ig j7 jy kp lg m7 my np og p7 pu ql rc s3 su tl uc v3 vu wl xc y3 yu zl
1a a8 az bq ch d8 dz eq fh g8 gz hq ih j8 jz kq lh m8 mz nq oh p8 pv qm rd s4 sv tm ud v4 vv wm xd y4 yv zm
1b a9 b0 br ci d9 e0 er fi g9 h0 hr ii j9 k0 kr li m9 n0 nr oi p9 pw qn re s5 sw tn ue v5 vw wn xe y5 yw zn
1g aa b1 bs cj da e1 es fj ga h1 hs ij ja k1 ks lj ma n1 ns oj pa px qo rf s6 sx to uf v6 vx wo xf y6 yx zo
2a ab b2 bt ck db e2 et fk gb h2 ht ik jb k2 kt lk mb n2 nt ok pairs-postprocessed-test.txt py qp rg s7 sy tp ug v7 vy wp xg y7 yy zp
2c ac b3 bu cl dc e3 eu fl gc h3 hu il jc k3 ku ll mc n3 nu ol pairs-postprocessed-train.txt pz qq rh s8 sz tq uh v8 vz wq xh y8 yz zq
2e ad b4 bv cm dd e4 ev fm gd h4 hv im jd k4 kv lm md n4 nv om pairs-postprocessed.txt q0 qr ri s9 t0 tr ui v9 w0 wr xi y9 z0 zr
2g ae b5 bw cn de e5 ew fn ge h5 hw in je k5 kw ln me n5 nw on pairs-postprocessed-val.txt q1 qs rj sa t1 ts uj va w1 ws xj ya z1 zs
3c af b6 bx co df e6 ex fo gf h6 hx io jf k6 kx lo mf n6 nx oo pb q2 qt rk sb t2 tt uk vb w2 wt xk yb z2 zt
3g ag b7 by cp dg e7 ey fp gg h7 hy ip jg k7 ky lp mg n7 ny op pc q3 qu rl sc t3 tu ul vc w3 wu xl yc z3 zu
48 ah b8 bz cq dh e8 ez fq gh h8 hz iq jh k8 kz lq mh n8 nz oq pd q4 qv rm sd t4 tv um vd w4 wv xm yd z4 zv
4g ai b9 c0 cr di e9 f0 fr gi h9 i0 ir ji k9 l0 lr mi n9 o0 or pe q5 qw rn se t5 tw un ve w5 ww xn ye z5 zw
56 aj ba c1 cs dj ea f1 fs gj ha i1 is jj ka l1 ls mj na o1 os pf q6 qx ro sf t6 tx uo vf w6 wx xo yf z6 zx
5c ak bb c2 ct dk eb f2 ft gk hb i2 it jk kb l2 lt mk nb o2 ot pg q7 qy rp sg t7 ty up vg w7 wy xp yg z7 zy
6g al bc c3 cu dl ec f3 fu gl hc i3 iu jl kc l3 lu ml nc o3 ou ph q8 qz rq sh t8 tz uq vh w8 wz xq yh z8 zz
7g am bd c4 cv dm ed f4 fv gm hd i4 iv jm kd l4 lv mm nd o4 ov pi q9 r0 rr si t9 u0 ur vi w9 x0 xr yi z9
87 an be c5 cw dn ee f5 fw gn he i5 iw jn ke l5 lw mn ne o5 ow pj qa r1 rs sj ta u1 us vj wa x1 xs yj za
8g ao bf c6 cx do ef f6 fx go hf i6 ix jo kf l6 lx mo nf o6 ox pk qb r2 rt sk tb u2 ut vk wb x2 xt yk zb
9g ap bg c7 cy dp eg f7 fy gp hg i7 iy jp kg l7 ly mp ng o7 oy pl qc r3 ru sl tc u3 uu vl wc x3 xu yl zc
9h aq bh c8 cz dq eh f8 fz gq hh i8 iz jq kh l8 lz mq nh o8 oz pm qd r4 rv sm td u4 uv vm wd x4 xv ym zd
a0 ar bi c9 d0 dr ei f9 g0 gr hi i9 j0 jr ki l9 m0 mr ni o9 p0 pn qe r5 rw sn te u5 uw vn we x5 xw yn ze
a1 as bj ca d1 ds ej fa g1 gs hj ia j1 js kj la m1 ms nj oa p1 po qf r6 rx so tf u6 ux vo wf x6 xx yo zf
a2 at bk cb d2 dt ek fb g2 gt hk ib j2 jt kk lb m2 mt nk ob p2 pp qg r7 ry sp tg u7 uy vp wg x7 xy yp zg
a3 au bl cc d3 du el fc g3 gu hl ic j3 ju kl lc m3 mu nl oc p3 pq qh r8 rz sq th u8 uz vq wh x8 xz yq zh
a4 av bm cd d4 dv em fd g4 gv hm id j4 jv km ld m4 mv nm od p4 pr qi r9 s0 sr ti u9 v0 vr wi x9 y0 yr zi
a5 aw bn ce d5 dw en fe g5 gw hn ie j5 jw kn le m5 mw nn oe p5 ps qj ra s1 ss tj ua v1 vs wj xa y1 ys zj
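If you want to inspect the splits, here is a minimal reader for the pairs-postprocessed-*.txt files (ours, not part of the repository; it only assumes one pair entry per line):

# Count how many pairs fall into each split (assumes one pair entry per line).
from pathlib import Path

pruned = Path('./DIPS/data/DIPS/interim/pairs-pruned')
for split in ('train', 'val', 'test'):
    lines = (pruned / f'pairs-postprocessed-{split}.txt').read_text().splitlines()
    entries = [ln.strip() for ln in lines if ln.strip()]
    print(split, len(entries))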
Then preprocess the raw data as follows to prepare data for rigid body docking:
# prepare data for rigid body docking
python preprocess_raw_data.py -n_jobs 60 -data dips -graph_nodes residues -graph_cutoff 30 -graph_max_neighbor 10 -graph_residue_loc_is_alphaC -pocket_cutoff 8 -data_fraction 1.0
You should now obtain the following cache data directory:
$ ls cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/
label_test.pkl ligand_graph_val.bin receptor_graph_frac_1.0_train.bin
label_val.pkl ligand_graph_frac_1.0_train.bin receptor_graph_test.bin
label_frac_1.0_train.pkl ligand_graph_test.bin receptor_graph_val.bin
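To sanity-check the cache: the .bin files are DGL graph archives and the .pkl files are pickled labels. A minimal load (ours; the exact label layout is an assumption) looks like:

# Load one cached split and report how many graphs it contains.
import pickle
from dgl.data.utils import load_graphs

cache = 'cache/dips_residues_maxneighbor_10_cutoff_30.0_pocketCut_8.0/cv_0/'
ligand_graphs, _ = load_graphs(cache + 'ligand_graph_test.bin')
receptor_graphs, _ = load_graphs(cache + 'receptor_graph_test.bin')
with open(cache + 'label_test.pkl', 'rb') as f:
    labels = pickle.load(f)
print(len(ligand_graphs), 'ligand graphs,', len(receptor_graphs), 'receptor graphs')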
Training
On GPU (training also works on CPU, but it is very slow):
CUDA_VISIBLE_DEVICES=0 python -m src.train -hyper_search
or just specify your own parameters if you don't want to run a hyperparameter search. This will create checkpoints and tensorboard logs (which you can visualize with tensorboard), and will store all stdout/stderr in a log file. It will train a model on DIPS first and then fine-tune it on DB5 (a generic sketch of this pretrain/fine-tune pattern follows). Use -toy to train on DB5 only.
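The DIPS-then-DB5 schedule is ordinary pretraining followed by fine-tuning. Conceptually (a generic PyTorch sketch with a stand-in model, not the actual src/train.py loop; the path and learning rate are made up):

# Generic pretrain/fine-tune pattern (illustrative only).
import torch
import torch.nn as nn

model = nn.Linear(3, 3)  # stand-in for the real EquiDock model
# 1) pretrain on DIPS, then save a checkpoint
torch.save(model.state_dict(), 'dips_pretrained.pth')  # hypothetical path
# 2) reload the DIPS weights and fine-tune on DB5, typically with a smaller learning rate
model.load_state_dict(torch.load('dips_pretrained.pth'))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)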
Data splits
In our paper, we used the train/validation/test splits given by the files
DIPS: DIPS/data/DIPS/interim/pairs-pruned/pairs-postprocessed-*.txt
DB5: data/benchmark5.5/cv/cv_0/*.txt
Inference
See inference_rigid.py.
Pretrained models
Our paper's pretrained models are available in the checkpts/ folder. By loading them (as in inference_rigid.py), you can also see which hyperparameters were used for those models (or read them directly from the checkpoint file names).
Test and reproduce paper's numbers
Test sets used in our paper are given in test_sets_pdb/. Ground-truth (bound) structures are in test_sets_pdb/dips_test_random_transformed/complexes/, while unbound structures (i.e., randomly rotated and translated ligands and receptors) are in test_sets_pdb/dips_test_random_transformed/random_transformed/. You should use precisely those files for your predictions (or at least the ligands, while using the ground-truth receptors as we do in inference_rigid.py). This test set was originally generated as a randomly sampled family-based subset of the complexes in ./DIPS/data/DIPS/interim/pairs-pruned/pairs-postprocessed-test.txt, using src/test_all_methods/testset_random_transf.py.
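The random_transformed inputs were produced by applying a random rigid motion to each body. A minimal version of such a transform (ours, using only numpy; testset_random_transf.py is the script actually used, and the translation range here is an arbitrary choice):

# Apply a random rotation and translation to one body's coordinates.
import numpy as np

def random_rotation():
    # QR decomposition of a Gaussian matrix gives a uniformly random orthogonal matrix;
    # the sign fixes make it a proper rotation (determinant +1).
    Q, R = np.linalg.qr(np.random.randn(3, 3))
    Q = Q * np.sign(np.diag(R))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

def random_rigid_transform(coords, max_translation=10.0):
    # coords: (N, 3) atom coordinates of the ligand or receptor
    R = random_rotation()
    t = np.random.uniform(-max_translation, max_translation, size=3)
    center = coords.mean(axis=0)
    return (coords - center) @ R.T + center + t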
Run python -m src.inference_rigid to produce EquiDock's outputs for all test files. This will create a new directory of PDB output files in test_sets_pdb/.
Get the RMSD numbers from our paper using python -m src.test_all_methods.eval_pdb_outputset. You can use this script to evaluate all other baselines as well; the baselines' output PDB files are also provided in test_sets_pdb/.
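For reference, the core metric is the root-mean-square deviation between matched predicted and ground-truth coordinates, computed after superimposing the two complexes. A minimal version (ours, for illustration; the paper's numbers come from the script above):

# RMSD between matched coordinate sets, with a Kabsch superposition helper.
import numpy as np

def kabsch_align(P, Q):
    # Optimal rigid alignment of P onto Q (Kabsch algorithm).
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return Pc @ R.T + Q.mean(axis=0)

def rmsd(P, Q):
    # P, Q: (N, 3) matched coordinates
    return float(np.sqrt(np.mean(np.sum((P - Q) ** 2, axis=-1))))

# Complex RMSD: superimpose the full predicted complex on the ground truth first, e.g.
# print(rmsd(kabsch_align(pred_coords, true_coords), true_coords))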