Implementation of GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022).

Minkai Xu

Last update: Dec 26, 2022

Related tags

Deep Learning molecule computational-biology computational-chemistry conformation iclr generative-models graph-neural-networks diffusion-models score-matching iclr2022

Overview

GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

[OpenReview] [arXiv] [Code]

The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 Oral Presentation [54/3391]).

Environments

Install via Conda (Recommended)

# Clone the environment
conda env create -f env.yml
# Activate the environment
conda activate geodiff
# Install PyG
conda install pytorch-geometric=1.7.2=py37_torch_1.8.0_cu102 -c rusty1s -c conda-forge

Dataset

Official Dataset

The official raw GEOM dataset is available [here].

Preprocessed dataset

We provide the preprocessed datasets (GEOM) in this [google drive folder]. After downloading the dataset, put it at the path specified by the dataset variable in the config files ./configs/*.yml.

Prepare your own GEOM dataset from scratch (optional)

You can also download the original full GEOM dataset and prepare your own data split. A guide is available on the [github page] of the previous work ConfGF.

Training

All hyper-parameters and training details are provided in the config files (./configs/*.yml); feel free to tune these parameters.

You can train the model with the following commands:

# Default settings
python train.py ./configs/qm9_default.yml
python train.py ./configs/drugs_default.yml
# An ablation setting with fewer timesteps, as described in Appendix D.2.
python train.py ./configs/drugs_1k_default.yml

The model checkpoints, the configuration yaml file, and the training log will be saved into the directory specified by --logdir in train.py.

Generation

We provide checkpoints of two trained models, qm9_default and drugs_default, in the [google drive folder]. Put the checkpoints *.pt into paths like ${log}/${model}/checkpoints/, and put the corresponding configuration file *.yml into the parent directory ${log}/${model}/.
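The directory layout described above can be set up by hand or with a few lines of Python. The following is a minimal sketch, not part of the repository: the helper name stage_checkpoint, the logs root, and the file names in the demo are hypothetical and only illustrate where the two files end up.

```python
import shutil
import tempfile
from pathlib import Path


def stage_checkpoint(log_root, model_name, ckpt_file, config_file):
    """Copy a checkpoint and its config into the expected layout.

    Resulting layout:
        <log_root>/<model_name>/<config file name>
        <log_root>/<model_name>/checkpoints/<checkpoint file name>
    """
    model_dir = Path(log_root) / model_name
    ckpt_dir = model_dir / "checkpoints"
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    ckpt_dst = ckpt_dir / Path(ckpt_file).name
    cfg_dst = model_dir / Path(config_file).name
    shutil.copy2(ckpt_file, ckpt_dst)
    shutil.copy2(config_file, cfg_dst)
    return ckpt_dst, cfg_dst


# Demo with empty placeholder files (hypothetical names) in a temp directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src_ckpt = tmp / "3000000.pt"
    src_cfg = tmp / "qm9_default.yml"
    src_ckpt.touch()
    src_cfg.touch()
    ckpt_dst, cfg_dst = stage_checkpoint(tmp / "logs", "qm9_default", src_ckpt, src_cfg)
    print(ckpt_dst.relative_to(tmp))
    print(cfg_dst.relative_to(tmp))
```

The path returned for the checkpoint is the one that would be passed to test.py.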

Attention: if you want to use the pretrained models, please use the code on the pretrain branch, which is the vanilla codebase for reproducing the results with our pretrained models. We recently noticed some issues in the codebase and updated it, so the main branch is no longer fully compatible with the previous checkpoints.

You can generate conformations for all or part of the test set with:

python test.py ${log}/${model}/checkpoints/${iter}.pt \
    --start_idx 800 --end_idx 1000

Here start_idx and end_idx indicate the range of the test set that we want to use. All hyper-parameters related to sampling can be set in test.py. Specifically, when testing the qm9 model, you can add the argument --w_global 0.3, which empirically gives slightly better results.
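To make the meaning of start_idx and end_idx concrete, here is a small sketch of index-range selection. It assumes a half-open range (start_idx included, end_idx excluded), which is the usual Python convention; whether test.py treats end_idx exactly this way should be checked in the script. The helper select_range and the placeholder test set are hypothetical.

```python
def select_range(test_set, start_idx, end_idx):
    """Keep the items whose position i satisfies start_idx <= i < end_idx."""
    return [item for i, item in enumerate(test_set) if start_idx <= i < end_idx]


# Hypothetical test set of 1000 placeholder entries.
test_set = [f"mol_{i}" for i in range(1000)]
subset = select_range(test_set, 800, 1000)
print(len(subset), subset[0], subset[-1])  # 200 mol_800 mol_999
```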

Conformations of some drug-like molecules generated by GeoDiff are provided below.

Evaluation

After generating conformations with the above commands, the results of all benchmark tasks can be calculated from the generated data.

Task 1. Conformation Generation

The COV and MAT scores on the GEOM datasets can be calculated using the following command:

python eval_covmat.py ${log}/${model}/${sample}/sample_all.pkl
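For intuition about what the two scores measure, the sketch below computes recall-style COV and MAT from a precomputed RMSD matrix. It assumes the usual definitions (COV is the fraction of reference conformations matched by at least one generated conformation within a threshold delta, MAT is the mean of the per-reference minimum RMSD). It is not the code in eval_covmat.py; the function name, the threshold, and the toy matrix are made up for illustration.

```python
import numpy as np


def cov_mat_recall(rmsd, delta):
    """Recall-style coverage and matching scores.

    rmsd[i, j] is the RMSD between reference conformer i and
    generated conformer j (hypothetical precomputed input).
    """
    rmsd = np.asarray(rmsd, dtype=float)
    min_per_ref = rmsd.min(axis=1)            # best match for each reference
    cov = float((min_per_ref <= delta).mean())  # fraction of references covered
    mat = float(min_per_ref.mean())           # mean best-match RMSD
    return cov, mat


# Toy example: 2 reference conformers, 3 generated conformers.
rmsd = np.array([
    [0.2, 0.9, 1.4],
    [0.8, 0.7, 1.1],
])
cov, mat = cov_mat_recall(rmsd, delta=0.5)
print(f"COV = {cov:.2f}, MAT = {mat:.2f}")  # COV = 0.50, MAT = 0.45
```

Higher COV and lower MAT both indicate that the generated set reproduces the reference conformations more closely.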

Task 2. Property Prediction

For property prediction, we use a small split of qm9 that differs from the one used in the Conformation Generation task. This split is also provided in the [google drive folder]. Generating conformations and evaluating the mean absolute error (MAE) metric on this split can be done with the following commands:

python test.py ${log}/${model}/checkpoints/${iter}.pt --num_confs 50 \
      --start_idx 0 --test_set data/GEOM/QM9/qm9_property.pkl
python eval_prop.py --generated ${log}/${model}/${sample}/sample_all.pkl

Visualizing molecules with PyMol

Here we also provide a guideline for visualizing molecules with PyMol. The guideline is borrowed from previous work ConfGF's [github page].

Start Setup

pymol -R
Display - Background - White
Display - Color Space - CMYK
Display - Quality - Maximal Quality
Display Grid
1. by object: use set grid_slot, int, mol_name to put the molecule into the corresponding slot
2. by state: align all conformations in a single slot
3. by object-state: align all conformations and put them in separate slots. (grid_slot doesn't work!)
Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5
Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0

Show Molecule

To show molecules
1. hide everything
2. show sticks
To align molecules: align name1, name2

Convert RDKit mol to Pymol

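from rdkit import Chem
from rdkit.Chem import PyMol
# Connect to the PyMol server started with `pymol -R`
v = PyMol.MolViewer()
rdmol = Chem.MolFromSmiles('C')
# Send the molecule to PyMol under the object name 'mol'
v.ShowMol(rdmol, name='mol')
v.SaveFile('mol.pkl')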

Citation

Please consider citing our paper if you find it helpful. Thank you!

@inproceedings{
xu2022geodiff,
title={GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation},
author={Minkai Xu and Lantao Yu and Yang Song and Chence Shi and Stefano Ermon and Jian Tang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=PzcvxEMzvQC}
}

Acknowledgement

This repo is built upon the previous work ConfGF's [codebase]. Thanks Chence and Shitong!

Contact

If you have any questions, please contact me at [email protected] or [email protected].

Known issues

The current codebase is not compatible with more recent torch-geometric versions.
The current processed dataset (with PyG data objects) is not compatible with more recent torch-geometric versions.

Comments

Question about the theory in the paper

Hi Xu,

First of all, thanks for your nice work. I've read your paper, and I have some questions on the proof of the equivariance of the transition kernel. In detail, suppose $\mathcal{C}^t$ is roto-translation invariant, and (thus) $\mu_{\theta}(\mathcal{C}^t, \mathcal{G}, t)$ is roto-translation equivariant with the designed GNN; we need to prove that $p(\mathcal{C}^{t-1} | \mathcal{C}^{t}, \mathcal{G}, t)$ is equivariant. I wonder if it is due to the following derivation:

$$\begin{aligned} p(R \mathcal{C}^{t-1} + g \mid R \mathcal{C}^{t} + g, \mathcal{G}, t) &= \mathcal{N}(R\mathcal{C}^{t-1} + g; \boldsymbol{\mu}_{\theta}(R\mathcal{C}^t + g, \mathcal{G}, t), \sigma_t^2 \mathbf{I}) \\ &= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{\frac{1}{2}}} e^{-\frac{1}{2}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))^T \boldsymbol{\Sigma}^{-1}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))} \\ &= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{\frac{1}{2}}} e^{-\frac{1}{2}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)^T \boldsymbol{\Sigma}^{-1}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)} \end{aligned}$$

where $\boldsymbol{\Sigma} = \sigma_t^2 \mathbf{I}$. I am not sure if it's correct, hope to receive your clarification. Thanks.
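The last equality in this derivation can be checked numerically. The sketch below is only an illustration, not code from the repository: it takes the equivariance of the mean for granted by applying the same rotation $R$ and translation $g$ to both the sample and the mean, and then compares the isotropic Gaussian log-density before and after the transformation. All names and the sizes used are hypothetical.

```python
import numpy as np


def log_isotropic_gaussian(x, mean, sigma):
    """Log-density of N(x; mean, sigma^2 I) for x, mean of shape (n_atoms, 3)."""
    d = x.size
    sq_dist = np.sum((x - mean) ** 2)
    return -0.5 * sq_dist / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)


def random_rotation(rng):
    """Random 3x3 rotation (orthogonal, det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))   # fix the sign ambiguity of each column
    if np.linalg.det(q) < 0:      # turn a reflection into a proper rotation
        q[:, 0] = -q[:, 0]
    return q


rng = np.random.default_rng(0)
n_atoms, sigma = 5, 0.3
mean = rng.normal(size=(n_atoms, 3))              # stands in for mu_theta(C^t, G, t)
x = mean + sigma * rng.normal(size=(n_atoms, 3))  # stands in for C^{t-1}
R = random_rotation(rng)
g = rng.normal(size=3)

# Transform sample and mean jointly; rows are atoms, so R acts as x @ R.T.
lhs = log_isotropic_gaussian(x @ R.T + g, mean @ R.T + g, sigma)
rhs = log_isotropic_gaussian(x, mean, sigma)
print(np.isclose(lhs, rhs))
```

The translation cancels in the difference between sample and mean, and the rotation drops out because it preserves Euclidean norms and the covariance is a multiple of the identity.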

opened by Frankie123421 12

Provide data process script

Could you please provide the data processing script? I found that ConfGF's code does not produce features such as num_graphs, atom_type, pos, bond_index, etc.

opened by Layne-Huang 4

Question about GFN and GeoDiff-A Implementation

Hi authors,

Thank you for sharing the code of your great work! According to your paper, GeoDiff has two different versions, GeoDiff-A and GeoDiff-C, but it seems that only the code of GeoDiff-C is provided. In addition, the paper mentions that a graph field network (GFN) is used, but I cannot find the code for GFN; SchNet seems to play the role of GFN in the current implementation.

I am wondering whether the implementations of GFN and GeoDiff-A will be provided in the future? Thanks!

Best,

opened by lyzustc 2

Cannot open the google drive download link for preprocessed dataset

Hi, thanks for sharing this very good work!

I found that the google drive link for the preprocessed dataset cannot be opened, possibly due to a permission issue. Could you double-check? Thanks

opened by klightz 2

$m_{ij}$ notation in GNN architecture

Hi Minkai Xu,

I think there is a flaw in your mathematical notation for the GNN, in Equations 5 and 6 in particular. The notation $m_{ij}$ in Equation 5 is not the same as $m_{ij}$ in Equation 6.

opened by CaptainCuong 1

pos_ref and 3D visualization issues

Dear Minkai,

I am sorry to bother you. It seems that you did not use the time step t as a parameter in the diffusion training process. Why did you not use beta or alpha at time step t in the training process?

opened by Layne-Huang 0

Questions about the rescale problem

Hi, Xu. Thanks for sharing the code. I've noticed the discussion here (https://github.com/MinkaiXu/GeoDiff/issues/11) and carefully read the code line by line. Just as you stated in issue 11, the "diffusion" process in the code is actually rescaled compared to the paper, i.e., $\mathcal{C}^t = \frac{1}{\sqrt{\alpha_t}}(\sqrt{\alpha_t}C^0 + \sqrt{1-\alpha_t}\epsilon)$. According to the ScoreSDE paper (https://arxiv.org/abs/2011.13456), DDPM is a variance preserving process and DSM is a variance exploding one. I think there may be some typos in your answer to issue 11 that cause a contradiction: "2) use the alpha to rescale the data to achieve variation preserving" and "the problem of variation preserving is: it will change the scale of coordinates". From my perspective, after rescaling, $\mathcal{C}^t = C^0 + \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\epsilon$ is a DSM process whose variance increases with $t$. So I am confused about why this rescaling would preserve the scale of the coordinates, since in my view it corrupts the scale (increases the variance) instead.
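The two parameterizations discussed in this question can be compared numerically. The sketch below assumes a linear beta schedule with illustrative endpoints and treats $\alpha_t$ as the cumulative product of $(1-\beta_s)$; neither choice is taken from the GeoDiff config files, and the variable names are hypothetical.

```python
import numpy as np

# Assumed linear beta schedule; the endpoints are illustrative only.
num_steps = 1000
betas = np.linspace(1e-4, 2e-2, num_steps)
alpha_bar = np.cumprod(1.0 - betas)  # plays the role of alpha_t in the formulas

# Variance preserving form: C^t = sqrt(alpha) * C^0 + sqrt(1 - alpha) * eps
vp_signal = np.sqrt(alpha_bar)
vp_noise_std = np.sqrt(1.0 - alpha_bar)

# Rescaled form: C^t = C^0 + sqrt(1 - alpha) / sqrt(alpha) * eps
rescaled_signal = np.ones_like(alpha_bar)
rescaled_noise_std = vp_noise_std / vp_signal

for t in (0, 249, 499, 999):
    print(
        f"t={t + 1:4d}  VP: signal={vp_signal[t]:.4f} noise_std={vp_noise_std[t]:.4f}"
        f"  rescaled: signal={rescaled_signal[t]:.1f} noise_std={rescaled_noise_std[t]:.4f}"
    )
```

Under these assumptions the rescaled form keeps the coefficient of $C^0$ fixed at 1 while its noise standard deviation grows monotonically with $t$, whereas in the variance preserving form the squared signal and noise coefficients always sum to 1.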

opened by Frankie123421 0

Global and Local structure

Hi Minkai Xu,

What was the motivation for designing two separate architectures to learn local and global structures? In the code, the loss is split into a global loss and a local loss: (node_eq_global - target_pos_global)**2 + (node_eq_local - target_pos_local)**2

opened by CaptainCuong 1

strange thing when sampling

Hi, Minkai @MinkaiXu. Recently I have been running some experiments based on GeoDiff and have run into some problems.

First, I re-trained the model with your original hyper-parameters. I notice that after around 800k iterations, the loss hardly decreases but oscillates wildly. I don't know whether the model truly benefits from the following 2M iterations of training, or why the loss oscillates so wildly.

Second, I tried to sample conformations from the checkpoints saved during my training run, but something very strange happened. Most of the molecules raise a FloatingPointError during sampling and are retried with local clipping. When I then use eval_covmat.py to evaluate the quality of the conformations, I get a COV value of 0.00 and MAT values in the thousands. This happens every time, except for one test with 700000.pt generating molecules 800 to 1000 (and when I repeated that test with the same parameters, the strange behavior occurred again). I don't know why it happens, because I ran your code without any modification. I also wonder why you designed the local clip operation and how to choose the corresponding hyper-parameter.

I would really appreciate any explanation or suggestions.

opened by mayz20 1

protein conformation

Hi! Thanks for the nice paper. In the paper you mentioned protein conformation is a difficult task with this model. Can you explain why linearity of proteins can be an issue? What could be some improvements?

opened by orgw 1

Typo in get_edge_encoder

The function found here uses the full variable config which is not passed in for the type gaussian. I believe it should be cfg, but I guess it is never used.

opened by natolambert 1

