GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation

[OpenReview] [arXiv] [Code]

The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 Oral Presentation [54/3391]).

Environments

Install via Conda (Recommended)

# Clone the environment
conda env create -f env.yml
# Activate the environment
conda activate geodiff
# Install PyG
conda install pytorch-geometric=1.7.2=py37_torch_1.8.0_cu102 -c rusty1s -c conda-forge

Dataset

Official Dataset

The official raw GEOM dataset is available [here].

Preprocessed dataset

We provide the preprocessed datasets (GEOM) in this [google drive folder]. After downloading the dataset, put it in the folder specified by the dataset variable of the config files ./configs/*.yml.

Prepare your own GEOM dataset from scratch (optional)

You can also download the original full GEOM dataset and prepare your own data split. A guide is available on the [github page] of the previous work ConfGF.

Training

All hyper-parameters and training details are provided in the config files (./configs/*.yml); feel free to tune these parameters.

You can train the model with the following commands:

# Default settings
python train.py ./config/qm9_default.yml
python train.py ./config/drugs_default.yml
# An ablation setting with fewer timesteps, as described in Appendix D.2.
python train.py ./config/drugs_1k_default.yml

The model checkpoints, the configuration YAML file, and the training log will be saved to the directory specified by --logdir in train.py.

Generation

We provide checkpoints of two trained models, qm9_default and drugs_default, in the [google drive folder]. Please put the checkpoints *.pt into paths like ${log}/${model}/checkpoints/, and put the corresponding configuration file *.yml into the parent directory ${log}/${model}/.
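The directory layout described above can be sketched with a small helper. This is only an illustration: the function name install_checkpoint, its arguments, and the placeholder file names are hypothetical and not part of the repository; only the target layout (${log}/${model}/checkpoints/*.pt and ${log}/${model}/*.yml) comes from the text.

```python
import shutil
import tempfile
from pathlib import Path


def install_checkpoint(log_dir, model_name, ckpt_file, config_file):
    """Copy a checkpoint and its config into <log>/<model>/ (hypothetical helper)."""
    model_dir = Path(log_dir) / model_name
    ckpt_dir = model_dir / "checkpoints"
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    # The checkpoint goes into <log>/<model>/checkpoints/
    ckpt_dst = Path(shutil.copy2(ckpt_file, ckpt_dir))
    # The config goes one level up, into <log>/<model>/
    cfg_dst = Path(shutil.copy2(config_file, model_dir))
    return ckpt_dst, cfg_dst


# Demonstration with empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    downloads = tmp / "downloads"
    downloads.mkdir()
    ckpt_src = downloads / "example.pt"       # placeholder checkpoint name
    cfg_src = downloads / "qm9_default.yml"   # placeholder config name
    ckpt_src.touch()
    cfg_src.touch()

    ckpt, cfg = install_checkpoint(tmp / "logs", "qm9_default", ckpt_src, cfg_src)
    layout = sorted(p.relative_to(tmp / "logs").as_posix() for p in (ckpt, cfg))
    print(layout)
```

With this layout, ${log} corresponds to the logs directory and ${model} to qm9_default in the commands that follow.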

Attention: if you want to use the pretrained models, please use the code on the pretrain branch, which is the vanilla codebase for reproducing the results with our pretrained models. We recently noticed some issues in the codebase and updated it, so the main branch is not fully compatible with the previous checkpoints.

You can generate conformations for all or part of the test set by:

python test.py ${log}/${model}/checkpoints/${iter}.pt \
    --start_idx 800 --end_idx 1000

Here start_idx and end_idx indicate the range of the test set that we want to use. All hyper-parameters related to sampling can be set in test.py. Specifically, when testing the qm9 model, you can add the additional argument --w_global 0.3, which empirically gives slightly better results.

Conformations of some drug-like molecules generated by GeoDiff are provided below.

Evaluation

After generating conformations with the above commands, the results of all benchmark tasks can be calculated from the generated data.

Task 1. Conformation Generation

The COV and MAT scores on the GEOM datasets can be calculated using the following command:

python eval_covmat.py ${log}/${model}/${sample}/sample_all.pkl
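To make the two scores concrete, here is a minimal sketch of how COV and MAT are commonly defined in the conformation generation literature: COV is the fraction of reference conformers matched by at least one generated conformer within an RMSD threshold, and MAT is the mean of the smallest RMSD per reference conformer. The function name, the toy RMSD matrix, and the threshold value are illustrative assumptions; eval_covmat.py may differ in details such as how RMSD is computed and aligned.

```python
def cov_mat(rmsd, delta):
    """Compute coverage (COV) and matching (MAT) from an RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformer i and generated
    conformer j. Both the name and the signature are hypothetical.
    """
    # Best (smallest) RMSD achieved for each reference conformer.
    best = [min(row) for row in rmsd]
    # COV: fraction of reference conformers covered within the threshold.
    cov = sum(d <= delta for d in best) / len(best)
    # MAT: average of the best RMSD values.
    mat = sum(best) / len(best)
    return cov, mat


# Toy example: 3 reference conformers, 3 generated conformers.
rmsd = [
    [0.3, 0.9, 1.2],
    [0.8, 0.7, 0.6],
    [1.5, 1.1, 0.9],
]
cov, mat = cov_mat(rmsd, delta=0.5)  # delta chosen only for illustration
print(f"COV={cov:.3f} MAT={mat:.3f}")  # COV=0.333 MAT=0.600
```

Higher COV and lower MAT indicate that the generated set covers the reference conformers better.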

Task 2. Property Prediction

For property prediction, we use a small split of qm9 that differs from the one used in the Conformation Generation task. This split is also provided in the [google drive folder]. Generating conformations and evaluating the mean absolute error (MAE) metric on this split can be done with the following commands:

python test.py ${log}/${model}/checkpoints/${iter}.pt --num_confs 50 \
    --start_idx 0 --test_set data/GEOM/QM9/qm9_property.pkl
python eval_prop.py --generated ${log}/${model}/${sample}/sample_all.pkl

Visualizing molecules with PyMol

Here we also provide a guideline for visualizing molecules with PyMol. The guideline is borrowed from previous work ConfGF's [github page].

Start Setup

1. pymol -R
2. Display - Background - White
3. Display - Color Space - CMYK
4. Display - Quality - Maximal Quality
5. Display Grid
   1. by object: use set grid_slot, int, mol_name to put the molecule into the corresponding slot
   2. by state: align all conformations in a single slot
   3. by object-state: align all conformations and put them in separate slots. (grid_slot doesn't work!)
6. Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5
7. Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0

Show Molecule

1. To show molecules
   1. hide everything
   2. show sticks
2. To align molecules: align name1, name2
3. Convert RDKit mol to PyMol

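# Requires a PyMOL server started with `pymol -R` (see Start Setup)
from rdkit import Chem
from rdkit.Chem import PyMol

v = PyMol.MolViewer()  # connect to the running PyMOL server
rdmol = Chem.MolFromSmiles('C')
v.ShowMol(rdmol, name='mol')  # send the molecule to PyMOL as object 'mol'
v.SaveFile('mol.pkl')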

Citation

Please consider citing our paper if you find it helpful. Thank you!

@inproceedings{
xu2022geodiff,
title={GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation},
author={Minkai Xu and Lantao Yu and Yang Song and Chence Shi and Stefano Ermon and Jian Tang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=PzcvxEMzvQC}
}


Acknowledgement

This repo is built upon the previous work ConfGF's [codebase]. Thanks Chence and Shitong!

Known issues

1. The current codebase is not compatible with more recent torch-geometric versions.
2. The current processed dataset (with PyG data object) is not compatible with more recent torch-geometric versions.
• Question about the theory in the paper

Hi Xu,

First of all, thanks for your nice work. I've read your paper, and I have some questions about the proof of the equivariance of the transition kernel. In detail, suppose $\mathcal{C}^t$ is roto-translation invariant, and (thus) $\mu_{\theta}(\mathcal{C}^t, \mathcal{G}, t)$ is roto-translation equivariant with the designed GNN; we need to prove that $p(\mathcal{C}^{t-1} | \mathcal{C}^{t}, \mathcal{G}, t)$ is equivariant. I wonder if it is due to the following derivation:

$$
\begin{aligned}
p(R \mathcal{C}^{t-1} + g \mid R \mathcal{C}^{t} + g, \mathcal{G}, t)
&= \mathcal{N}(R\mathcal{C}^{t-1} + g; \boldsymbol{\mu}_{\theta}(R\mathcal{C}^t + g, \mathcal{G}, t), \sigma_t^2 \mathbf{I}) \\
&= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{\frac{1}{2}}} e^{-\frac{1}{2}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))^T \boldsymbol{\Sigma}^{-1}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))} \\
&= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{\frac{1}{2}}} e^{-\frac{1}{2}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)^T \boldsymbol{\Sigma}^{-1}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)}
\end{aligned}
$$

where $\boldsymbol{\Sigma} = \sigma_t^2 \mathbf{I}$. I am not sure whether this is correct and hope to receive your clarification. Thanks.

opened by Frankie123421 12
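The key step in the derivation above, that an isotropic Gaussian density is unchanged when the sample and the mean undergo the same rotation and translation, can be checked numerically. The sketch below is a plain-Python illustration for a single 3D point with a rotation about the z-axis; all function names and numeric values are assumptions made for this example, and it presumes that the mean transforms equivariantly, as the question states.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3D point about the z-axis (one example of a rotation R)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = v
    return (c * x - s * y, s * x + c * y, z)


def translate(v, g):
    """Shift a point by the translation vector g."""
    return tuple(a + b for a, b in zip(v, g))


def log_isotropic_gaussian(x, mean, sigma):
    """Log density of N(x; mean, sigma^2 I)."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * d * math.log(2 * math.pi * sigma ** 2) - sq_dist / (2 * sigma ** 2)


x = (0.4, -1.3, 2.0)    # plays the role of C^{t-1}
mu = (0.1, -1.0, 1.5)   # plays the role of mu_theta(C^t, G, t)
g = (3.0, -2.0, 0.5)    # translation
angle, sigma = 0.7, 0.8

before = log_isotropic_gaussian(x, mu, sigma)
# Apply the same rotation and translation to the sample and to the mean.
after = log_isotropic_gaussian(
    translate(rotate_z(x, angle), g),
    translate(rotate_z(mu, angle), g),
    sigma,
)
print(math.isclose(before, after, rel_tol=1e-9))  # True
```

The translation cancels in the difference between sample and mean, and the rotation preserves the squared distance, which is why the two log densities agree.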
• Provide data process script

Could you please provide the data processing script? I found that ConfGF's code does not produce features such as num_graphs, atom_type, pos, bond_index, etc.

opened by Layne-Huang 4
• Question about GFN and GeoDiff-A Implementation

Hi authors,

Thank you for sharing the code of your great work! According to your paper, GeoDiff has two different versions, GeoDiff-A and GeoDiff-C, but it seems that only the code of GeoDiff-C is provided. In addition, the paper mentions that a graph field network (GFN) is used, but I cannot find the code of GFN, while SchNet seems to play the role of GFN in the current implementation.

I am wondering whether the implementations of GFN and GeoDiff-A will be provided in the future. Thanks!

Best,

opened by lyzustc 2

Hi, thanks for sharing this very good work!

I found that the google drive link of the preprocessed dataset cannot be used, maybe because of a permission issue. Could you double-check? Thanks

opened by klightz 2
• pos_ref and 3D visualization issues

Dear Minkai,

I am sorry to bother you. It seems that you did not use the time step t as a parameter in the diffusion training process. Why did you not use beta or alpha with time step t in the training process?

opened by Layne-Huang 0
• protein conformation

Hi! Thanks for the nice paper. In the paper you mentioned protein conformation is a difficult task with this model. Can you explain why linearity of proteins can be an issue? What could be some improvements?

opened by orgw 1
• Typo in get_edge_encoder

The function found here uses the full variable config which is not passed in for the type gaussian. I believe it should be cfg, but I guess it is never used.

opened by natolambert 1
• Disagreement with paper

I suspect that the code is still based on score-matching methods. First, in models/epsnet/dualenc.py, line 478, the noisy sample in the forward diffusion process differs from Eq. (4) in DDPM and from the equation with subscript 2 in the paper, so the noise calculated from d_gt and d_perturbed is hard to understand. Second, the encoder's forward method does not embed time_step as an input when calculating the noise. Third, in the langevin_dynamics_sample_diffusion method, the sampling process also differs from Algorithm 1 in the paper. Why is the step_size formulated as shown in line 443? Can you give me more details on the implementation? Maybe the code is based on an improved version of the original DDPM, such as a score-based one? Where can I find more material to understand the parts of your code that differ?

opened by BIRD-TAO 6
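For reference, Eq. (4) of DDPM mentioned in the question above is the closed-form forward noising step $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$ with $\bar{\alpha}_t = \prod_{s \le t} (1 - \beta_s)$. The following is a minimal plain-Python sketch of that step, assuming the linear beta schedule (1e-4 to 0.02 over 1000 steps) from the DDPM paper rather than anything taken from this repository's configs. The function names are illustrative, and this is not the code in models/epsnet/dualenc.py.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (values assumed from the DDPM paper)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    a = abar[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, noise)]
    return xt, noise


abar = alpha_bars(linear_betas(1000))
x0 = [0.5, -1.0, 2.0]  # toy coordinates, for illustration only
xt, eps = q_sample(x0, t=500, abar=abar, rng=random.Random(0))

# Given the noise, x_0 can be recovered exactly from x_t, which is what
# makes noise prediction a valid training target in DDPM.
a = abar[500]
x0_rec = [(x - math.sqrt(1.0 - a) * e) / math.sqrt(a) for x, e in zip(xt, eps)]
```

Comparing this step with how the repository perturbs the data may help locate where the implementation departs from the original DDPM formulation.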
• Why use AddHigherOrderEdges() in sampling but not in training?

Dear Minkai,

I noticed that you used the AddHigherOrderEdges() transformation when preparing the dataset for sampling, but you did not apply it in training. Why did you use this transformation, and why did you not use it in the training process?

Thank you very much!

opened by Layne-Huang 6
• Using GeoDiff on mac

Hello, great work and thanks for sharing the code. However, when I try to install the environment on a mac, it is not compatible, it seems to work only on linux. Is there any chance you could provide an environment for mac? Thanks!

opened by danielm322 1
• I got errors when training the code

Hi, Minkai, I use your data generation code to generate datasets with a new version of pyg, however, I got the following error when training, do you have any idea?

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1479, 128]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

opened by nickspark 8