# GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation

[OpenReview] [arXiv] [Code]

The official implementation of GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation (ICLR 2022 Oral Presentation [54/3391]).

## Environments

### Install via Conda (Recommended)

    # Clone the environment
    conda env create -f env.yml
    # Activate the environment
    conda activate geodiff
    # Install PyG
    conda install pytorch-geometric=1.7.2=py37_torch_1.8.0_cu102 -c rusty1s -c conda-forge

## Dataset

### Official Dataset

The official raw GEOM dataset is available [here].

### Preprocessed dataset

We provide the preprocessed datasets (GEOM) in this [google drive folder]. After downloading the dataset, put it into the folder path specified in the dataset variable of the config files ./configs/*.yml.

### Prepare your own GEOM dataset from scratch (optional)

You can also download the original full GEOM dataset and prepare your own data split. A guide is available at the previous work ConfGF's [github page].

## Training

All hyper-parameters and training details are provided in the config files (./configs/*.yml); feel free to tune these parameters.

You can train the model with the following commands:

    # Default settings
    python train.py ./configs/qm9_default.yml
    python train.py ./configs/drugs_default.yml
    # An ablation setting with fewer timesteps, as described in Appendix D.2.
    python train.py ./configs/drugs_1k_default.yml

The model checkpoints, the configuration YAML file, and the training log will be saved into the directory specified by --logdir in train.py.

## Generation

We provide the checkpoints of two trained models, i.e., qm9_default and drugs_default, in the [google drive folder]. Please put the checkpoints *.pt into paths like ${log}/${model}/checkpoints/, and put the corresponding configuration file *.yml into the parent directory ${log}/${model}/.
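The directory layout described above can be set up by hand, or with a small helper along the following lines. This is a minimal sketch, not part of the repository: the function name `place_pretrained` and its arguments are hypothetical, and it only copies files into `${log}/${model}/checkpoints/` and `${log}/${model}/`.

```python
import shutil
from pathlib import Path


def place_pretrained(log_dir, model_name, ckpt_path, config_path):
    """Copy a checkpoint and its config into the ${log}/${model}/ layout.

    Hypothetical helper (not part of GeoDiff): the checkpoint goes into
    ${log}/${model}/checkpoints/ and the config into ${log}/${model}/.
    """
    model_dir = Path(log_dir) / model_name
    ckpt_dir = model_dir / "checkpoints"
    ckpt_dir.mkdir(parents=True, exist_ok=True)

    ckpt_dst = ckpt_dir / Path(ckpt_path).name
    config_dst = model_dir / Path(config_path).name
    shutil.copy2(ckpt_path, ckpt_dst)
    shutil.copy2(config_path, config_dst)
    return ckpt_dst, config_dst
```

The returned checkpoint path is what would then be passed to test.py as its first argument.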

Attention: if you want to use the pretrained models, please use the code in the pretrain branch, which is the vanilla codebase for reproducing the results with our pretrained models. We recently noticed some issues in the codebase and updated it, so the main branch is not fully compatible with the previous checkpoints.

You can generate conformations for all or part of the test set with:

    python test.py ${log}/${model}/checkpoints/${iter}.pt \
        --start_idx 800 --end_idx 1000

Here start_idx and end_idx indicate the range of the test set that we want to use. All hyper-parameters related to sampling can be set in test.py. Specifically, for testing the qm9 model, you could add the additional argument --w_global 0.3, which empirically shows slightly better results.

Conformations of some drug-like molecules generated by GeoDiff are provided below.

## Evaluation

After generating conformations following the above commands, the results of all benchmark tasks can be calculated based on the generated data.

### Task 1. Conformation Generation

The COV and MAT scores on the GEOM datasets can be calculated using the following command:

    python eval_covmat.py ${log}/${model}/${sample}/sample_all.pkl
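To make the two metrics concrete, here is a minimal sketch of how coverage (COV) and matching (MAT) can be computed from a matrix of pairwise RMSD values. It assumes the recall-style definitions: COV is the fraction of reference conformations that have at least one generated conformation within an RMSD threshold, and MAT is the mean of the per-reference minimum RMSD. The function name `cov_mat`, the toy matrix, and the threshold value are illustrative assumptions; this is not the code in eval_covmat.py.

```python
def cov_mat(rmsd, threshold):
    """Compute recall-style COV and MAT from an RMSD matrix.

    rmsd[i][j] is the RMSD between reference conformation i and
    generated conformation j (illustrative values, in Angstrom).
    """
    # Best (smallest) RMSD achieved by any generated conformation
    # for each reference conformation.
    best = [min(row) for row in rmsd]
    cov = sum(1 for d in best if d <= threshold) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


# Toy example: 3 reference conformations, 3 generated conformations.
rmsd = [
    [0.2, 0.9, 1.4],
    [0.8, 0.7, 1.1],
    [1.6, 1.3, 0.4],
]
cov, mat = cov_mat(rmsd, threshold=0.5)
print(f"COV = {cov:.3f}, MAT = {mat:.3f}")
```

A higher COV indicates better diversity of the generated set, while a lower MAT indicates better accuracy.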

### Task 2. Property Prediction

For property prediction, we use a small split of qm9 that is different from the one used in the Conformation Generation task. This split is also provided in the [google drive folder]. Generating conformations and evaluating the mean absolute error (MAE) metric on this split can be done with the following commands:

    python test.py ${log}/${model}/checkpoints/${iter}.pt --num_confs 50 \
        --start_idx 0 --test_set data/GEOM/QM9/qm9_property.pkl
    python eval_prop.py --generated ${log}/${model}/${sample}/sample_all.pkl
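For reference, the mean absolute error is the average absolute difference between predicted and reference values. The sketch below uses made-up numbers and a hypothetical function name; which properties eval_prop.py compares, and how it aggregates them, is not shown here.

```python
def mean_absolute_error(predicted, reference):
    """Average absolute difference between two equal-length sequences."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


# Illustrative values only (not results from the paper).
predicted = [1.0, 2.5, -0.5]
reference = [1.5, 2.0, 0.5]
mae = mean_absolute_error(predicted, reference)
```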

## Visualizing molecules with PyMol

Here we also provide a guideline for visualizing molecules with PyMol. The guideline is borrowed from previous work ConfGF's [github page].

### Start Setup

1. pymol -R
2. Display - Background - White
3. Display - Color Space - CMYK
4. Display - Quality - Maximal Quality
5. Display - Grid
   1. by object: use set grid_slot, int, mol_name to put the molecule into the corresponding slot
   2. by state: align all conformations in a single slot
   3. by object-state: align all conformations and put them in separate slots. (grid_slot doesn't work!)
6. Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5
7. Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0

### Show Molecule

1. To show molecules:
   1. hide everything
   2. show sticks
2. To align molecules: align name1, name2
3. Convert RDKit mol to PyMol:

       from rdkit import Chem
       from rdkit.Chem import PyMol

       # Requires a PyMol instance started with `pymol -R` (see Start Setup)
       v = PyMol.MolViewer()
       rdmol = Chem.MolFromSmiles('C')
       v.ShowMol(rdmol, name='mol')
       v.SaveFile('mol.pkl')

## Citation

Please consider citing our paper if you find it helpful. Thank you!

@inproceedings{
xu2022geodiff,
title={GeoDiff: A Geometric Diffusion Model for Molecular Conformation Generation},
author={Minkai Xu and Lantao Yu and Yang Song and Chence Shi and Stefano Ermon and Jian Tang},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=PzcvxEMzvQC}
}


## Acknowledgement

This repo is built upon the previous work ConfGF's [codebase]. Thanks Chence and Shitong!

## Known issues

1. The current codebase is not compatible with more recent torch-geometric versions.
2. The current processed dataset (with PyG data object) is not compatible with more recent torch-geometric versions.
• #### Question about the theory in the paper

Hi Xu,

First of all, thanks for your nice work. I've read your paper, and I have some questions on the proof of the equivariance of the transition kernel. In detail, suppose $\mathcal{C}^t$ is roto-translation invariant, and (thus) $\mu_{\theta}(\mathcal{C}^t, \mathcal{G}, t)$ is roto-translation equivariant with the designed GNN. We need to prove that $p(\mathcal{C}^{t-1} | \mathcal{C}^{t}, \mathcal{G}, t)$ is equivariant. I wonder if it is due to the following derivation:

$$
\begin{aligned}
p(R \mathcal{C}^{t-1} + g | R \mathcal{C}^{t} + g, \mathcal{G}, t) &= \mathcal{N}(R\mathcal{C}^{t-1} + g; \boldsymbol{\mu}_{\theta}(R\mathcal{C}^t + g, \mathcal{G}, t), \sigma_t^2 \mathbf{I}) \\
&= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{-\frac{1}{2}}} e^{-\frac{1}{2}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))^T \boldsymbol{\Sigma}^{-1}(R(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta))} \\
&= \frac{1}{(2 \pi)^{\frac{p}{2}}|\boldsymbol{\Sigma}|^{-\frac{1}{2}}} e^{-\frac{1}{2}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)^T \boldsymbol{\Sigma}^{-1}(\mathcal{C}^{t-1}-\boldsymbol{\mu}_\theta)}
\end{aligned}
$$

where $\boldsymbol{\Sigma} = \sigma_t^2 \mathbf{I}$. I am not sure if it's correct, hope to receive your clarification. Thanks.
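The invariance that this derivation relies on can be checked numerically. The sketch below works in 2D for simplicity and assumes the mean is equivariant, i.e. the transformed mean is $R\boldsymbol{\mu} + g$. The helper names and the numeric values are hypothetical, and the check is only a sanity test of the isotropic Gaussian identity, not a proof.

```python
import math


def rotate(theta, v):
    """Rotate a 2D vector v by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def translate(v, g):
    """Translate a 2D vector v by g."""
    return (v[0] + g[0], v[1] + g[1])


def isotropic_gaussian_pdf(x, mu, sigma):
    """Density of N(x; mu, sigma^2 I) for vectors of equal length."""
    d = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mu))
    norm = (2 * math.pi * sigma ** 2) ** (d / 2)
    return math.exp(-sq_dist / (2 * sigma ** 2)) / norm


# Arbitrary illustrative values.
x, mu, g, theta, sigma = (0.3, -1.2), (0.1, 0.4), (2.0, -5.0), 0.7, 0.5

p_original = isotropic_gaussian_pdf(x, mu, sigma)
# Apply the same rotation and translation to both the sample and the mean.
p_transformed = isotropic_gaussian_pdf(
    translate(rotate(theta, x), g),
    translate(rotate(theta, mu), g),
    sigma,
)
densities_match = math.isclose(p_original, p_transformed)
```

The two densities agree because the translation cancels in the difference and the rotation preserves the Euclidean norm.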

opened by Frankie123421 12
• #### Provide data process script

Could you please provide the data processing script? I found that ConfGF's code does not produce features such as num_graphs, atom_type, pos, bond_index, etc.

opened by Layne-Huang 4
• #### Question about GFN and GeoDiff-A Implementation

Hi authors,

Thank you for sharing the code of your great work! According to your paper, GeoDiff has two different versions, GeoDiff-A and GeoDiff-C, but it seems that only the code of GeoDiff-C is provided. In addition, the paper mentions that a graph field network (GFN) is used, but I cannot find the code of GFN, while SchNet seems to play the role of GFN in the current implementation.

I am wondering whether the implementations of GFN and GeoDiff-A will be provided in the future? Thanks!

Best,

opened by lyzustc 2

Hi, thanks for sharing this very good work!

I found that the Google Drive link for the preprocessed dataset cannot be accessed, maybe because of a permission issue. Could you double-check? Thanks

opened by klightz 2
• #### m_{ij} notation in GNN architecture

Hi Minkai Xu,

I think there is a flaw in your mathematical notation for the GNN, in Equations 5 and 6 in particular. I found that the notation m_{ij} in Equation 5 and the notation m_{ij} in Equation 6 are not the same.

opened by CaptainCuong 1
• #### pos_ref and 3D visualization issues

Dear Minkai,

I am sorry to bother you. It seems that you did not use the time step t as a parameter in the diffusion training process. Why did you not use beta or alpha with time step t in the training process?

opened by Layne-Huang 0
• #### Questions about the rescale problem

Hi, Xu. Thanks for sharing the code. I've noticed the discussion here (https://github.com/MinkaiXu/GeoDiff/issues/11) and carefully read the code line by line. Just as you stated in issue 11, the "diffusion" process in the code is actually rescaled compared to the paper, i.e., $\mathcal{C}^t = \frac{1}{\sqrt{\alpha_t}}(\sqrt{\alpha_t}C^0 + \sqrt{1-\alpha_t}\epsilon)$. Based on the ScoreSDE paper (https://arxiv.org/abs/2011.13456), DDPM is a variance preserving process and DSM is a variance exploding one. I think there might be some typos in your answer to issue 11 which cause a contradiction: "2) use the alpha to rescale the data to achieve variation preserving" and "the problem of variation preserving is: it will change the scale of coordinates". From my perspective, after rescaling, $\mathcal{C}^t = C^0 + \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\epsilon$ is a DSM process whose variance increases with $t$. So I am confused about why this rescaling method would preserve the scale of the coordinates, since in my view it seems to corrupt the scale (increase the variance) instead.
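The growth of the noise scale in the rescaled process can be illustrated with a short numerical sketch. It assumes that $\alpha_t$ denotes the cumulative product of $(1 - \beta_s)$ and uses a hypothetical linear beta schedule. Neither the schedule endpoints nor the number of steps is taken from the GeoDiff configs.

```python
import math


def rescaled_noise_std(betas):
    """Return sqrt(1 - alpha_t) / sqrt(alpha_t) for each step t.

    alpha_t is taken to be the cumulative product of (1 - beta_s),
    which is an assumption about the notation used above.
    """
    stds = []
    alpha = 1.0
    for beta in betas:
        alpha *= 1.0 - beta
        stds.append(math.sqrt(1.0 - alpha) / math.sqrt(alpha))
    return stds


# Hypothetical linear schedule, for illustration only.
num_steps = 1000
betas = [1e-4 + (2e-2 - 1e-4) * t / (num_steps - 1) for t in range(num_steps)]
stds = rescaled_noise_std(betas)
print(f"noise std at first step: {stds[0]:.4f}, at last step: {stds[-1]:.1f}")
```

Under these assumptions the noise standard deviation grows monotonically with $t$ while the $C^0$ term keeps a coefficient of 1, which matches the variance-exploding reading described above.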

opened by Frankie123421 0
• #### Global and Local structure

Hi Minkai Xu,

What is the motivation for designing two separate architectures to learn local and global structures? In the code, the loss is divided into a local loss and a global loss: (node_eq_global - target_pos_global)**2 + (node_eq_local - target_pos_local)**2

opened by CaptainCuong 1
• #### strange thing when sampling

Hi, Minkai @MinkaiXu. Recently I have been trying to run some experiments based on GeoDiff, and I have run into some problems. First, I re-trained the model with your original hyper-parameters. I noticed that after around 800k iterations, the loss hardly decreases but oscillates wildly. I don't know whether the model truly benefits from training in the following 2m iterations, or why the loss oscillates wildly. Second, I tried to sample some conformations from the checkpoints obtained in my training procedure, but something very strange happened. Most of the molecules hit a FloatingPointError during sampling and are retried with local clipping. When I then use eval_covmat.py to evaluate the quality of the conformations, I get a COV value of 0.00 and MAT values in the thousands. This happens all the time, except for one test with 700000.pt generating molecules from 800 to 1000 (and when I repeated that test with the same parameters, the strange behavior occurred again). I don't know why it happens, because I run your code without any modification. I also wonder why you designed the local clip operation and how to choose the corresponding hyper-parameter. I would very much appreciate it if you could give me some explanation and suggestions.

opened by mayz20 1
• #### protein conformation

Hi! Thanks for the nice paper. In the paper you mentioned protein conformation is a difficult task with this model. Can you explain why linearity of proteins can be an issue? What could be some improvements?

opened by orgw 1
• #### Typo in get_edge_encoder

The function found here uses the full variable config which is not passed in for the type gaussian. I believe it should be cfg, but I guess it is never used.

opened by natolambert 1