Implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021).

MilaGraph

Last update: Dec 9, 2022

Overview

[PDF] | [Slides]

The official implementation of Learning Gradient Fields for Molecular Conformation Generation (ICML 2021 Long talk)

Installation

Install via Conda (Recommended)

# Clone the environment
conda env create -f env.yml

# Activate the environment
conda activate confgf

# Install Library
git clone https://github.com/DeepGraphLearning/ConfGF.git
cd ConfGF
python setup.py install

Install Manually

# Create conda environment
conda create -n confgf python=3.7

# Activate the environment
conda activate confgf

# Install packages
conda install -y -c pytorch pytorch=1.7.0 torchvision torchaudio cudatoolkit=10.2
conda install -y -c rdkit rdkit==2020.03.2.0
conda install -y scikit-learn pandas decorator ipython networkx tqdm matplotlib
conda install -y -c conda-forge easydict
pip install pyyaml

# Install PyTorch Geometric
pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.7.0+cu102.html
pip install torch-geometric==1.6.3

# Install Library
git clone https://github.com/DeepGraphLearning/ConfGF.git
cd ConfGF
python setup.py install
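
Both install paths pin specific releases (PyTorch 1.7.0, PyTorch Geometric 1.6.3, RDKit 2020.03.2.0), so it can help to check what actually ended up in the environment before training. The following is a minimal sketch and not part of the repository: the helper names `installed_versions` and `compare_versions` are hypothetical, and it assumes each package exposes a `__version__` attribute.

```python
import importlib

# Version pins taken from the manual install commands above.
EXPECTED = {
    "torch": "1.7.0",
    "torch_geometric": "1.6.3",
    "rdkit": "2020.03.2",
}


def installed_versions(modules):
    """Import each module and read its __version__; None if it cannot be imported."""
    found = {}
    for name in modules:
        try:
            module = importlib.import_module(name)
            found[name] = getattr(module, "__version__", None)
        except ImportError:
            found[name] = None
    return found


def compare_versions(expected, installed):
    """Return {module: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, want in expected.items():
        have = installed.get(name)
        # Prefix match so that local build tags such as "1.7.0+cu102" still pass.
        if have is None or not have.startswith(want):
            mismatches[name] = (want, have)
    return mismatches


if __name__ == "__main__":
    problems = compare_versions(EXPECTED, installed_versions(EXPECTED))
    for name, (want, have) in problems.items():
        print(f"{name}: expected {want}, found {have}")
    if not problems:
        print("All pinned packages match.")
```

A mismatch on `torch_geometric` is worth noticing early, since data objects pickled with one PyTorch Geometric release may not load cleanly under another.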

Dataset

Official Dataset

The official raw GEOM dataset is available [here].

Preprocessed dataset

We provide the preprocessed datasets (GEOM, ISO17) in a [google drive folder]. For the ISO17 dataset, we use the default split of [GraphDG].

Prepare your own GEOM dataset from scratch (optional)

Download the raw GEOM dataset and unpack it.

tar xvf ~/rdkit_folder.tar.gz -C ~/GEOM

Preprocess the raw GEOM dataset.

python script/process_GEOM_dataset.py --base_path GEOM --dataset_name qm9 --confmin 50 --confmax 500
python script/process_GEOM_dataset.py --base_path GEOM --dataset_name drugs --confmin 50 --confmax 100

The final folder structure will look like this:

GEOM
|___rdkit_folder  # raw dataset
|   |___qm9 # raw qm9 dataset
|   |___drugs # raw drugs dataset
|   |___summary_drugs.json
|   |___summary_qm9.json
|   
|___qm9_processed
|   |___train_data_40k.pkl
|   |___val_data_5k.pkl
|   |___test_data_200.pkl
|   
|___drugs_processed
|   |___train_data_39k.pkl
|   |___val_data_5k.pkl
|   |___test_data_200.pkl
|
iso17_processed
|___iso17_split-0_train_processed.pkl
|___iso17_split-0_test_processed.pkl
|
...
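
Before training, it may be worth confirming that the processed files listed above are where the scripts expect them. The sketch below is only an illustration: `missing_files` is a hypothetical helper, the file names are copied from the listing, and the ISO17 files are left out because the listing does not make their parent folder explicit.

```python
from pathlib import Path

# Relative paths copied from the folder listing above (processed GEOM files only).
EXPECTED_FILES = [
    "GEOM/qm9_processed/train_data_40k.pkl",
    "GEOM/qm9_processed/val_data_5k.pkl",
    "GEOM/qm9_processed/test_data_200.pkl",
    "GEOM/drugs_processed/train_data_39k.pkl",
    "GEOM/drugs_processed/val_data_5k.pkl",
    "GEOM/drugs_processed/test_data_200.pkl",
]


def missing_files(root, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Assumption: GEOM sits in the home directory, as in the tar command above.
    for rel in missing_files(Path.home()):
        print("missing:", rel)
```

If the dataset lives elsewhere, pass that directory as `root` instead of the home directory.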

Training

All hyper-parameters and training details are provided in config files (./config/*.yml); feel free to tune these parameters.

You can train the model with the following commands:

python -u script/train.py --config_path ./config/qm9_default.yml
python -u script/train.py --config_path ./config/drugs_default.yml
python -u script/train.py --config_path ./config/iso17_default.yml

Model checkpoints are saved to the directory specified in the config file.
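
To see which directories a given config points to, the paths can be derived from the config dictionary. The sketch below is an assumption-laden illustration, not repository code: the key names (`train.save_path`, `test.init_checkpoint`, `test.epoch`, `data.base_path`, `data.dataset`, `model.name`) are taken from the config dump quoted in the Comments section of this page, while the `checkpoint<epoch>` file name and the per-model training subfolder are inferred from the paths shown there. The function names are hypothetical.

```python
import os


def describe_paths(config):
    """Collect the filesystem locations a ConfGF-style config appears to point to."""
    train, test, data = config["train"], config["test"], config["data"]
    return {
        # Inferred: training output goes to a per-model subfolder of save_path.
        "train_save_dir": os.path.join(train["save_path"], config["model"]["name"]),
        # Inferred from the "checkpoint284" path in a reported error message.
        "test_checkpoint": os.path.join(
            test["init_checkpoint"], "checkpoint%d" % test["epoch"]
        ),
        # Matches the "loading data from /content/qm9_processed" log line.
        "processed_data_dir": os.path.join(
            data["base_path"], "%s_processed" % data["dataset"]
        ),
    }


def load_config(path):
    """Read a YAML config file (needs PyYAML, which the install steps include)."""
    import yaml  # imported lazily so describe_paths works without PyYAML

    with open(path) as handle:
        return yaml.safe_load(handle)


if __name__ == "__main__":
    for key, value in describe_paths(load_config("./config/qm9_default.yml")).items():
        print(f"{key}: {value}")
```

The shipped configs appear to contain absolute paths from the authors' machine, so these entries probably need editing before training or generation will find the files.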

Generation

We provide the checkpoints of three trained models, i.e., qm9_default, drugs_default and iso17_default in a [google drive folder].

You can generate conformations of a molecule by feeding its SMILES into the model:

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGF --smiles c1ccccc1
python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGFDist --smiles c1ccccc1

Here we use the models trained on GEOM-QM9 to generate conformations for benzene. The argument --generator selects the type of generator, i.e., ConfGF or ConfGFDist. See the ablation study (Table 5) in the original paper for more details.
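
When several molecules need to be processed, the same command can be assembled programmatically. This is a small sketch under the assumption that it is run from the repository root; `build_gen_command` and `run_batch` are hypothetical helpers, and only the flags shown above (`--config_path`, `--generator`, `--smiles`) are used.

```python
import shlex
import subprocess
import sys


def build_gen_command(config_path, generator, smiles):
    """Build the argument list for one script/gen.py call on a single SMILES."""
    if generator not in ("ConfGF", "ConfGFDist"):
        raise ValueError("generator must be 'ConfGF' or 'ConfGFDist'")
    return [
        sys.executable, "-u", "script/gen.py",
        "--config_path", config_path,
        "--generator", generator,
        "--smiles", smiles,
    ]


def run_batch(smiles_list, config_path="./config/qm9_default.yml", dry_run=True):
    """Run (or just print) one generation command per molecule and generator."""
    for smiles in smiles_list:
        for generator in ("ConfGF", "ConfGFDist"):
            command = build_gen_command(config_path, generator, smiles)
            if dry_run:
                print(" ".join(shlex.quote(part) for part in command))
            else:
                subprocess.run(command, check=True)


if __name__ == "__main__":
    # Benzene comes from the example above; ethanol (CCO) is an extra illustration.
    run_batch(["c1ccccc1", "CCO"])
```

Passing the arguments as a list avoids shell quoting problems, which matters because SMILES strings often contain characters such as parentheses, `#` and `=`.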

You can also generate conformations for an entire test set.

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGF \
                        --start 0 --end 200

python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGFDist \
                        --start 0 --end 200

python -u script/gen.py --config_path ./config/drugs_default.yml --generator ConfGF \
                        --start 0 --end 200

python -u script/gen.py --config_path ./config/drugs_default.yml --generator ConfGFDist \
                        --start 0 --end 200

Here start and end indicate the range of the test set that we want to use. All hyper-parameters related to generation can be set in config files.
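
For readers unfamiliar with the sampling procedure behind these hyper-parameters, the following toy sketch shows annealed Langevin dynamics in its standard form (Song & Ermon, 2019) on a single scalar. It is an illustration under stated assumptions, not the repository's sampler: the score function is a hand-written Gaussian stand-in for the trained network, the geometric noise schedule and the step-size rule are the textbook choices, and the numeric values merely mirror config entries such as sigma_begin, sigma_end, num_noise_level, steps_pos and step_lr_pos seen in the config dump quoted in the Comments section. The repository's exact constants may differ.

```python
import math
import random


def geometric_sigmas(sigma_begin, sigma_end, num_levels):
    """Noise levels decreasing geometrically from sigma_begin to sigma_end."""
    ratio = (sigma_end / sigma_begin) ** (1.0 / (num_levels - 1))
    return [sigma_begin * ratio ** i for i in range(num_levels)]


def toy_score(x, sigma, target=1.5):
    """Score of a Gaussian centred on `target` with std `sigma` (network stand-in)."""
    return (target - x) / sigma ** 2


def annealed_langevin(score, sigmas, steps_per_level, step_lr, rng):
    """Run annealed Langevin dynamics on a scalar and return the full trajectory."""
    x = rng.gauss(0.0, sigmas[0])  # start from broad noise
    trajectory = [x]
    for sigma in sigmas:
        # Step size shrinks together with the noise level.
        alpha = step_lr * (sigma / sigmas[-1]) ** 2
        for _ in range(steps_per_level):
            noise = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * score(x, sigma) + math.sqrt(alpha) * noise
            trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    sigmas = geometric_sigmas(10.0, 0.01, 50)
    path = annealed_langevin(toy_score, sigmas, 100, 2.4e-6, random.Random(0))
    print("final sample:", path[-1])
```

In the real model the scalar would be replaced by atomic coordinates (or interatomic distances) and the toy score by the network's estimated gradient field.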

Conformations of some drug-like molecules generated by ConfGF are provided below.

Get Results

The results of all benchmark tasks can be calculated based on generated conformations.

We report the results of each task in the following tables. Results of ConfGF and ConfGFDist are re-evaluated with the current code base and successfully reproduce the results reported in the original paper. Results of other models are taken directly from the original paper.

Task 1. Conformation Generation

The COV and MAT scores on the GEOM datasets can be calculated using the following commands:

python -u script/get_task1_results.py --input dir_of_QM9_samples --core 10 --threshold 0.5  

python -u script/get_task1_results.py --input dir_of_Drugs_samples --core 10 --threshold 1.25

Table: COV and MAT scores on GEOM-QM9

QM9	COV-Mean (%)	COV-Median (%)	MAT-Mean (Å)	MAT-Median (Å)
ConfGF	91.06	95.76	0.2649	0.2668
ConfGFDist	85.37	88.59	0.3435	0.3548
CGCF	78.05	82.48	0.4219	0.3900
GraphDG	73.33	84.21	0.4245	0.3973
CVGAE	0.09	0.00	1.6713	1.6088
RDKit	83.26	90.78	0.3447	0.2935

Table: COV and MAT scores on GEOM-Drugs

Drugs	COV-Mean (%)	COV-Median (%)	MAT-Mean (Å)	MAT-Median (Å)
ConfGF	62.54	71.32	1.1637	1.1617
ConfGFDist	49.96	48.12	1.2845	1.2827
CGCF	53.96	57.06	1.2487	1.2247
GraphDG	8.27	0.00	1.9722	1.9845
CVGAE	0.00	0.00	3.0702	2.9937
RDKit	60.91	65.70	1.2026	1.1252
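
As we understand the two metrics from the paper, COV is the fraction of reference conformations that have at least one generated conformation within the RMSD threshold (the `--threshold` value above), and MAT is the average, over reference conformations, of the smallest RMSD to any generated conformation. The sketch below computes both from a precomputed RMSD matrix for one molecule. It is not the code in `script/get_task1_results.py`; the function name is hypothetical, the RMSD values are made up, and the strict inequality is an assumption.

```python
from statistics import mean, median


def coverage_and_matching(rmsd, threshold):
    """Compute (COV, MAT) from rmsd[i][j] = RMSD(reference i, generated j)."""
    if not rmsd or not rmsd[0]:
        raise ValueError("need at least one reference and one generated conformation")
    # Smallest RMSD achieved by any generated conformation, per reference.
    best = [min(row) for row in rmsd]
    cov = sum(1 for value in best if value < threshold) / len(best)
    mat = sum(best) / len(best)
    return cov, mat


if __name__ == "__main__":
    # Two hypothetical molecules with small reference-by-generated matrices (in Å).
    per_molecule = [
        coverage_and_matching([[0.3, 0.8], [0.6, 0.9]], threshold=0.5),
        coverage_and_matching([[0.2, 0.4], [0.1, 0.7]], threshold=0.5),
    ]
    cov_values = [100 * cov for cov, _ in per_molecule]
    mat_values = [mat for _, mat in per_molecule]
    print("COV-Mean (%):", mean(cov_values), "COV-Median (%):", median(cov_values))
    print("MAT-Mean (Å):", mean(mat_values), "MAT-Median (Å):", median(mat_values))
```

The mean and median columns in the tables are then statistics of these per-molecule values across the test set.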

Task 2. Distributions Over Distances

The MMD metrics on the ISO17 dataset can be calculated using the following command:

python -u script/get_task2_results.py --input dir_of_ISO17_samples

Table: Distributions over distances

Method	Single-Mean	Single-Median	Pair-Mean	Pair-Median	All-Mean	All-Median
ConfGF	0.3430	0.2473	0.4195	0.3081	0.5432	0.3868
ConfGFDist	0.3348	0.2011	0.4080	0.2658	0.5821	0.3974
CGCF	0.4490	0.1786	0.5509	0.2734	0.8703	0.4447
GraphDG	0.7645	0.2346	0.8920	0.3287	1.1949	0.5485
CVGAE	4.1789	4.1762	4.9184	5.1856	5.9747	5.9928
RDKit	3.4513	3.1602	3.8452	3.6287	4.0866	3.7519
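
The MMD values above compare the distribution of interatomic distances in generated conformations with that in reference conformations. As a rough illustration of the quantity involved, here is a generic biased estimator of the squared maximum mean discrepancy with a Gaussian kernel. The kernel, the bandwidth and the function names are assumptions made for this sketch; the estimator actually used by `script/get_task2_results.py` may differ.

```python
import math


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, bandwidth) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    # Hypothetical samples of a single interatomic distance (in Å).
    reference = [[1.0], [1.2], [0.9]]
    generated = [[2.0], [2.2], [1.9]]
    print("MMD^2 (reference vs generated):", mmd_squared(reference, generated))
    print("MMD^2 (reference vs itself):", mmd_squared(reference, reference))
```

Each sample is a vector, so the same function covers a single distance, a pair of distances, or all distances of a molecule, which is how we read the Single, Pair and All columns.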

Visualizing molecules with PyMol

Start Setup

pymol -R
Display - Background - White
Display - Color Space - CMYK
Display - Quality - Maximal Quality
Display Grid
1. by object: use set grid_slot, int, mol_name to put the molecule into the corresponding slot
2. by state: align all conformations in a single slot
3. by object-state: align all conformations and put them in separate slots. (grid_slot doesn't work!)
Setting - Line and Sticks - Ball and Stick on - Ball and Stick ratio: 1.5
Setting - Line and Sticks - Stick radius: 0.2 - Stick Hydrogen Scale: 1.0

Show Molecule

To show molecules
1. hide everything
2. show sticks
To align molecules: align name1, name2

Convert RDKit mol to Pymol

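# Requires a running PyMOL server (started with "pymol -R", as in Start Setup).
from rdkit import Chem
from rdkit.Chem import PyMol

v = PyMol.MolViewer()             # connect to the running PyMOL instance
rdmol = Chem.MolFromSmiles('C')   # build an RDKit molecule (methane)
v.ShowMol(rdmol, name='mol')      # send the molecule to PyMOL as object "mol"
v.SaveFile('mol.pkl')             # have PyMOL save its contents to mol.pkl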

Make the trajectory for Langevin dynamics

Load a sequence of PyMOL objects named traj*.pkl into PyMOL, where traji.pkl (i being the index) is the i-th conformation in the trajectory.
Join states: join_states mol, traj*, 0
Delete useless object: delete traj*
Movie - Program - State Loop - Full Speed
Export the movie to a sequence of PNG files: File - Export Movie As - PNG Images
Use Photoshop to convert the PNG sequence to a GIF with a transparent background.

Citation

Please consider citing the following paper if you find our code helpful. Thank you!

@inproceedings{shi*2021confgf,
title={Learning Gradient Fields for Molecular Conformation Generation},
author={Shi, Chence and Luo, Shitong and Xu, Minkai and Tang, Jian},
booktitle={International Conference on Machine Learning},
year={2021}
}

Contact

Chence Shi ([email protected])

Comments

There is no summary file in GEOM's QM9 archive.

https://github.com/DeepGraphLearning/ConfGF/blob/38aeb6c7719343d13fa867f4b17b02ed45d09bd0/confgf/dataset/dataset.py#L185

Hi, when I went to the Harvard webpage to look for the dataset, I saw two QM9 files, qm9_crude.msgpack.tar.gz and qm9_featurized.msgpack.tar.gz, but neither of them contains the summary file. Can you give me the link to the exact GEOM QM9 archive? I cannot find a dataset archive that matches your code.

opened by CaptainCuong 0
An error on "Generate conformations of a molecule by feeding its SMILES into the model"

I downloaded the checkpoint from the Google Drive folder provided by the authors and then tried to "Generate conformations of a molecule by feeding its SMILES into the model".

I placed the checkpoint at XXX/ConfGF/confgf/train/qm9_default/checkpoint284 and also tried XXX/ConfGF/confgf/train/qm9_default. Both paths produced this error: FileNotFoundError: [Errno 2] No such file or directory: 'XXX/ConfGF/confgf/train/qm9_default/checkpoint284'

I would appreciate your help!

opened by HKQiu 0
"The 'data' object was created by an older version of PyG. "

Hi! Thank you for sharing your work. When I use Google Colab to run "python -u script/gen.py --config_path ./config/qm9_default.yml --generator ConfGF --smiles c1ccccc1", it returns an error: RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

could you please tell me how to fix this?

whole output:

Let's use 1 GPUs!
Using device cuda:0 as main device
{'train': {'batch_size': 128, 'seed': 2021, 'epochs': 300, 'shuffle': True, 'resume_train': False, 'eval': True, 'num_workers': 0, 'gpus': [0], 'anneal_power': 2.0, 'save': True, 'save_path': '/home/shichenc/scratch/confgf/train', 'resume_checkpoint': None, 'resume_epoch': None, 'log_interval': 400, 'optimizer': {'type': 'Adam', 'lr': 0.001, 'weight_decay': 0.0, 'dropout': 0.0}, 'scheduler': {'type': 'plateau', 'factor': 0.6, 'patience': 10, 'min_lr': '1e-4'}, 'device': device(type='cuda', index=0)}, 'test': {'init_checkpoint': '/home/shichenc/scratch/confgf/train/qm9_default', 'output_path': '/home/shichenc/scratch/confgf/test/qm9_default', 'epoch': 284, 'gen': {'dg_step_size': 3.0, 'dg_num_steps': 1000, 'steps_d': 100, 'step_lr_d': 2e-06, 'steps_pos': 100, 'step_lr_pos': 2.4e-06, 'clip': 1000, 'min_sigma': 0.0, 'verbose': 1}}, 'data': {'base_path': '/content/', 'dataset': 'qm9', 'train_set': 'train_data_40k.pkl', 'val_set': 'val_data_5k.pkl', 'test_set': 'test_data_200.pkl'}, 'model': {'name': 'qm9_default', 'hidden_dim': 256, 'num_convs': 4, 'sigma_begin': 10, 'sigma_end': 0.01, 'num_noise_level': 50, 'order': 3, 'mlp_act': 'relu', 'gnn_act': 'relu', 'cutoff': 10.0, 'short_cut': True, 'concat_hidden': False, 'noise_type': 'symmetry', 'edge_encoder': 'mlp'}}
set seed for random, numpy and torch
loading data from /content/qm9_processed
train size : 0 || val size: 0 || test size: 24068
loading data done!
got 200 molecules with 24068 confs
Traceback (most recent call last):
  File "script/gen.py", line 92, in <module>
    test_data = dataset.GEOMDataset_PackedConf(data=test_data, transform=transform)
  File "/content/ConfGF/confgf/dataset/dataset.py", line 449, in __init__
    self._pack_data_by_mol()
  File "/content/ConfGF/confgf/dataset/dataset.py", line 469, in _pack_data_by_mol
    data = copy.deepcopy(v[0])
  File "/usr/local/lib/python3.7/copy.py", line 161, in deepcopy
    y = copier(memo)
  File "/usr/local/lib/python3.7/site-packages/torch_geometric/data/data.py", line 392, in __deepcopy__
    out._store._parent = out
  File "/usr/local/lib/python3.7/site-packages/torch_geometric/data/data.py", line 358, in __getattr__
    "The 'data' object was created by an older version of PyG. "
RuntimeError: The 'data' object was created by an older version of PyG. If this error occurred while loading an already existing dataset, remove the 'processed/' directory in the dataset's root folder and try again.

opened by flashasdbaksdgi 2
Code on property prediction experiments

I failed to reproduce the Property Prediction experiments with Psi4; my results differ from the reported ones by orders of magnitude. Can you release the code for the property prediction experiments and the SMILES of the 30 randomly selected molecules?

opened by teslacool 1
How to generate the 3D SDF from the output?

Hello,

Congrats on this work. It is quite interesting.

I have some doubts on it. May you help me? I was able to run the pre-trained models (checkpoints). A pickle is generated, but how to use this pickle to generate the SDF file for the particular molecule?

I noticed that a 2D molecule is generated in this pickle...How to include the conformers -- like an SDF -- so I can use the output 3D molecule?

Cheers,

Alex
enhancement

opened by alexgcsa 2

Owner

MilaGraph

Research group led by Prof. Jian Tang at Mila-Quebec AI Institute (https://mila.quebec/) focusing on graph representation learning and graph neural networks.

