Molecular Sets (MOSES): A benchmarking platform for molecular generation models

Deep generative models are rapidly becoming popular for the discovery of new molecules and materials. Such models learn on a large collection of molecular structures and produce novel compounds. In this work, we introduce Molecular Sets (MOSES), a benchmarking platform to support research on machine learning for drug discovery. MOSES implements several popular molecular generation models and provides a set of metrics to evaluate the quality and diversity of generated molecules. With MOSES, we aim to standardize the research on molecular generation and facilitate the sharing and comparison of new models.

For more details, please refer to the paper.

If you are using MOSES in your research paper, please cite us as

@article{10.3389/fphar.2020.565644,
  title={{M}olecular {S}ets ({MOSES}): {A} {B}enchmarking {P}latform for {M}olecular {G}eneration {M}odels},
  author={Polykovskiy, Daniil and Zhebrak, Alexander and Sanchez-Lengeling, Benjamin and Golovanov, Sergey and Tatanov, Oktai and Belyaev, Stanislav and Kurbanov, Rauf and Artamonov, Aleksey and Aladinskiy, Vladimir and Veselov, Mark and Kadurin, Artur and Johansson, Simon and  Chen, Hongming and Nikolenko, Sergey and Aspuru-Guzik, Alan and Zhavoronkov, Alex},
  journal={Frontiers in Pharmacology},
  year={2020}
}

Dataset

We propose a benchmarking dataset refined from the ZINC database.

The set is based on the ZINC Clean Leads collection. It contains 4,591,276 molecules in total, filtered by molecular weight in the range from 250 to 350 Daltons, a number of rotatable bonds not greater than 7, and XlogP less than or equal to 3.5. We removed molecules containing charged atoms or atoms besides C, N, S, O, F, Cl, Br, H or cycles longer than 8 atoms. The molecules were filtered via medicinal chemistry filters (MCFs) and PAINS filters.
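The property thresholds above can be sketched as a simple predicate. This is a minimal illustration, not the MOSES preprocessing code: it assumes the descriptors have already been computed by a cheminformatics toolkit, treats the range bounds as inclusive, and uses the hypothetical names `MolRecord` and `passes_property_filters`. The MCF and PAINS filters are not covered.

```python
from dataclasses import dataclass

# Elements allowed by the dataset description.
ALLOWED_ATOMS = {"C", "N", "S", "O", "F", "Cl", "Br", "H"}


@dataclass
class MolRecord:
    """Hypothetical container for precomputed descriptors of one molecule."""
    mol_weight: float        # Daltons
    rotatable_bonds: int
    xlogp: float
    atoms: frozenset         # element symbols present in the molecule
    has_charged_atoms: bool
    largest_ring_size: int   # 0 for acyclic molecules


def passes_property_filters(mol: MolRecord) -> bool:
    """Apply the property thresholds listed in the dataset description."""
    return (
        250 <= mol.mol_weight <= 350       # assumed inclusive bounds
        and mol.rotatable_bonds <= 7
        and mol.xlogp <= 3.5
        and not mol.has_charged_atoms
        and mol.atoms <= ALLOWED_ATOMS     # subset test: no other elements
        and mol.largest_ring_size <= 8
    )
```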

The dataset contains 1,936,962 molecular structures. For experiments, we split the dataset into training, test, and scaffold test sets containing around 1.6M, 176k, and 176k molecules, respectively. The scaffold test set contains unique Bemis-Murcko scaffolds that were not present in the training and test sets. We use this set to assess how well the model can generate previously unobserved scaffolds.
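One way to build a scaffold-disjoint hold-out set is sketched below. It is an illustration only, not the MOSES split script: it assumes every molecule already has a scaffold string (for example, a Bemis-Murcko scaffold computed with a toolkit), and `scaffold_split` with its parameters is a hypothetical name. Note that `holdout_fraction` here applies to scaffolds, not to molecules.

```python
import random
from collections import defaultdict


def scaffold_split(mol_to_scaffold, holdout_fraction=0.1, seed=0):
    """Split molecules so held-out scaffolds never occur in the remainder.

    mol_to_scaffold: dict mapping a molecule (e.g. SMILES) to its scaffold.
    Returns (remaining_molecules, scaffold_test_molecules).
    """
    by_scaffold = defaultdict(list)
    for mol, scaffold in mol_to_scaffold.items():
        by_scaffold[scaffold].append(mol)

    scaffolds = sorted(by_scaffold)           # fixed order before shuffling
    random.Random(seed).shuffle(scaffolds)
    n_holdout = max(1, int(len(scaffolds) * holdout_fraction))
    holdout = set(scaffolds[:n_holdout])

    remaining, scaffold_test = [], []
    for scaffold, mols in by_scaffold.items():
        (scaffold_test if scaffold in holdout else remaining).extend(mols)
    return remaining, scaffold_test
```

The remaining molecules can then be split at random into training and test sets.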

Models

Metrics

Besides standard uniqueness and validity metrics, MOSES provides other metrics to assess the overall quality of generated molecules. Fragment similarity (Frag) and Scaffold similarity (Scaf) are cosine similarities between vectors of fragment or scaffold frequencies, respectively, of the generated and test sets. Nearest neighbor similarity (SNN) is the average similarity of generated molecules to the nearest molecule from the test set. Internal diversity (IntDiv) is one minus the average pairwise similarity of generated molecules. Fréchet ChemNet Distance (FCD) measures the difference in distributions of last layer activations of ChemNet. Novelty is the fraction of unique valid generated molecules not present in the training set.
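To make some of these definitions concrete, here is a minimal pure-Python sketch. It is not the MOSES implementation: it assumes molecules are given as canonical SMILES strings, fragments or scaffolds as `Counter` frequency tables, and fingerprints as sets of on-bits. Internal diversity is computed here as one minus the mean pairwise Tanimoto similarity, which is an assumption of this sketch. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def unique_at_k(valid_smiles, k):
    """Fraction of unique strings among the first k valid molecules."""
    head = valid_smiles[:k]
    return len(set(head)) / len(head)


def novelty(valid_smiles, train_smiles):
    """Fraction of unique generated molecules absent from the training set."""
    unique = set(valid_smiles)
    return len(unique - set(train_smiles)) / len(unique)


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two frequency vectors stored as Counters."""
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[key] * counts_b[key] for key in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b)


def tanimoto(bits_a, bits_b):
    """Tanimoto similarity between fingerprints given as sets of on-bits."""
    return len(bits_a & bits_b) / len(bits_a | bits_b)


def internal_diversity(fingerprints):
    """One minus the mean Tanimoto similarity over all distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    return 1 - sum(tanimoto(a, b) for a, b in pairs) / len(pairs)
```

For real evaluations, use `moses.get_all_metrics`, which handles SMILES canonicalization and fingerprint computation.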

Model	Valid (↑)	Unique@1k (↑)	Unique@10k (↑)	FCD Test (↓)	FCD TestSF (↓)	SNN Test (↑)	SNN TestSF (↑)	Frag Test (↑)	Frag TestSF (↑)	Scaf Test (↑)	Scaf TestSF (↑)	IntDiv (↑)	IntDiv2 (↑)	Filters (↑)	Novelty (↑)
Train	1.0	1.0	1.0	0.008	0.4755	0.6419	0.5859	1.0	0.9986	0.9907	0.0	0.8567	0.8508	1.0	1.0
HMM	0.076±0.0322	0.623±0.1224	0.5671±0.1424	24.4661±2.5251	25.4312±2.5599	0.3876±0.0107	0.3795±0.0107	0.5754±0.1224	0.5681±0.1218	0.2065±0.0481	0.049±0.018	0.8466±0.0403	0.8104±0.0507	0.9024±0.0489	0.9994±0.001
NGram	0.2376±0.0025	0.974±0.0108	0.9217±0.0019	5.5069±0.1027	6.2306±0.0966	0.5209±0.001	0.4997±0.0005	0.9846±0.0012	0.9815±0.0012	0.5302±0.0163	0.0977±0.0142	0.8738±0.0002	0.8644±0.0002	0.9582±0.001	0.9694±0.001
Combinatorial	1.0±0.0	0.9983±0.0015	0.9909±0.0009	4.2375±0.037	4.5113±0.0274	0.4514±0.0003	0.4388±0.0002	0.9912±0.0004	0.9904±0.0003	0.4445±0.0056	0.0865±0.0027	0.8732±0.0002	0.8666±0.0002	0.9557±0.0018	0.9878±0.0008
CharRNN	0.9748±0.0264	1.0±0.0	0.9994±0.0003	0.0732±0.0247	0.5204±0.0379	0.6015±0.0206	0.5649±0.0142	0.9998±0.0002	0.9983±0.0003	0.9242±0.0058	0.1101±0.0081	0.8562±0.0005	0.8503±0.0005	0.9943±0.0034	0.8419±0.0509
AAE	0.9368±0.0341	1.0±0.0	0.9973±0.002	0.5555±0.2033	1.0572±0.2375	0.6081±0.0043	0.5677±0.0045	0.991±0.0051	0.9905±0.0039	0.9022±0.0375	0.0789±0.009	0.8557±0.0031	0.8499±0.003	0.996±0.0006	0.7931±0.0285
VAE	0.9767±0.0012	1.0±0.0	0.9984±0.0005	0.099±0.0125	0.567±0.0338	0.6257±0.0005	0.5783±0.0008	0.9994±0.0001	0.9984±0.0003	0.9386±0.0021	0.0588±0.0095	0.8558±0.0004	0.8498±0.0004	0.997±0.0002	0.6949±0.0069
JTN-VAE	1.0±0.0	1.0±0.0	0.9996±0.0003	0.3954±0.0234	0.9382±0.0531	0.5477±0.0076	0.5194±0.007	0.9965±0.0003	0.9947±0.0002	0.8964±0.0039	0.1009±0.0105	0.8551±0.0034	0.8493±0.0035	0.976±0.0016	0.9143±0.0058
LatentGAN	0.8966±0.0029	1.0±0.0	0.9968±0.0002	0.2968±0.0087	0.8281±0.0117	0.5371±0.0004	0.5132±0.0002	0.9986±0.0004	0.9972±0.0007	0.8867±0.0009	0.1072±0.0098	0.8565±0.0007	0.8505±0.0006	0.9735±0.0006	0.9498±0.0006

For comparison of molecular properties, we computed the Wasserstein-1 distance between distributions of molecules in the generated and test sets. Below, we provide plots for lipophilicity (logP), Synthetic Accessibility (SA), Quantitative Estimation of Drug-likeness (QED) and molecular weight.

[Figure: distribution plots for logP, SA, molecular weight, and QED]
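For one-dimensional property values such as logP or molecular weight, the Wasserstein-1 distance equals the area between the two empirical cumulative distribution functions. The sketch below illustrates that computation in pure Python. It is not the code MOSES uses, and `wasserstein_1` is a hypothetical name.

```python
def wasserstein_1(sample_a, sample_b):
    """Wasserstein-1 distance between two 1-D empirical distributions.

    Computed as the integral of |F_a(x) - F_b(x)| over x, where F_a and F_b
    are the empirical cumulative distribution functions of the samples.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(a + b)
    total, i, j = 0.0, 0, 0
    for x, next_x in zip(points, points[1:]):
        # Advance the counters so i/len(a) and j/len(b) equal the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        total += abs(i / len(a) - j / len(b)) * (next_x - x)
    return total
```

Shifting every value of a sample by a constant `c` gives a distance of `|c|`, which makes a convenient sanity check. `scipy.stats.wasserstein_distance` computes the same quantity for 1-D samples.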

Installation

PyPI

The simplest way to install MOSES (models and metrics) is to install RDKit first:

conda install -yq -c rdkit rdkit

and then install MOSES (the molsets package) with pip:

pip install molsets

To use LatentGAN, also install its additional dependencies:

bash install_latentgan_dependencies.sh

On Ubuntu, RDKit also needs libxrender1 and libxext6:

sudo apt-get install libxrender1 libxext6

Docker

Install docker and nvidia-docker.
Pull an existing image (a 4.1 GB download) from DockerHub:

docker pull molecularsets/moses

or clone the repository and build it manually:

git clone https://github.com/molecularsets/moses.git
nvidia-docker image build --tag molecularsets/moses moses/

Create a container:

nvidia-docker run -it --name moses --network="host" --shm-size 10G molecularsets/moses

The dataset and source code are available inside the docker container at /moses:

docker exec -it moses bash

Manually

Alternatively, install dependencies and MOSES manually.

Clone the repository:

git lfs install
git clone https://github.com/molecularsets/moses.git

Install RDKit for metrics calculation.
Install MOSES:

python setup.py install

(Optional) Install dependencies for LatentGAN:

bash install_latentgan_dependencies.sh

Benchmarking your models

Install MOSES as described in the previous section.
Get train, test and test_scaffolds datasets using the following code:

import moses

train = moses.get_dataset('train')
test = moses.get_dataset('test')
test_scaffolds = moses.get_dataset('test_scaffolds')

You can use a standard torch DataLoader in your models. We provide a simple StringDataset class for convenience:

import moses
from torch.utils.data import DataLoader
from moses import CharVocab, StringDataset

train = moses.get_dataset('train')
vocab = CharVocab.from_data(train)
train_dataset = StringDataset(vocab, train)
train_dataloader = DataLoader(
    train_dataset, batch_size=512,
    shuffle=True, collate_fn=train_dataset.default_collate
)

for with_bos, with_eos, lengths in train_dataloader:
    ...

Calculate metrics from your model's samples. We recommend sampling at least 30,000 molecules:

import moses
metrics = moses.get_all_metrics(list_of_generated_smiles)

Add generated samples and metrics to your repository. Run the experiment multiple times to estimate the variance of the metrics.
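Results from repeated runs can be summarized as mean ± standard deviation, the format used in the metrics table. The sketch below is one way to do this with the standard library. `summarize_runs` is a hypothetical helper, the metric keys and numbers are illustrative only, and it uses the sample standard deviation, which may differ from the convention used for the published table.

```python
from statistics import mean, stdev


def summarize_runs(runs):
    """Aggregate per-run metric dicts into {metric: (mean, std)}.

    runs: list of dicts with identical keys, one dict per independent run.
    """
    summary = {}
    for name in runs[0]:
        values = [run[name] for run in runs]
        # stdev needs at least two values; report 0.0 for a single run.
        std = stdev(values) if len(values) > 1 else 0.0
        summary[name] = (mean(values), std)
    return summary


# Illustrative numbers only, not real benchmark results.
runs = [
    {"valid": 0.90, "novelty": 0.80},
    {"valid": 0.92, "novelty": 0.78},
    {"valid": 0.94, "novelty": 0.82},
]
for name, (avg, std) in summarize_runs(runs).items():
    print(f"{name}: {avg:.4f}±{std:.4f}")
```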

Reproducing the baselines

End-to-End launch

You can run pretty much everything with:

python scripts/run.py

This will split the dataset, train the models, generate new molecules, and calculate the metrics. Evaluation results will be saved in metrics.csv.

To select a device (cuda:n for GPU index n, or cpu) and/or a single model, run:

python scripts/run.py --device cuda:1 --model aae

For more details run python scripts/run.py --help.

You can reproduce evaluation of all models with several seeds by running:

sh scripts/run_all_models.sh

Training

python scripts/train.py <model name> \
       --train_load <train dataset> \
       --model_save <path to model> \
       --config_save <path to config> \
       --vocab_save <path to vocabulary>

To get a list of supported models, run python scripts/train.py --help.

For details on a specific model's options, run python scripts/train.py <model name> --help.

Generation

python scripts/sample.py <model name> \
       --model_load <path to model> \
       --vocab_load <path to vocabulary> \
       --config_load <path to config> \
       --n_samples <number of samples> \
       --gen_save <path to generated dataset>

To get a list of supported models, run python scripts/sample.py --help.

For details on a specific model's options, run python scripts/sample.py <model name> --help.

Evaluation

python scripts/eval.py \
       --ref_path <reference dataset> \
       --gen_path <generated dataset>

For more details run python scripts/eval.py --help.
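The three steps above can be chained from Python. The sketch below only assembles the command lines, using the flags shown in this section. `build_pipeline` and all file names are hypothetical placeholders, so adjust them to your setup.

```python
import shlex


def build_pipeline(model, train_path, test_path, out_dir, n_samples):
    """Return the train, sample, and eval commands as argument lists."""
    # Hypothetical output locations; any writable paths will do.
    model_path = f"{out_dir}/{model}_model.pt"
    config_path = f"{out_dir}/{model}_config.pt"
    vocab_path = f"{out_dir}/{model}_vocab.pt"
    gen_path = f"{out_dir}/{model}_generated.csv"

    train = ["python", "scripts/train.py", model,
             "--train_load", train_path,
             "--model_save", model_path,
             "--config_save", config_path,
             "--vocab_save", vocab_path]
    sample = ["python", "scripts/sample.py", model,
              "--model_load", model_path,
              "--vocab_load", vocab_path,
              "--config_load", config_path,
              "--n_samples", str(n_samples),
              "--gen_save", gen_path]
    evaluate = ["python", "scripts/eval.py",
                "--ref_path", test_path,
                "--gen_path", gen_path]
    return [train, sample, evaluate]


# Print the commands for inspection; paths here are placeholders.
for cmd in build_pipeline("aae", "data/train.csv", "data/test.csv",
                          "checkpoints", 30000):
    print(shlex.join(cmd))
```

To execute the steps, pass each list to `subprocess.run(cmd, check=True)` from the repository root.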
