Graph2SMILES

A graph-to-sequence model for one-step retrosynthesis and reaction outcome prediction.

1. Environment setup

System requirements

Ubuntu: >= 16.04
conda: >= 4.0
GPU: at least 8 GB of memory, with CUDA >= 10.1

Note: there is a known compatibility issue with the RTX 3090, which requires PyTorch >= 1.8.0. The code has not been heavily tested under 1.8.0, so our best advice is to use a different GPU.

Using conda

Please ensure that conda has been properly initialized, i.e., that conda activate is runnable. Then run

bash -i scripts/setup.sh
conda activate graph2smiles
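
After activation, it is worth a quick sanity check (an optional one-liner, not part of the repo scripts) that PyTorch was installed with working CUDA support:

python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

This should print the PyTorch version and True on a correctly configured GPU machine.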

2. Data preparation

Download the raw (cleaned and tokenized) data from Google Drive by

python scripts/download_raw_data.py --data_name=USPTO_50k
python scripts/download_raw_data.py --data_name=USPTO_full
python scripts/download_raw_data.py --data_name=USPTO_480k
python scripts/download_raw_data.py --data_name=USPTO_STEREO

It is okay to download only the dataset(s) you want. For each dataset, modify the following environment variables in scripts/preprocess.sh (an illustrative configuration follows the list):

DATASET: one of [USPTO_50k, USPTO_full, USPTO_480k, USPTO_STEREO]
TASK: retrosynthesis for 50k and full, or reaction_prediction for 480k and STEREO
N_WORKERS: number of CPU cores (for parallel preprocessing)
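
For example, to preprocess USPTO_50k for retrosynthesis on 8 CPU cores, the variables might be set as follows (illustrative values only; the rest of the script stays unchanged):

DATASET=USPTO_50k
TASK=retrosynthesis
N_WORKERS=8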

Then run the preprocessing script by

sh scripts/preprocess.sh

3. Model training and validation

Modify the following environment variables in scripts/train_g2s.sh (an illustrative configuration follows the list):

EXP_NO: your own identifier (any string) for logging and tracking
DATASET: one of [USPTO_50k, USPTO_full, USPTO_480k, USPTO_STEREO]
TASK: retrosynthesis for 50k and full, or reaction_prediction for 480k and STEREO
MPN_TYPE: one of [dgcn, dgat]
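
For example, to train a dgcn-based model on USPTO_50k for retrosynthesis (EXP_NO is any identifier you choose; values are illustrative):

EXP_NO=example_run_1
DATASET=USPTO_50k
TASK=retrosynthesis
MPN_TYPE=dgcn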

Then run the training script by

sh scripts/train_g2s.sh

The training process regularly evaluates on the validation set, both with and without teacher forcing. This evaluation mostly reports top-1 accuracy; to get all the top-n accuracies on the validation set, you can run a holistic evaluation after training finishes. To do that, first modify the following environment variables in scripts/validate.sh (an illustrative configuration follows the list):

EXP_NO: your own identifier (any string) for logging and tracking
DATASET: one of [USPTO_50k, USPTO_full, USPTO_480k, USPTO_STEREO]
CHECKPOINT: the folder containing the checkpoints
FIRST_STEP: the step of the first checkpoint to be evaluated
LAST_STEP: the step of the last checkpoint to be evaluated
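
For example (illustrative values; the checkpoint folder name and step range depend on your own training run):

EXP_NO=example_run_1
DATASET=USPTO_50k
CHECKPOINT=./checkpoints/example_run_1    # hypothetical folder created during training
FIRST_STEP=10000
LAST_STEP=100000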

Then run the evaluation script by

sh scripts/validate.sh

Note: the evaluation process performs beam search over the whole validation set for every checkpoint in the range, which can take tens of hours.

We provide pretrained model checkpoints for all four datasets with both dgcn and dgat, which can be downloaded from Google Drive with

python scripts/download_checkpoints.py --data_name=$DATASET --mpn_type=$MPN_TYPE

using any combination of DATASET and MPN_TYPE.
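
For example, to fetch the dgcn checkpoint trained on USPTO_50k:

python scripts/download_checkpoints.py --data_name=USPTO_50k --mpn_type=dgcn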

4. Testing

Modify the following environment variables in scripts/predict.sh (an illustrative configuration follows the list):

EXP_NO: your own identifier (any string) for logging and tracking
DATASET: one of [USPTO_50k, USPTO_full, USPTO_480k, USPTO_STEREO]
CHECKPOINT: the path to the checkpoint (which is a .pt file)
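
For example (the checkpoint path is hypothetical; point it to the .pt file you actually want to test, e.g. one downloaded above):

EXP_NO=example_run_1
DATASET=USPTO_50k
CHECKPOINT=./checkpoints/example_run_1/model.pt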

Then run the testing script by

sh scripts/predict.sh

which will first run beam search to generate results for all the test inputs, and then compute the average top-n accuracies.

Comments
  • reaction prediction

    The model throws the error "RuntimeError: CUDA error: device-side assert triggered" (graph2seq_series_rel.py, line 125, in forward, at memory_lengths=memory_lengths) when I run reaction prediction, but not when I run retrosynthesis.

    opened by WYejian 3
  • Large data, configuration?

    Hi, I tried to train the model on ~230k reactions, but within 24 hours it reached only 1.3k steps. This seems too slow, so I wonder whether I am missing some parameter or other detail. Also, how many steps would you suggest training for to get decent results?

    Apart from this, the setup and scripts run pretty smoothly!

    opened by vthost 1
  • I encountered a problem while training the model.

    File "train.py", line 308, in main(args) File "train.py", line 127, in main loss, acc = model(batch) File "/root/anaconda3/envs/g2s/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/Z70177/prog/wzp/g2s/models/graph2seq_series_rel.py", line 114, in forward padded_memory_bank, memory_lengths = self.encode_and_reshape(reaction_batch) File "/Z70177/prog/wzp/g2s/models/graph2seq_series_rel.py", line 106, in encode_and_reshape reaction_batch.distances File "/root/anaconda3/envs/g2s/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/Z70177/prog/wzp/g2s/models/attention_xl.py", line 263, in forward out = layer(out, mask, distances) File "/root/anaconda3/envs/g2s/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/Z70177/prog/wzp/g2s/models/attention_xl.py", line 202, in forward context, _ = self.self_attn(input_norm, mask=mask, distances=distances) File "/root/anaconda3/envs/g2s/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl result = self.forward(*input, **kwargs) File "/Z70177/prog/wzp/g2s/models/attention_xl.py", line 139, in forward b_d = torch.matmul(query + v, rel_emb_t RuntimeError: CUDA error: device-side assert triggered Please help me solve it!

    opened by zpking 1
  • Why pad `a_graph` and `b_graph` to length 11?

    I am interested in your work. While reading the code, I found the following padding logic during preprocessing, in get_graph_features_from_smi in data_utils.py:

        # padding
        for a_graph in a_graphs:
            while len(a_graph) < 11:            # OH MY GOODNESS... Fe can be bonded to 10...
                a_graph.append(1e9)
    
        for b_graph in b_graphs:
            while len(b_graph) < 11:            # OH MY GOODNESS... Fe can be bonded to 10...
                b_graph.append(1e9)
    

    I cannot understand why the atom and bond graphs need to be padded to length 11. Could you tell me something about it? Thanks a lot.

    opened by yippp 1
  • inconsistent results in the paper

    Hi,

    Would you please explain why the performance of GraphRetro [1] in Table 3 of your paper is much lower than that reported in the original workshop paper [1]? I further noticed that one of the authors of Graph2SMILES is also an author of GraphRetro.

    [1] "Learning graph models for template-free retrosynthe". ICML Workshop, 2020

    opened by hhr114 1