GeoMol: Torsional Geometric Generation of Molecular 3D Conformer Ensembles

Last update: Dec 20, 2022

Overview

This repository contains a method to generate 3D conformer ensembles directly from the molecular graph as described in our paper.

Requirements

python (version>=3.7.9)
pytorch (version>=1.7.0)
rdkit (version>=2020.03.2)
pytorch-geometric (version>=1.6.3)
networkx (version>=2.5.1)
pot (version>=0.7.0)
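
The minimum versions above can be compared against whatever is installed with a small helper. This is a minimal sketch, not part of the repository: the function names are hypothetical, the dictionary keys simply reuse the names from the list above, and pre-release suffixes such as `rc1` are not handled.

```python
import re

# Minimum versions copied from the requirements list above.
MINIMUM_VERSIONS = {
    "python": "3.7.9",
    "pytorch": "1.7.0",
    "rdkit": "2020.03.2",
    "pytorch-geometric": "1.6.3",
    "networkx": "2.5.1",
    "pot": "0.7.0",
}


def parse_version(version):
    """Turn '1.7.0+cu110' into (1, 7, 0); anything after '+' is ignored."""
    public = version.split("+", 1)[0]
    return tuple(int(part) for part in re.findall(r"\d+", public))


def pad(version_tuple, width):
    """Right-pad a version tuple with zeros so '1.7' compares like '1.7.0'."""
    return version_tuple + (0,) * (width - len(version_tuple))


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum."""
    a, b = parse_version(installed), parse_version(minimum)
    width = max(len(a), len(b))
    return pad(a, width) >= pad(b, width)


print(meets_minimum("1.7.0+cu110", MINIMUM_VERSIONS["pytorch"]))  # True
print(meets_minimum("1.6.2", MINIMUM_VERSIONS["pytorch-geometric"]))  # False
```

Meeting a minimum is not a guarantee of compatibility: several of the issue reports on this page involve newer releases of torch and torch-geometric.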

Installation

Data

Download and extract the GEOM dataset from the original source:

wget https://dataverse.harvard.edu/api/access/datafile/4327252
tar -xvf 4327252

Environment

Run make conda_env to create the conda environment. The script prompts you to enter one of the supported CUDA versions listed here, and uses that CUDA version to install PyTorch and PyTorch Geometric. Alternatively, you can manually follow the steps to install PyTorch Geometric here.

Usage

Extracting the GEOM archive should result in two directories, one for each half of GEOM. Place the qm9 conformers directory in the data/QM9/ directory and do the same for the drugs directory. This is all you need to train the model:

python train.py --data_dir data/QM9/qm9/ --split_path data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9

Use the provided script to generate conformers. The test_csv arg should be a csv file with SMILES in the first column, and the number of conformers you want to generate in the second column. This will output a compressed dictionary of rdkit mols in the trained_model_dir directory (unless you provide the out arg):

python generate_confs.py --trained_model_dir trained_models/qm9/ --test_csv data/QM9/test_smiles.csv --dataset qm9
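
A file in the layout described above (SMILES in the first column, conformer count in the second) could be produced as follows. This is a sketch under assumptions: the helper name and the header names `smiles` and `n_conformers` are hypothetical, and whether generate_confs.py expects a header row at all should be checked against the script or the provided test_smiles.csv.

```python
import csv


def write_test_csv(path, rows, header=("smiles", "n_conformers")):
    """Write (SMILES, conformer count) pairs with SMILES in the first column.

    The header names are placeholders; pass header=None to omit the header row.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        if header is not None:
            writer.writerow(header)
        for smiles, n_confs in rows:
            writer.writerow([smiles, int(n_confs)])


if __name__ == "__main__":
    # Two small example molecules: ethanol and acetic acid.
    write_test_csv("my_smiles.csv", [("CCO", 10), ("CC(=O)O", 20)])
```

The resulting path would then be passed as the test_csv argument.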

You can use the provided visualize_confs.ipynb jupyter notebook to visualize the generated conformers.

Additional comments

Training

To train the model, our code randomly samples files from the GEOM dataset and randomly samples conformers within those files. This is a lot of file I/O, which wasn't a huge issue for us when training, but could be an issue for others. If you're having issues with this, feel free to reach out, and I can help you reconfigure the code.

Some limitations

Currently, the model is hardcoded for atoms with a max of 4 neighbors. Since the dataset we train on didn't have atoms with more than 4 neighbors, we made this choice to speed up the code. In principle, the code can be adapted for something like a pentavalent phosphorus, but this wasn't a priority for us.

We can't deal with disconnected fragments (i.e. there is a "." in the SMILES).

This code will work poorly for macrocycles.

To ensure correct predictions, ALL tetrahedral chiral centers must be specified. There's probably a way to automate the specification of "rigid" chiral centers (e.g. in a fused ring), which I'll hopefully figure out soon, but I'm a grad student with limited time :(
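
Of the limitations above, the disconnected-fragment case can be screened with plain string handling before any SMILES reaches the model. The sketch below is illustrative only and the function name is hypothetical. The other limitations (neighbor counts, macrocycles, unspecified chiral centers) would need a cheminformatics toolkit such as RDKit and are not covered here.

```python
def split_by_fragment_support(smiles_list):
    """Separate single-fragment SMILES from those containing '.'."""
    supported, rejected = [], []
    for smiles in smiles_list:
        # In SMILES, '.' separates disconnected components (e.g. salts).
        if "." in smiles:
            rejected.append(smiles)
        else:
            supported.append(smiles)
    return supported, rejected


ok, skipped = split_by_fragment_support(["CCO", "[Na+].[Cl-]", "c1ccccc1"])
print(ok)       # ['CCO', 'c1ccccc1']
print(skipped)  # ['[Na+].[Cl-]']
```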

Feedback and collaboration

Code like this doesn't improve without feedback from the community. If you have comments/suggestions, please reach out to us! We're always happy to chat and provide input on how you can take this method to the next level.

Comments

OS Error with torch-sparse

I was trying to run this repository with the QM9 dataset. First I ran into the issue that was reported in issues #2 and #4.

Based on that, I tried downgrading torch to 1.7.0 and torch-geometric to both 1.6.3 and 1.7.2. However, I was unable to get past the error below. I looked for other solutions to this error but could not find many resources apart from this one here.

If the owner of this repository could share a requirements file, I would be able to create an environment where this code can run.

Let me know if more info is needed from my side.

:~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
Traceback (most recent call last):
  File "train.py", line 9, in <module>
    from model.model import GeoMol
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
    import torch_geometric as tg
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
    import torch_geometric.data
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
    from .data import Data
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
    from torch_sparse import coalesce, SparseTensor
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 19, in <module>
    torch.ops.load_library(spec.origin)
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_ops.py", line 105, in load_library
    ctypes.CDLL(path)
  File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/ctypes/__init__.py", line 364, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: /home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/_version_cpu.so: undefined symbol: _ZN3c106detail12infer_schema20make_function_schemaENS_8ArrayRefINS1_11ArgumentDefEEES4_

opened by finalelement 8

Runtime Error when enumerating train_loader during training

Hi! I really appreciate your fantastic work and code. I've followed the guidance in README.md to reproduce your work. However, I received the following error when running the training process with train.py.

Describe the error

Starting training...
  0%|                                                                                                                                                       | 0/625 [00:00<?, ?it/s][11:18:30] Explicit valence for atom # 0 N, 4, is greater than permitted
  0%|                                                                                                                                                       | 0/625 [22:56<?, ?it/s]
Traceback (most recent call last):
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
    cli.main()
  File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
    run()
  File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
    runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/pubhome/qcxia02/git-repo/AI-CONF/GeoMol/train.py", line 74, in <module>
    train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
  File "/pubhome/qcxia02/git-repo/AI-CONF/GeoMol/model/training.py", line 18, in train
    for i, data in tqdm(enumerate(loader), total=len(loader)):
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
    return self._process_data(data)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
    data.reraise()
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
    raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
    data = fetcher.fetch(index)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py", line 39, in __call__
    return self.collate(batch)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py", line 20, in collate
    self.exclude_keys)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/batch.py", line 75, in from_data_list
    exclude_keys=exclude_keys,
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 86, in collate
    increment)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 142, in _collate
    data_list, stores, increment)
  File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 162, in _collate
    value = torch.cat(values, dim=cat_dim or 0)
RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 19 but got size 21 for tensor number 1 in the list.

To Reproduce

`python train.py --data_dir data/QM9/qm9/ --split_path data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9`

Expected behavior

Training completed smoothly without error

Environments:

The environment is based on the given environment.yml file. The versions of torch and related packages are listed below:

- OS: CentOS Linux release 8.4.2105
- Package versions:

python=3.7.10
pytorch=1.10.0=py3.7_cpu_0
torchaudio=0.10.0=py37_cpu
torchvision=0.11.1=py37_cpu
pytorch-cluster=1.5.9=py37_torch_1.10.0_cpu
pytorch-mutex=1.0=cpu
pytorch-scatter=2.0.9=py37_torch_1.10.0_cpu
pytorch-sparse=0.6.12=py37_torch_1.10.0_cpu
pytorch-spline-conv=1.2.1=py37_torch_1.10.0_cpu
torch-geometric=2.0.2

Additional context:

This error was raised while the data loader was being enumerated during training, i.e. at for i, data in tqdm(enumerate(loader), total=len(loader)):. The "Expected size 19 but got size 21" error in torch.cat arises because it tried to concatenate tensor B (2nd molecule) with shape 10x21x3 onto tensor A (1st molecule) with shape 10x19x3 along dimension 0 (size 10), which requires the other dimensions (19 vs. 21) to match. I'm not sure whether this behavior is expected on your side, or where to make modifications (if needed).
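
The shape rule behind that message can be checked without any tensors. The following is a minimal sketch using plain tuples; the helper name is hypothetical and the two shapes are the ones reported above. It only illustrates the rule and does not diagnose which axis the batching code should concatenate along.

```python
def can_concatenate(shapes, dim=0):
    """Concatenation rule: all sizes must match except along `dim`."""
    first = shapes[0]
    for shape in shapes[1:]:
        if len(shape) != len(first):
            return False
        for axis, (a, b) in enumerate(zip(first, shape)):
            if axis != dim and a != b:
                return False
    return True


mol_a = (10, 19, 3)  # shape reported for the first molecule
mol_b = (10, 21, 3)  # shape reported for the second molecule

print(can_concatenate([mol_a, mol_b], dim=0))  # False: 19 != 21 on axis 1
print(can_concatenate([mol_a, mol_b], dim=1))  # True: only axis 1 differs
```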

Looking forward to your reply :)

opened by qcxia20 2

Code Problem in permutations for symmetric hydrogens

Hi, thanks for the insight of this great work and for releasing the code! When reproducing training, I encountered the following errors.

In model/model.py, in GeoMol.assign_neighborhoods, I got:

File "/home/dgxtest/3D-pretrain/GeoMol-main/model/model.py", line 180, in assign_neighborhoods
    self.leaf_hydrogens[a] = self.leaf_hydrogens[a] * True if self.leaf_hydrogens[a].sum() > 1 else self.leaf_hydrogens[a] * False
RuntimeError: "mul_cuda" not implemented for 'Bool'

I took this code to be performing an XNOR operation (I am less convinced of that now, given the second error), so I changed the logic to the following, which fixed the error:

self.leaf_hydrogens[a] = ~(self.leaf_hydrogens[a] ^ True) if self.leaf_hydrogens[a].sum() > 1 else ~(self.leaf_hydrogens[a] ^ False)

But then the following error occurs:

File "/home/dgxtest/3D-pretrain/GeoMol-main/model/model.py", line 332, in ground_truth_local_stats
    n_perms[0:len(perms), self.leaf_hydrogens[a]] = perms
RuntimeError: shape mismatch: value tensor of shape [24, 4] cannot be broadcast to indexing result of shape [6, 4]

In this case, self.leaf_hydrogens[a] is [True, True, True, True], which leads to 24 permutations in perms, while n_perms is hardcoded with shape [6, 4]. I am not sure whether my modification for the first error leads to a wrong self.leaf_hydrogens in the second. Could you please help me point it out? Very much appreciated.
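
The two expressions quoted in this report can be compared on plain Python lists, which is enough to see that they have different truth tables. This is a sketch under assumptions: the function names are hypothetical, lists of booleans stand in for the boolean tensors, and it has not been confirmed that this difference is what produced the second error in the reported run.

```python
from itertools import permutations


def original_mask(mask):
    """Multiplication version: keep the mask only if it has more than one True."""
    keep = sum(mask) > 1
    return [bool(m * keep) for m in mask]


def xnor_mask(mask):
    """The XNOR rewrite quoted above, applied element by element."""
    if sum(mask) > 1:
        return [not (m ^ True) for m in mask]   # XNOR with True keeps each value
    return [not (m ^ False) for m in mask]      # XNOR with False inverts each value


one_hydrogen = [True, False, False, False]
print(original_mask(one_hydrogen))  # [False, False, False, False]
print(xnor_mask(one_hydrogen))      # [False, True, True, True]

no_hydrogen = [False, False, False, False]
print(original_mask(no_hydrogen))   # [False, False, False, False]
print(xnor_mask(no_hydrogen))       # [True, True, True, True]

# Four selected positions give 4! orderings; three give 3!.
print(len(list(permutations(range(4)))))  # 24
print(len(list(permutations(range(3)))))  # 6
```

Multiplying a boolean by True or False has the same truth table as logical AND, not XNOR, so an AND-based rewrite would preserve the original behavior for masks with at most one True entry. A mask with four True entries yields 24 permutations, which matches the [24, 4] shape in the second error, whereas a buffer with 6 rows only has room for the permutations of three positions.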

By the way, I am using torch 1.7.0+cu110 and torch-geometric 1.6.3, as mentioned in issue #2.

opened by sunyuancheng 1

Question: On the GEOM Dataset availability

Would the referenced dataset located at https://dataverse.harvard.edu/api/access/datafile/4327252 also be available to access under a F/OSS-Compliant license? And is it accessible by any other means or mirrors?

opened by Daasin 1

Add memoization of dihedral_pairs instead of computing them each iteration

Add memoization of dihedral_pairs in datasets such that they are only computed in the first epoch and then stored in memory and reused. This should speed up the code since computing the dihedral pairs previously took up 73% of the runtime in my experiments. Now, this overhead will only happen in the first epoch, and the additional memory usage is negligible.

Naming the attribute on the PyTorch Geometric Data object edge_index_dihedral_pairs causes the dihedral_pairs to be treated as edge indices during batching, so PyTorch Geometric automatically takes care of incrementing the indices of the dihedral_pairs according to the graph sizes when creating a batch.
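
The memoization idea described here can be sketched independently of the repository. The class below is hypothetical and is not the code from this pull request: `compute_fn` stands in for the dihedral-pair computation, and a plain dictionary keyed by item index serves as the in-memory cache.

```python
class MemoizedDataset:
    """Hypothetical dataset wrapper that computes each item's extra feature once."""

    def __init__(self, items, compute_fn):
        self.items = items
        self.compute_fn = compute_fn  # stand-in for the expensive computation
        self._cache = {}              # item index -> cached result

    def get(self, idx):
        # Compute on first access only; later epochs reuse the stored value.
        if idx not in self._cache:
            self._cache[idx] = self.compute_fn(self.items[idx])
        return self.items[idx], self._cache[idx]


calls = []


def expensive(item):
    calls.append(item)  # record how often the computation actually runs
    return item.upper()


dataset = MemoizedDataset(["a", "b"], expensive)
for _epoch in range(3):
    for i in range(2):
        dataset.get(i)

print(calls)  # ['a', 'b']
```

The trade-off is the one stated above: the cost is paid once per item, in exchange for keeping the results in memory.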

opened by HannesStark 1

RuntimeError: Cannot re-initialize CUDA in forked subprocess (solved)

Just in case anyone else has the same issue, I received the following error during training.

Starting training...
  0%|                                                                         | 0/625 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/grads/e/ethanycx/workspace/GeoMol/train.py", line 73, in <module>
    train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/training.py", line 18, in train
    for i, data in tqdm(enumerate(loader), total=len(loader)):
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
    return self._process_data(data)
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
    data.reraise()
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 187, in __getitem__
    data = self.get(self.indices()[idx])
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/featurization.py", line 74, in get
    data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
    keep = [t.to(device) for t in keep]
  File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in <listcomp>
    keep = [t.to(device) for t in keep]
  File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
    raise RuntimeError(
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

Versions: torch==1.7.1, torch_geometric==1.7.0

This seems to be a PyTorch issue with the DataLoader. I fixed the issue by inserting the following lines at line 18 in train.py (and indenting the later lines accordingly):

if __name__ == '__main__':
    torch.multiprocessing.set_start_method('spawn', force=True)

and changing line 240 in featurization.py to num_workers=1,.

opened by ycremar 0

A question about the direction of alpha angle?

Dear authors, thanks for your wonderful work.

In the paper, the alpha angle is the sum of many different torsion angles along the rotatable X-Y bond. But when you use this alpha angle to rotate fragments of the molecule, a question arises: is the alpha angle applied in the X->Y direction or the Y->X direction?

The figure below describes the problem. I rotate each of the fragments (LS of X) using the X<-Y direction: the bottom-left part is OK, but the top-left part is wrong. If you use the X->Y direction to rotate the top-left part, it becomes correct again (which means the alpha angle somehow has two directions). I have not been able to figure this out.

In this example, X has a larger ID than Y.

opened by lkfo415579 0

Problems about loss computation

Hi, great work! Could you please tell me the reason for subtracting the angle loss and the dihedral loss (at the bottom of the code)? Thank you!

def batch_molecule_loss(self, true_stats, model_stats, ignore_neighbors):
    """
    Compute loss for one pair of model/true molecules

    :param true_stats: tuple of masked true stat tensors (len 5)
    :param model_stats: tuple of masked model stat tensors (len 5)
        one-hop: (n_neighborhoods, 4)
        two-hop: (n_neighborhoods, 4, 4)
        angle: (n_neighborhoods, 6)
        dihedral: (2, n_dihedral_pairs, 9)
        three-hop: (n_dihedral_pairs, 9)
    :return: molecular loss for the batch (n_batch)
    """

    # unpack stats
    model_one_hop, model_two_hop, model_angles, model_dihedrals, model_three_hop = model_stats
    true_one_hop, true_two_hop, true_angles, true_dihedrals, true_three_hop = true_stats

    # calculate losses
    one_hop_loss, two_hop_loss, angle_loss = self.local_loss(true_one_hop, true_two_hop, true_angles,
                                                             model_one_hop, model_two_hop, model_angles)
    dihedral_loss, three_hop_loss = self.pair_loss(true_dihedrals, model_dihedrals, true_three_hop, model_three_hop)

    # writing
    self.one_hop_loss.append(one_hop_loss)
    self.two_hop_loss.append(two_hop_loss)
    self.angle_loss.append(angle_loss)
    self.dihedral_loss.append(dihedral_loss)
    self.three_hop_loss.append(three_hop_loss)

    if ignore_neighbors:
        return one_hop_loss + two_hop_loss - angle_loss
    else:
        return one_hop_loss + two_hop_loss - angle_loss + three_hop_loss - dihedral_loss

opened by psp3dcg 2

Expecting all tensors to be on same device, but found two device cuda:0 and cpu, when running the generate_confs.py

Hello,

I am facing an issue when trying to run generate_confs.py with the given pretrained models: I run into the error shared below. Please share your insights, and let me know whether there is a preference between GPU and CPU when running inference.

I also tried switching between cpu and gpu for the model, but no luck so far.

  0%|          | 0/1000 [02:14<?, ?it/s]
Traceback (most recent call last):
  File "/home/vishwesh/Software/pycharm-community-2021.1.1/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/vishwesh/Software/pycharm-community-2021.1.1/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/vishwesh/Code/geo_mol/GeoMol/generate_confs.py", line 63, in <module>
    model(data, inference=True, n_model_confs=n_confs*2)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 81, in forward
    self.generate_model_prediction(data.x, data.edge_index, data.edge_attr, data.batch, data.chiral_tag)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 686, in generate_model_prediction
    x1, x2, h_mol = self.embed(x, edge_index, edge_attr, batch)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 228, in embed
    x1, _ = self.gnn(x, edge_index, edge_attr)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/GNN.py", line 126, in forward
    x = self.node_init(x)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vishwesh/Code/geo_mol/GeoMol/model/GNN.py", line 40, in forward
    x = self.layers[i](x)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
    return torch._C._nn.linear(input, weight, bias)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument mat2 in method wrapper_mm)

Process finished with exit code 1

opened by finalelement 3

Questions about stereoisomer issues in the evaluation of GeoMol

https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/scripts/compare_confs.py#L49-L56

This function is used to filter out conformers whose SMILES are inconsistent with the given SMILES (in this script, corrected_smi). In my reproduction, most of the inconsistent cases are molecules with a Z/E double bond. These cases are not filtered out if isomericSmiles=False, which confuses me, and I'm not sure whether this is a mistake.

For example, conformers with SMILES Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl and Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl in the reference data will currently all be kept for comparison, although GeoMol was only used to generate conformers for Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl.

https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/model/featurization.py#L125-L126

In contrast, the code in model/featurization.py filters out the conformers whose SMILES are inconsistent with the SMILES in the dataset.

So if I use compare_confs.py to calculate the performance with isomericSmiles=False, the conformers with different isomeric SMILES are not filtered out, and the performance is the same as or even worse than before (since GeoMol generates only one stereoisomer based on the given SMILES).
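
Why a non-isomeric comparison cannot separate the two double-bond isomers can be seen at the string level. The sketch below is a crude stand-in and not what RDKit does internally: the hypothetical helper simply deletes the `/` and `\` bond-direction markers from the two SMILES quoted above, rather than canonicalizing them.

```python
def strip_double_bond_stereo(smiles):
    """Crude stand-in for a non-isomeric comparison: drop '/' and '\\' markers."""
    return smiles.replace("/", "").replace("\\", "")


# The two SMILES from the example above; they differ in one bond-direction marker.
smi_a = "Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl"
smi_b = "Cc1cc(C(=O)c2cnc(/N=C\\N(C)C)s2)c(F)cc1Cl"

print(smi_a == smi_b)  # False
print(strip_double_bond_stereo(smi_a) == strip_double_bond_stereo(smi_b))  # True
```

Once the direction markers are ignored, both strings describe the same connectivity, so a filter based on that comparison keeps the reference conformers of both isomers.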

The performance comparison between GeoMol predictions and the reference data (before using clean_confs; after using clean_confs; and with isomericSmiles=True):

**Before**
Recall Coverage: Mean = 74.78, Median = 85.00
Recall AMR: Mean = 0.9471, Median = 0.9176
Precision Coverage: Mean = 71.84, Median = 87.50
Precision AMR: Mean = 1.0035, Median = 0.9649

**After (with clean_confs, more confs are included than before)**
Recall Coverage: Mean = 74.30, Median = 90.00
Recall AMR: Mean = 0.9489, Median = 0.8797
Precision Coverage: Mean = 65.50, Median = 81.80
Precision AMR: Mean = 1.1044, Median = 1.0041

**isomericSmiles=True**
Recall Coverage: Mean = 83.38, Median = 100.00
Recall AMR: Mean = 0.8233, Median = 0.8079
Precision Coverage: Mean = 72.73, Median = 87.50
Precision AMR: Mean = 0.9833, Median = 0.8895

As you can see, with isomericSmiles=True the results in the GeoMol paper can be reproduced.

When I looked further into this issue, I found another odd thing: GeoMol generates conformers that are close in 3D geometry even when the input SMILES specify different stereoisomers. In other words, conformers that are close in 3D geometry correspond to different stereoisomers in their SMILES. This issue does not exist with RDKit ETKDG, and I am not sure whether it affects GeoMol's performance on these molecules. Here are two examples:

| SMILES | GeoMol (trans) | GeoMol (cis) | ETKDG (trans) | ETKDG (cis) |
| -- | -- | -- | -- | -- |
| O=S(=O)(_N=C(_c1ccccc1)N1CCOCC1)c1ccc(Br)cc1 | | | | |
| Cc1cc(C(=O)c2cnc(_N=C_N(C)C)s2)c(F)cc1Cl | | | | |

opened by qcxia20 0

getting errors in training and while inferencing the model

I created a new environment using your .sh file and ran the training script with the same datasets, but I get this error:

RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 21 but got size 19 for tensor number 1 in the list.

When running your generate_confs script, I get the following error at the line data = Batch.from_data_list(data_list=[tg_data]):

TypeError: argument of type 'int' is not iterable

If I pass the data directly to the model, bypassing the line above, then at the line model(tg_data, inference=True, n_model_confs=n_confs*2) I get this error:

AttributeError: 'GlobalStorage' object has no attribute 'bincount'

NOTE: When passing the data directly to the model, I changed n_atoms_per_mol = data.batch.bincount() to n_atoms_per_mol = data.bincount() in the get_neighbor_ids function of the model.utils script. If I do not change this line, the error says that a NoneType object has no attribute bincount().
opened by uttu-parashar 10
