GeoMol: Torsional Geometric Generation of Molecular 3D Conformer Ensembles


This repository contains a method to generate 3D conformer ensembles directly from the molecular graph as described in our paper.

Requirements

  • python (version>=3.7.9)
  • pytorch (version>=1.7.0)
  • rdkit (version>=2020.03.2)
  • pytorch-geometric (version>=1.6.3)
  • networkx (version>=2.5.1)
  • pot (version>=0.7.0)

Installation

Data

Download and extract the GEOM dataset from the original source:

  1. wget https://dataverse.harvard.edu/api/access/datafile/4327252
  2. tar -xvf 4327252

Environment

Run make conda_env to create the conda environment. The script will prompt you to enter one of the supported CUDA versions, and uses that version to install PyTorch and PyTorch Geometric. Alternatively, you can install PyTorch Geometric manually by following its installation instructions.

Usage

Extracting the GEOM archive should result in two directories, one for each half of GEOM. Place the qm9 conformers directory in the data/QM9/ directory and do the same for the drugs directory. This is all you need to train the model:

python train.py --data_dir data/QM9/qm9/ --split_path data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9

Use the provided script to generate conformers. The test_csv arg should be a csv file with SMILES in the first column, and the number of conformers you want to generate in the second column. This will output a compressed dictionary of rdkit mols in the trained_model_dir directory (unless you provide the out arg):

python generate_confs.py --trained_model_dir trained_models/qm9/ --test_csv data/QM9/test_smiles.csv --dataset qm9
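
As a rough illustration of the input described above, the sketch below writes a CSV with SMILES in the first column and the requested number of conformers in the second. The helper name write_test_csv, the header names, and the example SMILES are hypothetical. Whether generate_confs.py expects a header row is an assumption, so compare against data/QM9/test_smiles.csv before relying on it.

```python
import csv
import io


def write_test_csv(handle, rows, header=("smiles", "n_conformers")):
    """Write (SMILES, number of conformers) rows to an open text handle.

    The header row and its column names are assumptions for illustration.
    """
    writer = csv.writer(handle)
    writer.writerow(header)
    for smiles, n_confs in rows:
        # First column: SMILES; second column: conformers to generate.
        writer.writerow([smiles, int(n_confs)])


# Hypothetical example molecules (ethanol and benzene).
rows = [("CCO", 10), ("c1ccccc1", 2)]

buffer = io.StringIO()
write_test_csv(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with open(path, "w", newline="") in place of the StringIO buffer.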

You can use the provided visualize_confs.ipynb jupyter notebook to visualize the generated conformers.

Additional comments

Training

To train the model, our code randomly samples files from the GEOM dataset and randomly samples conformers within those files. This is a lot of file I/O, which wasn't a huge issue for us when training, but could be an issue for others. If you're having issues with this, feel free to reach out, and I can help you reconfigure the code.

Some limitations

Currently, the model is hardcoded for atoms with a max of 4 neighbors. Since the dataset we train on didn't have atoms with more than 4 neighbors, we made this choice to speed up the code. In principle, the code can be adapted for something like a pentavalent phosphorus, but this wasn't a priority for us.

We can't deal with disconnected fragments (i.e. there is a "." in the SMILES).

This code will work poorly for macrocycles.

To ensure correct predictions, ALL tetrahedral chiral centers must be specified. There's probably a way to automate the specification of "rigid" chiral centers (e.g. in a fused ring), which I'll hopefully figure out soon, but I'm a grad student with limited time :(
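
A minimal pre-filter sketch for the disconnected-fragment limitation: it separates SMILES that contain a "." from those that do not, so unsupported inputs can be dropped before generation. The function name split_supported is hypothetical, and the check covers only the "." rule; the neighbor-count, macrocycle, and chirality limitations would need a cheminformatics toolkit and are not covered here.

```python
def split_supported(smiles_list):
    """Split SMILES into (supported, rejected) by the disconnected-fragment rule.

    A "." in a SMILES string separates disconnected fragments, which the
    model cannot handle, so such entries go into the rejected list.
    """
    supported, rejected = [], []
    for smi in smiles_list:
        if "." in smi:
            rejected.append(smi)
        else:
            supported.append(smi)
    return supported, rejected


# Hypothetical inputs: ethanol is fine, sodium chloride has two fragments.
supported, rejected = split_supported(["CCO", "[Na+].[Cl-]"])
print(supported, rejected)
```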

Feedback and collaboration

Code like this doesn't improve without feedback from the community. If you have comments/suggestions, please reach out to us! We're always happy to chat and provide input on how you can take this method to the next level.

Comments
  • OS Error with torch-sparse

    I was trying to run this repository with the QM9 dataset. First I ran into the problem reported in issues #2 and #4.

    Based on that, I tried downgrading torch to 1.7.0 and torch-geometric to both 1.6.3 and 1.7.2. However, I was unable to get past the error below. I looked for other solutions to this error but could not find many resources apart from this one here.

    Perhaps if a requirement file could be shared from the owner of this repository, I would be able to create an environment where this code can run.

    Let me know if more info is needed from my side.

    :~/Code/geo_mol/GeoMol$ python train.py --data_dir /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/qm9 --split_path /home/vishwesh/Code/geo_mol/GeoMol/data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9
    Traceback (most recent call last):
      File "train.py", line 9, in <module>
        from model.model import GeoMol
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 5, in <module>
        import torch_geometric as tg
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/__init__.py", line 5, in <module>
        import torch_geometric.data
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/__init__.py", line 1, in <module>
        from .data import Data
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/data.py", line 8, in <module>
        from torch_sparse import coalesce, SparseTensor
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/__init__.py", line 19, in <module>
        torch.ops.load_library(spec.origin)
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_ops.py", line 105, in load_library
        ctypes.CDLL(path)
      File "/home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/ctypes/__init__.py", line 364, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: /home/vishwesh/anaconda3/envs/GeoMol/lib/python3.7/site-packages/torch_sparse/_version_cpu.so: undefined symbol: _ZN3c106detail12infer_schema20make_function_schemaENS_8ArrayRefINS1_11ArgumentDefEEES4_
    
    opened by finalelement 8
  • Runtime Error when enumerating train_loader during training

    Hi! I really appreciate your fantastic work and code. I've reproduced your work following the guidance in README.md. However, I received this error when running the training process with train.py.

    Describe the error

    Starting training...
      0%|                                                                                                                                                       | 0/625 [00:00<?, ?it/s][11:18:30] Explicit valence for atom # 0 N, 4, is greater than permitted
      0%|                                                                                                                                                       | 0/625 [22:56<?, ?it/s]
    Traceback (most recent call last):
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 193, in _run_module_as_main
        "__main__", mod_spec)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/__main__.py", line 45, in <module>
        cli.main()
      File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 444, in main
        run()
      File "/pubhome/qcxia02/.vscode-server/extensions/ms-python.python-2021.11.1422169775/pythonFiles/lib/python/debugpy/../debugpy/server/cli.py", line 285, in run_file
        runpy.run_path(target_as_str, run_name=compat.force_str("__main__"))
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 263, in run_path
        pkg_name=pkg_name, script_name=fname)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 96, in _run_module_code
        mod_name, mod_spec, pkg_name, script_name)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/runpy.py", line 85, in _run_code
        exec(code, run_globals)
      File "/pubhome/qcxia02/git-repo/AI-CONF/GeoMol/train.py", line 74, in <module>
        train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
      File "/pubhome/qcxia02/git-repo/AI-CONF/GeoMol/model/training.py", line 18, in train
        for i, data in tqdm(enumerate(loader), total=len(loader)):
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/tqdm/std.py", line 1178, in __iter__
        for obj in iterable:
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
        data = self._next_data()
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
        return self._process_data(data)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
        data.reraise()
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/_utils.py", line 434, in reraise
        raise exception
    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
        data = fetcher.fetch(index)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
        return self.collate_fn(data)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py", line 39, in __call__
        return self.collate(batch)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/loader/dataloader.py", line 20, in collate
        self.exclude_keys)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/batch.py", line 75, in from_data_list
        exclude_keys=exclude_keys,
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 86, in collate
        increment)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 142, in _collate
        data_list, stores, increment)
      File "/pubhome/qcxia02/miniconda3/envs/GeoMol/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 162, in _collate
        value = torch.cat(values, dim=cat_dim or 0)
    RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 19 but got size 21 for tensor number 1 in the list.
    

    To Reproduce

    `python train.py --data_dir data/QM9/qm9/ --split_path data/QM9/splits/split0.npy --log_dir ./test_run --n_epochs 250 --dataset qm9`
    

    Expected behavior

    Training completed smoothly without error

    Environments:

    The environment is based on the given environment.yml file; the torch-related versions are listed below.

    OS: CentOS Linux release 8.4.2105

    Package versions:

    • python=3.7.10
    • pytorch=1.10.0=py3.7_cpu_0
    • torchaudio=0.10.0=py37_cpu
    • torchvision=0.11.1=py37_cpu
    • pytorch-cluster=1.5.9=py37_torch_1.10.0_cpu
    • pytorch-mutex=1.0=cpu
    • pytorch-scatter=2.0.9=py37_torch_1.10.0_cpu
    • pytorch-sparse=0.6.12=py37_torch_1.10.0_cpu
    • pytorch-spline-conv=1.2.1=py37_torch_1.10.0_cpu
    • torch-geometric=2.0.2

    Additional context:

    This error was raised when the dataloader was enumerated during training, i.e. at for i, data in tqdm(enumerate(loader), total=len(loader)):. The "Expected size 19 but got size 21" error in torch.cat arises because it tried to concatenate tensor B (2nd molecule) with shape 10x21x3 onto tensor A (1st molecule) with shape 10x19x3 along dimension 0 (size 10), which requires the other dimensions (19 vs. 21) to match. I'm not sure whether this behavior is expected, or where to make modifications (if needed).
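
To make the shape rule concrete, here is a small plain-Python sketch of the check that concatenation performs: every dimension except the one being concatenated must match. The function concat_shape is a hypothetical helper written for illustration, not part of PyTorch or GeoMol, and it does not diagnose why the two molecules end up batched this way.

```python
def concat_shape(shapes, dim=0):
    """Return the shape produced by concatenating `shapes` along `dim`.

    Raises ValueError if any dimension other than `dim` differs, which is
    the same rule that the error message above describes.
    """
    first = shapes[0]
    total = 0
    for idx, shape in enumerate(shapes):
        if len(shape) != len(first):
            raise ValueError(f"tensor number {idx} has a different rank")
        for axis, (expected, got) in enumerate(zip(first, shape)):
            if axis != dim and expected != got:
                raise ValueError(
                    f"Sizes must match except in dimension {dim}: expected "
                    f"{expected} but got {got} for tensor number {idx}"
                )
        total += shape[dim]
    result = list(first)
    result[dim] = total
    return tuple(result)


shape_a = (10, 19, 3)  # 1st molecule: 19 atoms
shape_b = (10, 21, 3)  # 2nd molecule: 21 atoms

# Concatenating along the atom dimension works.
print(concat_shape([shape_a, shape_b], dim=1))

# Concatenating along dimension 0 fails because 19 != 21.
try:
    concat_shape([shape_a, shape_b], dim=0)
except ValueError as err:
    print(err)
```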

    Looking forward to your reply :)

    opened by qcxia20 2
  • Code Problem in permutations for symmetric hydrogens

    Hi, thanks for this great work and for releasing the code! While reproducing training, I encountered the following errors:

    1. In model/model.py, in GeoMol's assign_neighborhoods, I got:

    File "/home/dgxtest/3D-pretrain/GeoMol-main/model/model.py", line 180, in assign_neighborhoods
      self.leaf_hydrogens[a] = self.leaf_hydrogens[a] * True if self.leaf_hydrogens[a].sum() > 1 else self.leaf_hydrogens[a] * False
    RuntimeError: "mul_cuda" not implemented for 'Bool'

    I assumed this code was intended to perform an XNOR operation (I am less convinced now because of error 2), so I changed the logic to the following, which fixed the error:

    self.leaf_hydrogens[a] = ~(self.leaf_hydrogens[a] ^ True) if self.leaf_hydrogens[a].sum() > 1 else ~(self.leaf_hydrogens[a] ^ False)

    2. But the following error ensued:

    File "/home/dgxtest/3D-pretrain/GeoMol-main/model/model.py", line 332, in ground_truth_local_stats
      n_perms[0:len(perms), self.leaf_hydrogens[a]] = perms
    RuntimeError: shape mismatch: value tensor of shape [24, 4] cannot be broadcast to indexing result of shape [6, 4]

    In this case, self.leaf_hydrogens[a] is [True, True, True, True], which leads to 24 permutations in perms, while n_perms is hardcoded with shape [6, 4]. I am not sure whether my modification for error 1 leads to a wrong self.leaf_hydrogens in error 2. Could you please help me point it out? Much appreciated.

    By the way, I am using torch 1.7.0+cu110 and torch-geometric 1.6.3, as mentioned in issue #2.
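
One possible explanation, sketched here in plain Python rather than PyTorch: multiplying a boolean mask by True or False behaves like a logical AND, not an XNOR. With the XNOR rewrite, a mask with no leaf hydrogens is inverted to all True, and four selected positions give 4! = 24 permutations, which would match the [24, 4] shape in the second error. The function names are hypothetical, and this only reasons about the two expressions quoted above; it has not been checked against the repository. In PyTorch, the AND could presumably be expressed with torch.logical_and.

```python
from itertools import permutations


def keep_if_multiple(mask):
    """Analogue of `mask * True if mask.sum() > 1 else mask * False`.

    Multiplication by a boolean is a logical AND: the mask is kept when
    more than one entry is set, and cleared otherwise.
    """
    keep = sum(mask) > 1
    return [m and keep for m in mask]


def xnor_variant(mask):
    """Analogue of `~(mask ^ True) if mask.sum() > 1 else ~(mask ^ False)`."""
    flag = sum(mask) > 1
    return [not (m ^ flag) for m in mask]


no_leaf_h = [False, False, False, False]
two_leaf_h = [True, True, False, False]

# Both variants agree when more than one entry is set...
print(keep_if_multiple(two_leaf_h), xnor_variant(two_leaf_h))

# ...but differ otherwise: AND clears the mask, XNOR inverts it.
print(keep_if_multiple(no_leaf_h))
print(xnor_variant(no_leaf_h))

# Four selected positions yield 24 permutations, more than the 6 rows
# that n_perms provides.
n_selected = sum(xnor_variant(no_leaf_h))
n_permutations = len(list(permutations(range(n_selected))))
print(n_permutations)
```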

    opened by sunyuancheng 1
  • Question: On the GEOM Dataset availability

    Would the referenced dataset located at https://dataverse.harvard.edu/api/access/datafile/4327252 also be available to access under a F/OSS-Compliant license? And is it accessible by any other means or mirrors?

    opened by Daasin 1
  • Add memoization of dihedral_pairs instead of computing them each iteration

    Add memoization of dihedral_pairs in datasets such that they are only computed in the first epoch and then stored in memory and reused. This should speed up the code since computing the dihedral pairs previously took up 73% of the runtime in my experiments. Now, this overhead will only happen in the first epoch, and the additional memory usage is negligible.

    Naming the attribute of the PyTorch Geometric Data object edge_index_dihedral_pairs causes the dihedral_pairs to be treated as edge indices during batching, so PyTorch Geometric automatically increments their indices according to the graph sizes when creating a batch.
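
The memoization idea can be sketched generically as follows. This is not the code from the pull request: the class MemoizedPairsDataset and the toy compute function are hypothetical stand-ins for the dataset and for the dihedral-pair computation. It also assumes single-process loading; with multiple DataLoader workers, each worker holds its own copy of the dataset, so an in-memory cache like this may not carry over between epochs.

```python
class MemoizedPairsDataset:
    """Toy dataset that computes an expensive per-item result only once."""

    def __init__(self, items, compute_pairs):
        self.items = list(items)
        self.compute_pairs = compute_pairs
        self._pairs_cache = {}  # index -> previously computed result

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        # Compute on first access, then reuse the stored result.
        if idx not in self._pairs_cache:
            self._pairs_cache[idx] = self.compute_pairs(self.items[idx])
        return self.items[idx], self._pairs_cache[idx]


call_log = []


def toy_compute_pairs(item):
    """Stand-in for the expensive computation; records each call."""
    call_log.append(item)
    return [(item, item + 1)]


dataset = MemoizedPairsDataset([10, 20], toy_compute_pairs)

# Three passes over the data trigger the computation once per item.
for _epoch in range(3):
    for i in range(len(dataset)):
        dataset[i]

print(call_log)
```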

    opened by HannesStark 1
  • RuntimeError: Cannot re-initialize CUDA in forked subprocess (solved)

    Just in case anyone else has the same issue, I received the following error during training.

    Starting training...
      0%|                                                                         | 0/625 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "/home/grads/e/ethanycx/workspace/GeoMol/train.py", line 73, in <module>
        train_loss = train(model, train_loader, optimizer, device, scheduler, logger if args.verbose else None, epoch, writer)
      File "/home/grads/e/ethanycx/workspace/GeoMol/model/training.py", line 18, in train
        for i, data in tqdm(enumerate(loader), total=len(loader)):
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/tqdm/std.py", line 1178, in __iter__
        for obj in iterable:
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
        data = self._next_data()
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1085, in _next_data
        return self._process_data(data)
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1111, in _process_data
        data.reraise()
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/_utils.py", line 428, in reraise
        raise self.exc_type(msg)
    RuntimeError: Caught RuntimeError in DataLoader worker process 0.
    Original Traceback (most recent call last):
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 198, in _worker_loop
        data = fetcher.fetch(index)
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
        data = [self.dataset[idx] for idx in possibly_batched_index]
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch_geometric/data/dataset.py", line 187, in __getitem__
        data = self.get(self.indices()[idx])
      File "/home/grads/e/ethanycx/workspace/GeoMol/model/featurization.py", line 74, in get
        data.edge_index_dihedral_pairs = get_dihedral_pairs(data.edge_index, data=data)
      File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in get_dihedral_pairs
        keep = [t.to(device) for t in keep]
      File "/home/grads/e/ethanycx/workspace/GeoMol/model/utils.py", line 122, in <listcomp>
        keep = [t.to(device) for t in keep]
      File "/home/grads/e/ethanycx/miniconda3/envs/torch/lib/python3.9/site-packages/torch/cuda/__init__.py", line 163, in _lazy_init
        raise RuntimeError(
    RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
    

    Versions: torch==1.7.1, torch_geometric==1.7.0

    This seems to be a PyTorch issue with the DataLoader. I fixed it by inserting the following lines at line 18 in train.py (and indenting the later lines accordingly):

    if __name__ == '__main__':
        torch.multiprocessing.set_start_method('spawn', force=True)
    

    and changing line 240 in featurization.py to num_workers=1,.

    opened by ycremar 0
  • A question about the direction of alpha angle?

    Dear authors, thanks for your wonderful work.

    In the paper, the alpha angle is the sum of many different torsion angles around the rotatable bond X-Y. But when using this alpha angle to rotate fragments of the molecule, a question arises: is the alpha angle defined for rotation in the X->Y direction or in the Y->X direction?

    The figure below illustrates the problem. I rotate each of the fragments (LS of X) using the X<-Y direction: the bottom-left part is fine, but the top-left part is wrong. If I rotate the top-left part using the X->Y direction instead, it becomes correct again (which suggests that the alpha angle somehow has two directions). I have not been able to figure this out.

    In this example, X has a larger ID than Y. [image]

    opened by lkfo415579 0
  • Problems about loss computation

    Hi, great work! Could you please explain why the angle loss and the dihedral loss are subtracted (at the bottom of the code)? Thank you!

    `def batch_molecule_loss(self, true_stats, model_stats, ignore_neighbors):
        """
        Compute loss for one pair of model/true molecules

        :param true_stats: tuple of masked true stat tensors (len 5)
        :param model_stats: tuple of masked model stat tensors (len 5)
            one-hop: (n_neighborhoods, 4)
            two-hop: (n_neighborhoods, 4, 4)
            angle: (n_neighborhoods, 6)
            dihedral: (2, n_dihedral_pairs, 9)
            three-hop: (n_dihedral_pairs, 9)
        :return: molecular loss for the batch (n_batch)
        """
    
        # unpack stats
        model_one_hop, model_two_hop, model_angles, model_dihedrals, model_three_hop = model_stats
        true_one_hop, true_two_hop, true_angles, true_dihedrals, true_three_hop = true_stats
    
        # calculate losses
        one_hop_loss, two_hop_loss, angle_loss = self.local_loss(true_one_hop, true_two_hop, true_angles,
                                                                 model_one_hop, model_two_hop, model_angles)
        dihedral_loss, three_hop_loss = self.pair_loss(true_dihedrals, model_dihedrals, true_three_hop, model_three_hop)
    
        # writing
        self.one_hop_loss.append(one_hop_loss)
        self.two_hop_loss.append(two_hop_loss)
        self.angle_loss.append(angle_loss)
        self.dihedral_loss.append(dihedral_loss)
        self.three_hop_loss.append(three_hop_loss)
    
        if ignore_neighbors:
            return one_hop_loss + two_hop_loss - angle_loss
        else:
            return one_hop_loss + two_hop_loss - angle_loss + three_hop_loss - dihedral_loss`
    
    opened by psp3dcg 2
  • Expecting all tensors to be on same device, but found two device cuda:0 and cpu, when running the generate_confs.py

    Hello,

    I am facing an issue when trying to run generate_confs.py with the provided pretrained models: I run into the error shared below. Please share your insights on whether there is a preference between GPU and CPU for running inference.

    I also tried switching between cpu and gpu for the model, but no luck so far.

      0%|          | 0/1000 [02:14<?, ?it/s]
    Traceback (most recent call last):
      File "/home/vishwesh/Software/pycharm-community-2021.1.1/plugins/python-ce/helpers/pydev/pydevd.py", line 1483, in _exec
        pydev_imports.execfile(file, globals, locals)  # execute the script
      File "/home/vishwesh/Software/pycharm-community-2021.1.1/plugins/python-ce/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
        exec(compile(contents+"\n", file, 'exec'), glob, loc)
      File "/home/vishwesh/Code/geo_mol/GeoMol/generate_confs.py", line 63, in <module>
        model(data, inference=True, n_model_confs=n_confs*2)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 81, in forward
        self.generate_model_prediction(data.x, data.edge_index, data.edge_attr, data.batch, data.chiral_tag)
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 686, in generate_model_prediction
        x1, x2, h_mol = self.embed(x, edge_index, edge_attr, batch)
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/model.py", line 228, in embed
        x1, _ = self.gnn(x, edge_index, edge_attr)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/GNN.py", line 126, in forward
        x = self.node_init(x)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/vishwesh/Code/geo_mol/GeoMol/model/GNN.py", line 40, in forward
        x = self.layers[i](x)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 96, in forward
        return F.linear(input, self.weight, self.bias)
      File "/home/vishwesh/anaconda3/envs/geomol_v2/lib/python3.8/site-packages/torch/nn/functional.py", line 1847, in linear
        return torch._C._nn.linear(input, weight, bias)
    RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking arugment for argument mat2 in method wrapper_mm)
    
    Process finished with exit code 1
    
    
    opened by finalelement 3
  • Questions about stereoisomer issues in the evaluation of GeoMol

    https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/scripts/compare_confs.py#L49-L56

    • This function is used to filter out conformers whose SMILES are inconsistent with the given SMILES (in this script, corrected_smi). In my reproduction, most of the inconsistent cases are molecules with a Z/E double bond. These cases are not filtered out if isomericSmiles=False, which confuses me; I am not sure whether this is a mistake.
    • For example, reference conformers with SMILES Cc1cc(C(=O)c2cnc(/N=C/N(C)C)s2)c(F)cc1Cl and Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl are all kept for comparison, although GeoMol was used to generate conformers only for Cc1cc(C(=O)c2cnc(/N=C\N(C)C)s2)c(F)cc1Cl.

    https://github.com/PattanaikL/GeoMol/blob/5d0e85014a9546209d5b43861638caabb362ec25/model/featurization.py#L125-L126

    • In contrast, the code in model/featurization.py filters out conformers whose SMILES are inconsistent with the SMILES in the dataset.
    • So if I use compare_confs.py to calculate the performance with isomericSmiles=False, conformers with different isomeric SMILES are not filtered out, and the performance is the same as or even worse than before (since GeoMol generates only one stereoisomer based on the given SMILES).
    • Performance comparison between GeoMol predictions and the reference data (before using clean_confs; after using clean_confs; with isomericSmiles=True):
    **Before**
    Recall Coverage: Mean = 74.78, Median = 85.00
    Recall AMR: Mean = 0.9471, Median = 0.9176
    Precision Coverage: Mean = 71.84, Median = 87.50
    Precision AMR: Mean = 1.0035, Median = 0.9649
    
    **After (with clean_confs, more confs are included than before)**
    Recall Coverage: Mean = 74.30, Median = 90.00
    Recall AMR: Mean = 0.9489, Median = 0.8797
    Precision Coverage: Mean = 65.50, Median = 81.80
    Precision AMR: Mean = 1.1044, Median = 1.0041
    
    **isomericSmiles=True**
    Recall Coverage: Mean = 83.38, Median = 100.00
    Recall AMR: Mean = 0.8233, Median = 0.8079
    Precision Coverage: Mean = 72.73, Median = 87.50
    Precision AMR: Mean = 0.9833, Median = 0.8895
    

    As you can see, if isomericSmiles=True, the performance in GeoMol paper's result can be reproduced.


    Looking further into this issue, I found another odd thing: GeoMol generates conformers that are close in 3D geometry even when the input SMILES specify different stereoisomers. In other words, conformers that are close in 3D geometry correspond to different stereoisomers in their SMILES. This issue does not occur with RDKit ETKDG, and I am not sure whether it affects GeoMol's performance on these molecules. Here are two examples:

    | SMILES | GeoMol (trans) | GeoMol (cis) | ETKDG (trans) | ETKDG (cis) |
    |--|--|--|--|--|
    | O=S(=O)(_N=C(_c1ccccc1)N1CCOCC1)c1ccc(Br)cc1 | image | image | image | image |
    | Cc1cc(C(=O)c2cnc(_N=C_N(C)C)s2)c(F)cc1Cl | image | image | image | image |

    opened by qcxia20 0
  • getting errors in training and while inferencing the model

    1. I created a new environment using your .sh file and ran the training script with the same datasets, but I get this error: RuntimeError: Sizes of tensors must match except in dimension 0. Expected size 21 but got size 19 for tensor number 1 in the list.

    2. While running your generate_confs script, I get the following error at the line data = Batch.from_data_list(data_list=[tg_data]): TypeError: argument of type 'int' is not iterable

    3. If I pass the data directly to the model, bypassing the line above, then at the line model(tg_data, inference=True, n_model_confs=n_confs*2) I get this error: AttributeError: 'GlobalStorage' object has no attribute 'bincount'. Note: when passing the data directly to the model, I changed n_atoms_per_mol = data.batch.bincount() to n_atoms_per_mol = data.bincount() in the get_neighbor_ids function of the model.utils script. If I do not change this line, the error instead says that NoneType has no attribute bincount.


    opened by uttu-parashar 10