TorchDrug is a PyTorch-based machine learning toolbox designed for drug discovery

Overview


Docs | Tutorials | Benchmarks | Papers Implemented

TorchDrug is a PyTorch-based machine learning toolbox designed for several purposes.

  • Easy implementation of graph operations in a PyTorchic style with GPU support
  • Being friendly to practitioners with minimal knowledge about drug discovery
  • Rapid prototyping of machine learning research

Installation

TorchDrug is compatible with Python >= 3.5 and PyTorch >= 1.4.0.

From Conda

conda install -c milagraph -c conda-forge torchdrug

From Source

TorchDrug depends on rdkit, which is only available via conda. You can install rdkit with the following line.

conda install -c conda-forge rdkit
git clone https://github.com/DeepGraphLearning/torchdrug
cd torchdrug
pip install -r requirements.txt
python setup.py install

Quick Start

TorchDrug is designed for humans and focused on graph-structured data. It enables easy implementation of graph operations in machine learning models. All the operations in TorchDrug are backed by the PyTorch framework and support GPU acceleration and automatic differentiation.

from torchdrug import data

edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
graph = data.Graph(edge_list, num_node=6)
graph = graph.cuda()
# the subgraph induced by nodes 2, 3 & 4
subgraph = graph.subgraph([2, 3, 4])

Molecules are also supported in TorchDrug. You can get the desired molecule properties without any domain knowledge.

mol = data.Molecule.from_smiles("CCOC(=O)N", node_feature="default", edge_feature="default")
print(mol.node_feature)
print(mol.atom_type)
print(mol.to_scaffold())

You may also register custom node, edge or graph attributes. They will be automatically processed during indexing operations.

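import torchdrug as td

with mol.edge():
    # an edge is a C-C bond when both of its endpoint atoms are carbon
    mol.is_CC_bond = (mol.atom_type[mol.edge_list[:, :2]] == td.CARBON).all(dim=-1)
sub_mol = mol.subgraph(mol.atom_type != td.NITROGEN)
print(sub_mol.is_CC_bond)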

TorchDrug provides a wide range of common datasets and building blocks for drug discovery. With minimal code, you can apply standard models to solve your own problem.

import torch
from torchdrug import datasets

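# the dataset constructor takes the directory where the data is stored
dataset = datasets.Tox21("~/molecule-datasets/")
# each sample is a dict; the molecule is stored under the "graph" key
dataset[0]["graph"].visualize()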
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

from torchdrug import models, tasks

model = models.GIN(dataset.node_feature_dim, hidden_dims=[256, 256, 256, 256])
task = tasks.PropertyPrediction(model, task=dataset.tasks)
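
The split in the snippet above is written so that the three subset sizes always add up to the dataset size, even when 80% and 10% of it are not whole numbers. A minimal pure-Python sketch of the same arithmetic follows; the helper name `split_lengths` is made up for illustration and is not part of TorchDrug.

```python
def split_lengths(num_samples, fractions=(0.8, 0.1)):
    """Return train/valid/test sizes that sum exactly to num_samples."""
    # Round the first two parts down, as int() does in the snippet above.
    lengths = [int(f * num_samples) for f in fractions]
    # Whatever is left over goes to the last part.
    lengths.append(num_samples - sum(lengths))
    return lengths


print(split_lengths(1005))
```

Because the last size is computed as a remainder, the result can be passed directly to `torch.utils.data.random_split`, which expects the lengths to sum to the dataset length.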

Training and inference are accelerated by multiple CPUs or GPUs. This can be seamlessly switched in TorchDrug by just a line of code.

from torchdrug import core

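# the engine also needs an optimizer for the task parameters
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
# CPU
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=None)
# Single GPU
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0])
# Multiple GPUs
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0, 1, 2, 3])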

Contributing

Everyone is welcome to contribute to the development of TorchDrug. Please refer to the contributing guidelines for more details.

License

TorchDrug is released under the Apache-2.0 License.

Comments
  • [Bug] AttributeError: can't set attribute

    In the retrosynthesis tutorial, this synthon completion code

    synthon_optimizer = torch.optim.Adam(synthon_task.parameters(), lr=1e-3)
    synthon_solver = core.Engine(synthon_task, synthon_train, synthon_valid,
                                 synthon_test, synthon_optimizer,
                                 gpus=[0], batch_size=128)
    synthon_solver.train(num_epoch=1)
    synthon_solver.evaluate("valid")
    synthon_solver.save("g2gs_synthon_model.pth")
    

    gives the error below

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    Input In [13], in <cell line: 11>()
          7 synthon_optimizer = torch.optim.Adam(synthon_task.parameters(), lr=1e-3)
          8 synthon_solver = core.Engine(synthon_task, synthon_train, synthon_valid,
          9                              synthon_test, synthon_optimizer,
         10                              gpus=[0], batch_size=128)
    ---> 11 synthon_solver.train(num_epoch=1)
         12 synthon_solver.evaluate("valid")
         13 synthon_solver.save("g2gs_synthon_model.pth")
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/core/engine.py:155, in Engine.train(self, num_epoch, batch_per_epoch)
        152 if self.device.type == "cuda":
        153     batch = utils.cuda(batch, device=self.device)
    --> 155 loss, metric = model(batch)
        156 if not loss.requires_grad:
        157     raise RuntimeError("Loss doesn't require grad. Did you define any loss in the task?")
    
    File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
       1126 # If we don't have any hooks, we want to skip the rest of the logic in
       1127 # this function, and just call forward.
       1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
       1129         or _global_forward_hooks or _global_forward_pre_hooks):
    -> 1130     return forward_call(*input, **kwargs)
       1131 # Do not call functions when jit is used
       1132 full_backward_hooks, non_full_backward_hooks = [], []
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:592, in SynthonCompletion.forward(self, batch)
        589 all_loss = torch.tensor(0, dtype=torch.float32, device=self.device)
        590 metric = {}
    --> 592 pred, target = self.predict_and_target(batch, all_loss, metric)
        593 node_in_pred, node_out_pred, bond_pred, stop_pred = pred
        594 node_in_target, node_out_target, bond_target, stop_target, size = target
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:984, in SynthonCompletion.predict_and_target(self, batch, all_loss, metric)
        981 with reactant.graph():
        982     reactant.reaction = batch["reaction"]
    --> 984 graph1, node_in_target1, node_out_target1, bond_target1, stop_target1 = self.all_edge(reactant, synthon)
        985 graph2, node_in_target2, node_out_target2, bond_target2, stop_target2 = self.all_stop(reactant, synthon)
        987 graph = self._cat([graph1, graph2])
    
    File /usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
         24 @functools.wraps(func)
         25 def decorate_context(*args, **kwargs):
         26     with self.clone():
    ---> 27         return func(*args, **kwargs)
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:557, in SynthonCompletion.all_edge(self, reactant, synthon)
        555 assert (graph.num_edges % 2 == 0).all()
        556 # node / edge features may change because we mask some nodes / edges
    --> 557 graph, feature_valid = self._update_molecule_feature(graph)
        559 return graph[feature_valid], node_in_target[feature_valid], node_out_target[feature_valid], \
        560        bond_target[feature_valid], stop_target[feature_valid]
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:398, in SynthonCompletion._update_molecule_feature(self, graphs)
        395 bond_type[edge_mask] = new_graphs.bond_type.to(device=graphs.device)
        397 with graphs.node():
    --> 398     graphs.node_feature = node_feature
        399 with graphs.edge():
        400     graphs.edge_feature = edge_feature
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/data/graph.py:160, in Graph.__setattr__(self, key, value)
        158 if hasattr(self, "meta_dict"):
        159     self._check_attribute(key, value)
    --> 160 super(Graph, self).__setattr__(key, value)
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/core/core.py:84, in _MetaContainer.__setattr__(self, key, value)
         82     if types:
         83         self.meta_dict[key] = types.copy()
    ---> 84 self._setattr(key, value)
    
    File /usr/local/lib/python3.9/dist-packages/torchdrug/core/core.py:93, in _MetaContainer._setattr(self, key, value)
         92 def _setattr(self, key, value):
    ---> 93     return super(_MetaContainer, self).__setattr__(key, value)
    
    AttributeError: can't set attribute
    
    opened by bhadreshpsavani 19
  • Customized target for retrosynthesis

    Hi, thanks for sharing this repo!

    I am wondering how I could input an arbitrary target/product for retrosynthesis analysis. What target format does the model require besides SMILES? In the notebook, prediction is performed on the USPTO dataset. I am interested in knowing how I could apply this model to targets outside of USPTO.

    Thanks!!

    enhancement 
    opened by juliachen123 11
  • How to evaluate molecule generation models?

    Hi torchdrug team, thank you for the awesome project! I am playing with molecule generation models, and am interested in trying to reproduce the benchmarks posted here: https://torchdrug.ai/docs/benchmark/generation.html

    I am able to follow the tutorial for molecule generation: https://torchdrug.ai/docs/tutorials/generation.html

    But I found that there was no mention of how we can evaluate models once they are fully trained. Is there any evaluator class or oracle that can be called to obtain the metrics as in your benchmark?

    Additionally, do you have any advice on how to set the hyperparameters to fairly reproduce/compare to the GCPN or GraphAF papers?

    opened by chaitjo 10
  • Redesign of the meter logger and integration of the Weights and Biases logger

    Feature

    This pull request consists of a redesign of the meter logger. There is a new abstract BaseLogger class in torchdrug.utils.loggers.base_logger. The update and log functions of the core.Meter class are now implemented as methods of this class and are called inside the core.Meter methods.

    For a new logger, there are 2 abstract methods in BaseLogger: log and save_hyperparams. These two have to be defined for a new custom logger.

    The ConsoleLogger is a child of the BaseLogger class and performs the logging as currently done in TorchDrug. The ConsoleLogger is always on, regardless of whether another logger is in use.

    Similarly, the WandbLogger is also a child of BaseLogger and logs all the metrics to the user's W&B account. The wandb package is not a dependency and if a user tries to use the WandbLogger without installing wandb, they are prompted to install it.
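
The design described above (an abstract base with two required methods, and concrete loggers that subclass it) can be sketched with Python's `abc` module. This is only an illustration of the pattern, not the code from the pull request: the method signatures (`record`, `step_id`, `category`, `config`) and the `ListLogger` class are assumptions made for the example.

```python
from abc import ABC, abstractmethod


class BaseLogger(ABC):
    """Hypothetical stand-in for the abstract logger described above."""

    @abstractmethod
    def log(self, record, step_id, category="train/batch"):
        """Record a dict of metric values for one step."""

    @abstractmethod
    def save_hyperparams(self, config):
        """Store the experiment configuration."""


class ListLogger(BaseLogger):
    """Toy logger that keeps everything in memory (illustration only)."""

    def __init__(self):
        self.records = []
        self.config = None

    def log(self, record, step_id, category="train/batch"):
        self.records.append((category, step_id, dict(record)))

    def save_hyperparams(self, config):
        self.config = dict(config)


logger = ListLogger()
logger.save_hyperparams({"lr": 1e-3})
logger.log({"loss": 0.5}, step_id=0)
print(logger.records)
```

A subclass that leaves out either abstract method cannot be instantiated, which is one way to enforce that every custom logger defines both.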

    The constructor of core.Engine is updated to take two more optional arguments:

    • metric_logger: This can be a str or an instance of the custom logger. The accepted str arguments currently are 'wandb' and 'console' (default value is 'console')
    • project: The name of the W&B project the user wants to log to (default value is None)

    Example:

    engine = core.Engine(...,  metric_logger='wandb', project='PropertyPrediction')
    

    or

     from torchdrug.utils.logger.wandb_logger import WandbLogger
     wandb_logger = WandbLogger(project="PropertyPrediction", name="Toxicity Prediction", save_dir="./ClinTox", log_interval=10)
     engine = core.Engine(..., metric_logger=wandb_logger)
    

    A couple of runs for different tasks logged to W&B

    • https://wandb.ai/manan-goel/TorchDrug-Generation
    • https://wandb.ai/manan-goel/TorchDrug-Pretrain
    • https://wandb.ai/manan-goel/TorchDrug-PropertyPrediction
    opened by manangoel99 7
  • `num_relation` mismatches in `message_and_aggregate()`

    I would like to use my custom data https://raw.githubusercontent.com/goga0001/graph/main/data.csv I prepared the data as CSV file and followed the implementation of existing datasets:

    import os
    from torchdrug import data, utils
    from torchdrug.core import Registry as R
    from collections import defaultdict
    from torch.utils import data as torch_data
    from torchdrug import data
    from torchdrug.utils import doc
    
    
    @R.register("datasets.Flavonoid2")
    @doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))
    class Flavonoid2(data.MoleculeDataset):
        """
        Subset of Flavonoid compound database for virtual screening.
    
        Statistics:
            - #Molecule:  4806
            - #Regression task: 2
    
        Parameters:
            path (str): path to store the dataset
            verbose (int, optional): output verbose level
            **kwargs
        """
    
        csv_file = "/content/torchdrug/torchdrug/datasets/data.csv"
        target_fields = ["logP","qed"]
    
        def __init__(self, path, verbose=1, **kwargs):
            self.load_csv(self.csv_file, smiles_field="smiles", target_fields=self.target_fields,
                          verbose=verbose, **kwargs)
    

    Molecules were constructed from SMILES, but I get an assertion error:

    from torch import nn, optim
    optimizer = optim.Adam(task.parameters(), lr = 1e-3)
    solver = core.Engine(task, dataset, None, None, optimizer,
                         gpus=(0,), batch_size=128, log_interval=10)
    
    solver.train(num_epoch=1)
    solver.save("graphaf_flavonoid_1epoch.pkl")
    
    AssertionError                            Traceback (most recent call last)
    <ipython-input-23-a6a027c50b11> in <module>
          4                      gpus=(0,), batch_size=128, log_interval=10)
          5 
    ----> 6 solver.train(num_epoch=1)
          7 solver.save("graphaf_flavonoid_1epoch.pkl")
    
    10 frames
    /content/torchdrug/torchdrug/layers/conv.py in message_and_aggregate(self, graph, input)
        414 
        415     def message_and_aggregate(self, graph, input):
    --> 416         assert graph.num_relation == self.num_relation
        417 
        418         node_in, node_out, relation = graph.edge_list.t()
    
    AssertionError:
    

    Thank you! Looking forward to your reply!

    opened by goga0001 6
  • An error:TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

    There is an error when I try to run the following test code.

    from torchdrug import data

    edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
    graph = data.Graph(edge_list, num_node=6)
    graph = graph.cuda()
    # the subgraph induced by nodes 2, 3 & 4
    subgraph = graph.subgraph([2, 3, 4])

    The error is: TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

    How can I fix it?

    compatibility 
    opened by JianBin-Liu 5
  • The error when run tutorial of retrosynthesis

    Dear everyone,

    I have installed torchdrug correctly and then followed the tutorial https://torchdrug.ai/docs/tutorials/retrosynthesis.html. When I run the code below:

    from torchdrug import datasets
    
    reaction_dataset = datasets.USPTO50k("D:/test/molecule-datasets/",
                                         node_feature="reaction_reaction_identification",
                                         kekulize=True)
    synthon_dataset = datasets.USPTO50k("D:/test/molecule-dataset/", as_synthon=True,
                                        node_feature="synthon_completion",
                                        kekulize=True)
    

    The following error occurs:

    Loading D:/test/molecule-datasets/data_processed.csv: 100%|██████████| 50017/50017 [00:00<00:00, 92358.37it/s]
    Constructing molecules from SMILES:   0%|          | 0/50016 [00:00<?, ?it/s]
    Traceback (most recent call last):
      File "F:/workdir/pycharm/Retrosynthesis/main.py", line 5, in <module>
        kekulize=True)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\decorator.py", line 232, in fun
        return caller(func, *(extras + args), **kw)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\core\core.py", line 282, in wrapper
        return init(self, *args, **kwargs)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\datasets\uspto50k.py", line 63, in __init__
        **kwargs)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\dataset.py", line 112, in load_csv
        self.load_smiles(smiles, targets, verbose=verbose, **kwargs)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\dataset.py", line 232, in load_smiles
        mol = data.Molecule.from_molecule(mol, **kwargs)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\molecule.py", line 183, in from_molecule
        func = R.get("features.atom.%s" % name)
      File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\core\core.py", line 208, in get
        raise KeyError("Can't find `%s` in `%s`" % (key, ".".join(keys[:i])))
    KeyError: "Can't find `reaction_reaction_identification` in `features.atom`"
    

    What is the problem? Could you help me solve it? Thanks.

    documentation 
    opened by Drlittlelab 5
  • TorchDrug can't use Lr_Scheduler

    Hey, I found a bug: after loading the related TorchDrug modules, I can't use torch.optim.lr_scheduler. The picture below comes from the TorchDrug Colab notebook (Property Prediction). I added an lr_scheduler for the optimizer, and it throws an error.

    image

    However, When I don't load any TorchDrug modules, I can use the optimizer normally.

    opened by Mrz-zz 4
  • ValueError: Fail to parse the docstring of `Smol`. Inconsistent number of parameters in signature and docstring.

    Trying to build a customized dataset as follows for the molecular generation task.

    @R.register("datasets.Smol")
    
    @doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))
    
    class Smol(data.MoleculeDataset):
    
      smiles_file = "/content/drive/MyDrive/molecule_design/resources/smiles_train.csv"
      target_fields = ["SPLIT"]
    
      def __init__(self, smiles_file, verbose=1, **kwargs):
        self.load_csv(self.smiles_file, smiles_field="smiles", target_fields=self.target_fields,lazy=True,
                          verbose=verbose, **kwargs)
        
      def split(self):
        indexes = defaultdict(list)
        for i, split in enumerate(self.targets["SPLIT"]):
            indexes[split].append(i)
        train_set = torch_data.Subset(self, indexes["train"])
        valid_set = torch_data.Subset(self, indexes["valid"])
        test_set = torch_data.Subset(self, indexes["test"])
        return train_set, valid_set, test_set
    

    but get the following error:

    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-29-ada51874abcc> in <module>()
          3 @doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))
          4 
    ----> 5 class Smol(data.MoleculeDataset):
          6 
          7   smiles_file = "/content/drive/MyDrive/molecule_design/resources/smiles_train.csv"
    
    /usr/local/lib/python3.7/dist-packages/torchdrug/utils/doc.py in wrapper(obj)
         90         if len(docs) != len(parameters):
         91             raise ValueError("Fail to parse the docstring of `%s`. "
    ---> 92                              "Inconsistent number of parameters in signature and docstring." % obj.__name__)
         93         new_params = []
         94         new_docs = []
    
    ValueError: Fail to parse the docstring of `Smol`. Inconsistent number of parameters in signature and docstring.
    

    Did I miss something?
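
    The doc.py lines in the traceback compare the number of documented parameters with the number of parameters in the signature. The sketch below is a simplified stand-in for that kind of check, not TorchDrug's actual parser; the helper names and the two toy classes are made up for illustration. It shows how a class without a "Parameters:" section in its docstring ends up with zero documented parameters against three in its `__init__` signature, which may be what happens with the class above.

```python
import inspect
import re


def documented_params(doc):
    """Names listed under a 'Parameters:' heading, one per line."""
    if not doc:
        return []
    match = re.search(r"Parameters:\n(.*)", doc, flags=re.S)
    if match is None:
        return []
    names = []
    for line in match.group(1).splitlines():
        entry = re.match(r"\s*(\*{0,2}\w+)", line)
        if entry:
            names.append(entry.group(1).lstrip("*"))
    return names


def signature_params(cls):
    """Parameter names of cls.__init__, without self."""
    params = inspect.signature(cls.__init__).parameters
    return [name for name in params if name != "self"]


class Undocumented:
    def __init__(self, path, verbose=1, **kwargs):
        pass


class Documented:
    """
    Toy dataset.

    Parameters:
        path (str): path to store the dataset
        verbose (int, optional): output verbose level
        **kwargs
    """

    def __init__(self, path, verbose=1, **kwargs):
        pass


for cls in (Undocumented, Documented):
    same = len(documented_params(cls.__doc__)) == len(signature_params(cls))
    print(cls.__name__, "consistent" if same else "inconsistent")
```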

    opened by CaiYitao 4
  • Question about the negative example of KnowledgeGraphCompletion Class

    Hi,

    In the _strict_negative method of KnowledgeGraphCompletion, if 'A-->B' and 'B-->C' (A, B and C are entities, --> is the relation) are samples in the training set (i.e. self.fact_graph) while 'A-->C' is a sample in the validation set, then I think 'A-->C' will be regarded as a negative sample during training. Is that a problem?

    @torch.no_grad()
    def _strict_negative(self, pos_h_index, pos_t_index, pos_r_index):
        batch_size = len(pos_h_index)
        any = -torch.ones_like(pos_h_index)
    
        pattern = torch.stack([pos_h_index, any, pos_r_index], dim=-1)
        pattern = pattern[:batch_size // 2]
    
        # ==================== Code I Talk About ======================
        edge_index, num_t_truth = self.fact_graph.match(pattern)
        t_truth_index = self.fact_graph.edge_list[edge_index, 1]
        pos_index = functional._size_to_index(num_t_truth)
        t_mask = torch.ones(len(pattern), self.num_entity, dtype=torch.bool, device=self.device)
        t_mask[pos_index, t_truth_index] = 0
        neg_t_candidate = t_mask.nonzero()[:, 1]
        num_t_candidate = t_mask.sum(dim=-1)
        neg_t_index = functional.variadic_sample(neg_t_candidate, num_t_candidate, self.num_negative)
        # =======================================================
    
        pattern = torch.stack([any, pos_t_index, pos_r_index], dim=-1)
        pattern = pattern[batch_size // 2:]
        edge_index, num_h_truth = self.fact_graph.match(pattern)
        h_truth_index = self.fact_graph.edge_list[edge_index, 0]
        pos_index = functional._size_to_index(num_h_truth)
        h_mask = torch.ones(len(pattern), self.num_entity, dtype=torch.bool, device=self.device)
        h_mask[pos_index, h_truth_index] = 0
        neg_h_candidate = h_mask.nonzero()[:, 1]
        num_h_candidate = h_mask.sum(dim=-1)
        neg_h_index = functional.variadic_sample(neg_h_candidate, num_h_candidate, self.num_negative)
    
        neg_index = torch.cat([neg_t_index, neg_h_index])
    
        return neg_index
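
    To make the question concrete, here is a small pure-Python sketch of the masking idea in the marked block, under the assumption that only triples in the training fact graph are excluded from the candidates. The function name and the toy entities are made up for illustration; this is not TorchDrug code.

```python
def strict_negative_tails(fact_triples, head, relation, num_entity):
    """Candidate negative tails for the query (head, relation, ?).

    Only tails that appear with (head, relation) in fact_triples are excluded.
    """
    known_tails = {t for h, t, r in fact_triples if h == head and r == relation}
    return [t for t in range(num_entity) if t not in known_tails]


# Entities: A=0, B=1, C=2; a single relation with id 0.
# Triples follow the (head, tail, relation) order used in the code above.
train_facts = [(0, 1, 0), (1, 2, 0)]   # A-->B, B-->C
valid_facts = [(0, 2, 0)]              # A-->C, not visible during training

candidates = strict_negative_tails(train_facts, head=0, relation=0, num_entity=3)
print(candidates)
```

    In this toy case the tail of the validation triple A-->C stays among the candidates, which is the situation the question describes.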
    
    
    opened by AlexHex7 4
  • Conflict with torch due to overwritten modules

    I'm interested to understand why it is necessary to overwrite the default nn.Module of torch in patch.py:

    https://github.com/DeepGraphLearning/torchdrug/blob/eeee19181572ef5b8a806b71bdd4d2d1a4e27f67/torchdrug/patch.py#L125

    This seems to be a quite invasive thing since it alters the behavior of any torch.nn module after torchdrug has been imported.

    For example, your implementation of register_buffer in patch.py lacks the keyword argument persistent which is present in native torch: https://github.com/pytorch/pytorch/blob/989b24855efe0a8287954040c89d679625dcabe1/torch/nn/modules/module.py#L277

    I would greatly appreciate if you could please let me know how I can fall back to the native torch behavior after having imported torchdrug somewhere above in my code.

    help wanted 
    opened by jannisborn 4
  • CPU vs GPU

    I ran into a weird obstacle: after running the same code for the retrosynthesis prediction task on GPU and CPU (perhaps only the versions of certain libraries differed), I got significantly different results. On GPU the accuracy is much higher. Do you know the reason for this? As far as I understand, even if the results differ, the difference should be pretty small.

    opened by DimGorr 0
  • How to use the generation model to generate specific molecules?

    Hello,

    I was wondering how can I use the generation model to generate specific molecules? For example, I have a small dataset of molecules I am interested in generating, should I use ZINC250k dataset to train GraphAF model on and then use property optimization to generate novel molecules with desired QED, logP properties or should I use my small dataset(around 4k) to train the GraphAF model?

    Thank you kindly,

    Looking forward to your reply

    opened by goga0001 0
  • num_relation mismatches in message_and_aggregate()

    There was another issue on this that was closed but there wasn't really a resolution. The problem occurs when I use a custom dataset. The dataset loads correctly:

    import torch
    from torchdrug import datasets

    dataset = datasets.flav("~/molecule-datasets/", kekulize=True,
                            atom_feature="symbol")

    18:01:53   Downloading https://raw.githubusercontent.com/gdeol4/torchdrug/master/flav.csv to /root/molecule-datasets/flav.csv
    Loading /root/molecule-datasets/flav.csv: 4807it [00:00, 70415.81it/s]
    Constructing molecules from SMILES: 100%|██████████| 4806/4806 [00:10<00:00, 473.69it/s]
    
    

    However, when attempting to train a model, I encounter the assertion error:

    solver.train(num_epoch=1)
    solver.save("graphaf_data_1epoch.pkl")
    
    File /notebooks/torchdrug/torchdrug/layers/conv.py:416, in RelationalGraphConv.message_and_aggregate(self, graph, input)
        415 def message_and_aggregate(self, graph, input):
    --> 416     assert graph.num_relation == self.num_relation
        418     node_in, node_out, relation = graph.edge_list.t()
        419     node_out = node_out * self.num_relation + relation
    
    AssertionError: 
    

    There does seem to be a mismatch here:

    dataset.num_bond_type
    
    2
    
    model.layers[0].num_relation
    
    2
    
    dataset[0]["graph"].num_relation
    
    tensor(3)
    

    My attempt at fixing it:

    for data in dataset:
      data['graph'].num_relation = torch.tensor(2)
    

    But the value remains unchanged

    opened by gdeol4 0
  • The accuracy of retrosynthesis are different from the paper

    Hello, thanks for sharing this library! The results at https://torchdrug.ai/docs/tutorials/retrosynthesis.html differ from those in the G2Gs paper. For the setting where the reaction class is unknown, the paper reports:

    • top-1 accuracy: 0.489
    • top-3 accuracy: 0.676
    • top-5 accuracy: 0.725
    • top-10 accuracy: 0.755

    The tutorial at https://torchdrug.ai/docs/tutorials/retrosynthesis.html reports:

    • top-1 accuracy: 0.47541
    • top-3 accuracy: 0.741803
    • top-5 accuracy: 0.827869
    • top-10 accuracy: 0.879098

    I cannot understand why the results for k > 2 are higher than those reported in the literature. Thank you very much.

    duplicate 
    opened by z15544534 2
  • ImportError: No module named 'embedding'

    Below is my code

    import torch
    from torchdrug import core, datasets, tasks, models
    from torchdrug.models import RotatE
    
    import matplotlib
    matplotlib.use('TkAgg')
    import matplotlib.pyplot as plt
    
    dataset = datasets.FB15k237("~/kg-datasets/")
    train_set, valid_set, test_set = dataset.split()
    
    
    model: RotatE = models.RotatE(num_entity=dataset.num_entity,
                          num_relation=dataset.num_relation,
                          embedding_dim=2048, max_score=9)
    
    task = tasks.KnowledgeGraphCompletion(model, num_negative=256,
                                          adversarial_temperature=1)
    
    optimizer = torch.optim.Adam(task.parameters(), lr=2e-5)
    solver= core.Engine(task, train_set, valid_set, test_set, optimizer,
                         gpus=[0], batch_size=1024)
    solver.train(num_epoch=100)
    solver.evaluate("valid")
    

    Below is the error:

    Traceback (most recent call last):
      File "C:\Users\lenovo\PycharmProjects\pythonProject2\main.py", line 23, in <module>
        solver.train(num_epoch=100)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\core\engine.py", line 155, in train
        loss, metric = model(batch)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\tasks\reasoning.py", line 85, in forward
        pred = self.predict(batch, all_loss, metric)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\tasks\reasoning.py", line 160, in predict
        pred = self.model(self.fact_graph, h_index, t_index, r_index, all_loss=all_loss, metric=metric)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
        result = self.forward(*input, **kwargs)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\models\embedding.py", line 191, in forward
        score = functional.rotate_score(self.entity, self.relation * self.relation_scale,
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\layers\functional\embedding.py", line 266, in rotate_score
        score = RotatEFunction.apply(entity, relation, h_index, t_index, r_index)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\layers\functional\embedding.py", line 108, in forward
        forward = embedding.rotate_forward_cuda
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\torch.py", line 27, in __getattr__
        return getattr(self.module, key)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\decorator.py", line 102, in __get__
        result = self.func(obj)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\torch.py", line 31, in module
        return cpp_extension.load(self.name, self.sources, self.extra_cflags, self.extra_cuda_cflags,
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1079, in load
        return _jit_compile(
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1317, in _jit_compile
        return _import_module_from_library(name, build_directory, is_python_module)
      File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1699, in _import_module_from_library
        file, path, description = imp.find_module(module_name, [path])
      File "C:\Users\lenovo\.conda\evns\td2\lib\imp.py", line 296, in find_module
        raise ImportError(_ERR_MSG.format(name), name=name)
    ImportError: No module named 'embedding'
    

    I've done some research but couldn't figure out why. Can anyone help me here?

    opened by iamme1234567 1
  • Error when trying to use the MPNN implementation on QM9

    python = 3.9 torch = 1.13 torchdrug = 0.2.0.post1 torchscatter = 2.1.0 torch cluster = 1.6.0

    Code to reproduce error:

    import torch
    import pickle
    from torchdrug import datasets
    from torchdrug import core, models, tasks

    #dataset = datasets.QM9("~/molecule-datasets/",node_position=True)
    #with open("QM9.pkl", "wb") as fout:
    #    pickle.dump(dataset, fout)
    #exit()
    with open("QM9.pkl", "rb") as fin:
        dataset = pickle.load(fin)

    lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
    lengths += [len(dataset) - sum(lengths)]
    train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

    model = models.MPNN(input_dim=dataset.node_feature_dim, hidden_dim=256,
                        edge_input_dim=dataset.edge_feature_dim, num_layer=1,
                        num_gru_layer=1, num_mlp_layer=2, num_s2s_step=3,
                        short_cut=False, batch_norm=False, activation='relu',
                        concat_hidden=False)

    task = tasks.PropertyPrediction(model, task=dataset.tasks)

    optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
    solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0], batch_size=32)
    solver.train(num_epoch=1)
    #solver.evaluate("valid")

    Error:

      File "/home/nhattrup/deep_learning/final_proj/example.py", line 35, in <module>
        solver.train(num_epoch=1)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/core/engine.py", line 155, in train
        loss, metric = model(batch)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/tasks/property_prediction.py", line 96, in forward
        pred = self.predict(batch, all_loss, metric)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/tasks/property_prediction.py", line 134, in predict
        output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/models/mpnn.py", line 75, in forward
        x = self.layer(graph, layer_input)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
        return forward_call(*input, **kwargs)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 92, in forward
        update = self.message_and_aggregate(graph, input)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 61, in message_and_aggregate
        message = self.message(graph, input)
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 650, in message
        message = torch.einsum("bed, bd -> be", transform, input[node_in])
      File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/functional.py", line 378, in einsum
        return _VF.einsum(equation, operands) # type: ignore[attr-defined]
    RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

    One thing I should note is that I ran into issues similar to issue #95, so my pickle file only contains the molecules that were loaded properly (fewer than 100 couldn't be loaded, so it is essentially the entire dataset). Thanks for any help on this.

    opened by NicholasHattrup 0
Releases(v0.2.0)
  • v0.2.0(Sep 19, 2022)

    V0.2.0 is a major release that introduces a new family member, TorchProtein, a library for machine-learning-guided protein science. Aiming to simplify the development of protein methods, TorchProtein encapsulates many complicated yet repetitive subroutines into functional modules, including widely used datasets, flexible data processing operations, advanced encoding models, and diverse protein tasks.

    Such comprehensive encapsulation enables users to develop protein machine learning solutions with one easy-to-use library, and avoids the awkwardness of gluing multiple libraries into a pipeline.

    With TorchProtein, we can rapidly prototype machine learning solutions to various protein applications within 20 lines of code, and conduct ablation studies by substituting different parts of the solution with off-the-shelf modules. Furthermore, we can easily adapt these modules to our own needs, and make systematic analyses by comparing the new results to a benchmark provided in the library.

    Additionally, TorchProtein is designed to be accessible to everyone. For inexperienced users, like beginners or biological researchers, TorchProtein provides user-friendly APIs to simplify the development of protein machine learning solutions. Meanwhile, for professional users, TorchProtein also preserves enough flexibility to satisfy their demands, supported by features like modular design of the library and on-the-fly graph construction.

    Main Features

    Simplify Data Processing

    • It is challenging to transform raw bioinformatic protein datasets into tensor formats for machine learning. To reduce tedious operations, TorchProtein provides us with a data structure data.Protein and its batched extension data.PackedProtein to automate the data processing step.

      • data.Protein and data.PackedProtein automatically gather protein data from various bio-sources and seamlessly switch between data formats like pdb files, RDKit objects and sequences. Please see the section data structures and operations for transforming from and to sequences and RDKit objects.

        from torchdrug import data

        # construct a data.Protein instance from a pdb file
        pdb_file = ...
        protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
        print(protein)
        
        # write a data.Protein instance back to a pdb file
        new_pdb_file = ...
        protein.to_pdb(new_pdb_file)
        
        Protein(num_atom=445, num_bond=916, num_residue=57)
        
      • data.Protein and data.PackedProtein automatically pre-process all kinds of features of atoms, bonds and residues, by simply setting up several arguments.

        pdb_file = ...
        protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
        
        # feature
        print(protein.residue_feature.shape)
        print(protein.atom_feature.shape)
        print(protein.bond_feature.shape)
        
        torch.Size([57, 21])
        torch.Size([445, 3])
        torch.Size([916, 1])
        
      • data.Protein and data.PackedProtein automatically keep track of numerous attributes associated with atoms, bonds, residues and the whole protein.

        • For example, reference offers a way to register new attributes as node, edge or graph properties, so that the new attributes automatically travel with the nodes, edges or graphs themselves. More built-in attributes are listed in the section data structures and operations.
        protein = ...
        
        with protein.node():
            protein.node_id = torch.arange(protein.num_node)
        with protein.edge():
            protein.edge_cost = torch.rand(protein.num_edge)
        with protein.graph():
            protein.graph_feature = torch.randn(128)
        
        • Moreover, reference can be used to maintain the correspondence between two closely related objects. For example, the mapping atom2residue maintains the relationship between atoms and residues, and enables indexing on either of them.
        protein = ...
        
        # create mask indices for atoms in glutamine (GLN) residues
        is_glutamine = protein.residue_type[protein.atom2residue] == protein.residue2id["GLN"]
        mask_indices = is_glutamine.nonzero().squeeze(-1)
        print(mask_indices)
        
        # map the masked atoms back to the glutamine residue
        residue_type = protein.residue_type[protein.atom2residue[mask_indices]]
        print([protein.id2residue[r] for r in residue_type.tolist()])
        
        tensor([ 26,  27,  28,  29,  30,  31,  32,  33,  34, 307, 308, 309, 310, 311,
                312, 313, 314, 315, 384, 385, 386, 387, 388, 389, 390, 391, 392])
        ['GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN']
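
The atom-to-residue lookup used above can be pictured with plain Python lists. This is only a sketch on made-up toy data: the names mirror the attributes in the example, but the lists stand in for tensors and nothing here is TorchDrug code.

```python
# Toy data (hypothetical): 4 residues, 8 atoms.
residue_type = ["MET", "GLN", "ALA", "GLN"]   # one entry per residue
atom2residue = [0, 0, 1, 1, 1, 2, 3, 3]       # one entry per atom

# Atom-level mask: look up each atom's residue, then compare its type.
mask_indices = [a for a, r in enumerate(atom2residue) if residue_type[r] == "GLN"]
print(mask_indices)

# Map the selected atoms back to residue ids (deduplicated, order preserved).
residues = list(dict.fromkeys(atom2residue[a] for a in mask_indices))
print(residues)
```

Because the mapping is stored per atom, the same list answers both questions: which atoms belong to a residue type, and which residues a set of atoms came from.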
        
    • It is useful to augment protein data by modifying protein graphs or constructing new ones. With the protein operations and the graph construction layers provided in TorchProtein,

      • we can easily modify proteins on the fly by batching, slicing sequences, masking out side chains, etc. Please see the tutorials for more details on masking.

        pdb_file = ...
        protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
        
        # batch
        proteins = data.Protein.pack([protein, protein, protein])
        
        # slice sequences
        # use indexing to extract consecutive residues of a particular protein
        two_residues = protein[[0,2]]
        two_residues.visualize()
        

        (Figure: visualization of the two residues)

      • we can construct protein graphs on the fly with GPU acceleration, which offers users flexible choices rather than using fixed pre-processed graphs. Below is an example that builds a graph with only alpha carbon atoms. Please check the tutorials for more cases, such as adding spatial / KNN / sequential edges.

        protein = ...
        # transfer from CPU to GPU
        protein = protein.cuda()
        print(protein)
        
        # build a graph with only alpha carbon (CA) atoms
        node_layers = [geometry.AlphaCarbonNode()]
        graph_construction_model = layers.GraphConstruction(node_layers=node_layers)
        
        original_protein = data.Protein.pack([protein])
        CA_protein = graph_construction_model(original_protein)
        print("Graph before:", original_protein)
        print("Graph after:", CA_protein)
        
        Protein(num_atom=445, num_bond=916, num_residue=57, device='cuda:0')
        Graph before: PackedProtein(batch_size=1, num_atoms=[2639], num_bonds=[5368], num_residues=[350])
        Graph after: PackedProtein(batch_size=1, num_atoms=[350], num_bonds=[0], num_residues=[350])
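
The effect seen in the output above (one atom per residue, zero bonds) can be reproduced by hand on a toy example. The sketch below uses made-up atom names and bonds and plain lists. It illustrates the idea of keeping only alpha carbons and is not the implementation of geometry.AlphaCarbonNode.

```python
# Toy protein: two residues, backbone atoms plus one side-chain atom (made-up data).
atom_name = ["N", "CA", "C", "O", "N", "CA", "C", "O", "CB"]
atom2residue = [0, 0, 0, 0, 1, 1, 1, 1, 1]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (6, 7), (5, 8)]

# Keep only alpha carbons and renumber them compactly.
keep = [i for i, name in enumerate(atom_name) if name == "CA"]
old2new = {old: new for new, old in enumerate(keep)}

# A bond survives only if both of its endpoints survive.
ca_bonds = [(old2new[u], old2new[v]) for u, v in bonds if u in old2new and v in old2new]

print(len(keep), len(set(atom2residue)), ca_bonds)
```

No covalent bond joins two alpha carbons directly, so the filtered graph has no edges left. This is why edge layers such as spatial, KNN or sequential edges are typically added afterwards.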
        

    Easy to Prototype Solutions

    With TorchProtein, common protein tasks, such as sequence-based protein property prediction, can be finished within 20 lines of code. Below is an example. More examples of popular protein tasks and models can be found in Protein Tasks, Models and Tutorials.

    import torch
    from torchdrug import datasets, transforms, models, tasks, core
    
    truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
    protein_view_transform = transforms.ProteinView(view="residue")
    transform = transforms.Compose([truncate_transform, protein_view_transform])
    
    dataset = datasets.BetaLactamase("~/protein-datasets/", residue_only=True, transform=transform)
    train_set, valid_set, test_set = dataset.split()
    
    model = models.ProteinCNN(input_dim=21,
                              hidden_dims=[1024, 1024],
                              kernel_size=5, padding=2, readout="max")
    
    task = tasks.PropertyPrediction(model, task=dataset.tasks,
                                    criterion="mse", metric=("mae", "rmse", "spearmanr"),
                                    normalization=False, num_mlp_layer=2)
    
    optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
    solver = core.Engine(task, train_set, valid_set, test_set, optimizer, 
                         gpus=[0], batch_size=64)
    solver.train(num_epoch=10)
    solver.evaluate("valid")
    
    mean absolute error [scaled_effect1]: 0.249482
    root mean squared error [scaled_effect1]: 0.304326
    spearmanr [scaled_effect1]: 0.44572
    

    Compatible with Existing Molecular Models in TorchDrug

    • TorchProtein follows the scientific fact that proteins are macromolecules. The core data structures data.Protein and data.PackedProtein inherit from data.Molecule and data.PackedMolecule respectively. Therefore, we can apply any existing molecule model in TorchDrug to proteins.

      import torch
      from torchdrug import layers, datasets, transforms, models, tasks, core
      from torchdrug.layers import geometry
      
      truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
      protein_view_transform = transforms.ProteinView(view="residue")
      transform = transforms.Compose([truncate_transform, protein_view_transform])
      
      dataset = datasets.EnzymeCommission("~/protein-datasets/", transform=transform)
      train_set, valid_set, test_set = dataset.split()
      
      model = models.GIN(input_dim=21,
                          hidden_dims=[256, 256, 256, 256],
                          batch_norm=True, short_cut=True, concat_hidden=True)
      
      graph_construction_model = layers.GraphConstruction(
          node_layers=[geometry.AlphaCarbonNode()],
          edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                       geometry.KNNEdge(k=10, min_distance=5),
                       geometry.SequentialEdge(max_distance=2)],
          edge_feature="residue_type"
      )
      
      task = tasks.MultipleBinaryClassification(model, graph_construction_model=graph_construction_model, num_mlp_layer=3,
                                                task=list(range(len(dataset.tasks))), criterion="bce",
                                                metric=("auprc@micro", "f1_max"))
      
      optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
      solver = core.Engine(task, train_set, valid_set, test_set, optimizer, 
                           gpus=[0], batch_size=4)
      solver.train(num_epoch=10)
      solver.evaluate("valid")
      
      auprc@micro: 0.187884
      f1_max: 0.231008
      
    • In the Protein-Ligand Interaction (PLI) prediction task, we can use a molecular encoder module to extract the representations of molecules. Please check tutorial 2 for more details.

      train_set, valid_set, test_set = ...
      
      # protein encoder
      model = models.ProteinCNN(input_dim=21,
                                hidden_dims=[1024, 1024],
                                kernel_size=5, padding=2, readout="max")
      # molecule encoder
      model2 = models.GIN(input_dim=66,
                          hidden_dims=[256, 256, 256, 256],
                          batch_norm=True, short_cut=True, concat_hidden=True)
      
      task = tasks.InteractionPrediction(model, model2=model2, task=dataset.tasks,
                                         criterion="mse", metric=("mae", "rmse", "spearmanr"),
                                         normalization=False, num_mlp_layer=2)
      
      optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
      solver = core.Engine(task, train_set, valid_set, test_set, optimizer,
                           gpus=[0], batch_size=16)
      solver.train(num_epoch=5)
      solver.evaluate("valid")
      
      mean absolute error [scaled_effect1]: 0.249482
      root mean squared error [scaled_effect1]: 0.304326
      spearmanr [scaled_effect1]: 0.44572
      

    Support From the Developer (@DeepGraphLearning/torchdrug-maintainers)

    There is always an active support team to answer questions and provide help. Feedback on your experience and contributions to development are welcome.

    New Modules

    Data Structures and Operations

    data.Protein

    • Representative attributes:
      • data.Protein.edge_list: list of edges and each edge is represented by a tuple (node_in, node_out, bond_type)
      • data.Protein.atom_type: atom types
      • data.Protein.bond_type: bond types
      • data.Protein.residue_type: residue types
      • data.Protein.view: default view for this protein. Can be “atom” or “residue”
      • data.Protein.atom_name: atom names in each residue
      • data.Protein.atom2residue: atom id to residue id mapping
      • data.Protein.is_hetero_atom: hetero atom indicator
      • data.Protein.occupancy: protein occupancy
      • data.Protein.b_factor: temperature factors
      • data.Protein.residue_number: residue numbers
      • data.Protein.insertion_code: insertion codes
      • data.Protein.chain_id: chain ids
    • Representative methods:
      • data.Protein.from_molecule: create a protein from an RDKit object.
      • data.Protein.from_sequence: create a protein from a sequence.
      • data.Protein.from_sequence_fast: a faster version of creating a protein from a sequence.
      • data.Protein.from_pdb: create a protein from a PDB file.
      • data.Protein.to_molecule: return an RDKit object of this protein.
      • data.Protein.to_sequence: return a sequence of this protein.
      • data.Protein.to_pdb: write this protein to a pdb file.
      • data.Protein.split: split this protein graph into multiple disconnected protein graphs.
      • data.Protein.pack: batch a list of data.Protein into data.PackedProtein.
      • data.Protein.repeat: repeat this protein.
      • data.Protein.residue2atom: map residue id to atom ids.
      • data.Protein.residue_mask: return a masked protein based on the specified residues.
      • data.Protein.subresidue: return a subgraph based on the specified residues.
      • data.Protein.residue2graph: residue id to protein id mapping.
      • data.Protein.node_mask: return a masked protein based on the specified nodes.
      • data.Protein.edge_mask: return a masked protein based on the specified edges.
      • data.Protein.compact: remove isolated nodes and compact node ids.

    data.PackedProtein

    • Representative attributes:
      • data.PackedProtein.edge_list: list of edges and each edge is represented by a tuple (node_in, node_out, bond_type)
      • data.PackedProtein.atom_type: atom types
      • data.PackedProtein.bond_type: bond types
      • data.PackedProtein.residue_type: residue types
      • data.PackedProtein.view: default view for this protein. Can be “atom” or “residue”
      • data.PackedProtein.num_nodes: number of nodes in each protein graph
      • data.PackedProtein.num_edges: number of edges in each protein graph
      • data.PackedProtein.num_residues: number of residues in each protein graph
      • data.PackedProtein.offsets: node id offsets in different proteins
    • Representative methods:
      • data.PackedProtein.node_mask: return a masked packed protein based on the specified nodes.
      • data.PackedProtein.edge_mask: return a masked packed protein based on the specified edges.
      • data.PackedProtein.residue_mask: return a masked packed protein based on the specified residues.
      • data.PackedProtein.graph_mask: return a masked packed protein based on the specified protein graphs.
      • data.PackedProtein.from_molecule: create a protein from a list of RDKit objects.
      • data.PackedProtein.from_sequence: create a protein from a list of sequences.
      • data.PackedProtein.from_sequence_fast: a faster version of creating a protein from a list of sequences.
      • data.PackedProtein.from_pdb: create a protein from a list of PDB files.
      • data.PackedProtein.to_molecule: return a list of RDKit objects of this packed protein.
      • data.PackedProtein.to_sequence: return a list of sequences of this packed protein.
      • data.PackedProtein.to_pdb: write this packed protein to a list of pdb files.
      • data.PackedProtein.merge: merge multiple packed proteins into a single packed protein.
      • data.PackedProtein.repeat: repeat this packed protein.
      • data.PackedProtein.repeat_interleave: repeat this packed protein, behaving similarly to torch.repeat_interleave.
      • data.PackedProtein.residue2graph: residue id to graph id mapping.

    Models

    • GearNet: Geometry Aware Relational Graph Neural Network.
    • ESM: Evolutionary Scale Modeling (ESM).
    • ProteinCNN: protein shallow CNN.
    • ProteinResNet: protein ResNet.
    • ProteinLSTM: protein LSTM.
    • ProteinBERT: protein BERT.
    • Statistic: the statistic feature engineering for protein sequence.
    • Physicochemical: the physicochemical feature engineering for protein sequence.

    Protein Tasks

    Sequence-based Protein Property Prediction:

    • tasks.PropertyPrediction predicts some property of each protein, such as Beta-lactamase activity, stability and solubility for proteins.
    • tasks.NodePropertyPrediction predicts some property of each residue in proteins, such as the secondary structure (coil, strand or helix) of each residue.
    • tasks.ContactPrediction predicts whether any pair of residues contact or not in the folded structure.
    • tasks.InteractionPrediction predicts the binding affinity of two interacting proteins or of a protein and a ligand, i.e. performing PPI affinity prediction or PLI affinity prediction.

    Structure-based Protein Property Prediction:

    • tasks.MultipleBinaryClassification predicts, with one binary label per function, whether a protein has each of several specific functions.

    Pre-trained Protein Structure Representations:

    • Self-Supervised Protein Structure Pre-training: acquires informative protein representations from massive unlabeled protein structures, such as tasks.EdgePrediction, tasks.AttributeMasking, tasks.ContextPrediction, tasks.DistancePrediction, tasks.AnglePrediction, tasks.DihedralPrediction.
    • Fine-tuning on Downstream Task: fine-tunes the pre-trained protein encoder on downstream tasks, such as any property prediction task mentioned above.

    Protein Datasets

    Protein Property Prediction Datasets

    • BetaLactamase: protein sequences with activity labels
    • Fluorescence: protein sequences with fitness labels
    • Stability: protein sequences with stability labels
    • Solubility: protein sequences with solubility labels
    • BinaryLocalization: protein sequences with membrane-bound or soluble labels
    • SubcellularLocalization: protein sequences with natural cell location labels
    • EnzymeCommission: protein sequences and 3D structures with EC number labels for catalysis in biochemical reactions
    • GeneOntology: protein sequences and 3D structures with GO term labels, including molecular function (MF), biological process (BP) and cellular component (CC)
    • AlphaFoldDB: protein sequences and 3D structures predicted by AlphaFold

    Protein Structure Prediction Datasets

    • Fold: protein sequences and 3D structures with fold labels determined by the global structural topology
    • SecondaryStructure: protein sequences and 3D structures with secondary structure labels determined by the local structures
    • ProteinNet: protein sequences and 3D structures for the contact prediction task

    Protein-Protein Interaction Prediction Datasets

    • HumanPPI: protein sequences with binary interaction labels for human proteins
    • YeastPPI: protein sequences with binary interaction labels for yeast proteins
    • PPIAffinity: protein sequences with binding affinity values measured by $p_{K_d}$

    Protein-Ligand Interaction Prediction Datasets

    • BindingDB: protein sequences and molecule graphs with binding affinity between pairs of protein and ligand
    • PDBBind: protein sequences and molecule graphs with binding affinity between pairs of protein and ligand

    Data Transform Modules

    • TruncateProtein: truncate overly long protein sequences to a fixed length
    • ProteinView: convert proteins to a specific view
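
As a rough picture of what truncation to a fixed length means, the sketch below works on a plain sequence string. The helper `truncate` and its `randomize` flag are hypothetical names made up for this illustration. The real transform operates on protein objects, and its exact windowing rule is not described here.

```python
import random

def truncate(sequence, max_length, randomize=False, rng=random):
    """Return a window of at most max_length residues from sequence."""
    if len(sequence) <= max_length:
        return sequence                      # short sequences pass through unchanged
    # Assumption for this sketch: a deterministic crop keeps the leading residues,
    # while a random crop picks a uniformly random start position.
    start = rng.randint(0, len(sequence) - max_length) if randomize else 0
    return sequence[start:start + max_length]

print(truncate("MKTAYIAKQR", 5))
print(truncate("MKT", 5))
```

A fixed maximum length bounds the memory used per batch, which matters when a few very long proteins would otherwise dominate it.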

    Graph Construction Layers

    • SubsequenceNode: take a protein subsequence of a specific length
    • SubspaceNode: extract a subgraph by only keeping neighboring nodes in a spatial ball for each centered node
    • RandomEdgeMask: mask out some edges randomly from the protein graph

    Tutorials

    To help users gain a comprehensive understanding of TorchProtein, we recommend some user-friendly tutorials covering its basic usage and examples of various protein-related tasks. These tutorials may also serve as boilerplate code for users to develop their own applications.

    Bug Fixes

    • Fix an error in the decorator @utils.cached (#118)
    • Fix an index error in data.Graph.split() (#115)
    • Fix setting attributes node_feature, edge_feature and graph_feature (#116)
    • Fix incorrect node feature shape for the synthon dataset USPTO50k (#116)
    • Fix a compatibility issue when adding node/edge/graph reference and changing node/edge to atom/bond (#116, #117)
  • v0.1.3(Jun 4, 2022)

    TorchDrug 0.1.3 introduces new features such as W&B integration and index reference. It also provides new functions and metrics for common development needs. Note that 0.1.3 includes some compatibility changes, so be careful when updating TorchDrug from an older version.

    • W&B Integration
    • Index Reference
    • New Functions
    • New Metrics
    • Improvements
    • Bug Fixes
    • Compatibility Changes

    W&B Integration

    Tracking experiment progress is one of the most important demands from ML researchers and developers. For TorchDrug users, we provide a native integration with the W&B platform. By adding only one argument to core.Engine, TorchDrug will automatically copy every hyperparameter and training log to your W&B database (thanks to @manangoel99).

    solver = core.Engine(task, train_set, valid_set, test_set, optimizer, logger="wandb")
    

    Now you can track your training and validation performance in your browser, and compare them across different experiments.

    (Figure: Wandb demo)

    Index Reference

    Maintaining node and edge attributes could be painful when one applies a lot of transformations to a graph. TorchDrug aims to eliminate such tedious steps by registering custom attributes. This update extends the capacity of custom attributes to index reference. That means, we allow attributes to refer to indexes of nodes, edges or graphs, and they will be automatically maintained in any graph operation.

    To use index reference, simply add a context manager when we define the attributes.

    with graph.edge(), graph.edge_reference():
        graph.inv_edge_index = torch.tensor(inv_edge_index)
    

    For more details on index reference, please take a look at our notes. Typical use cases include

    • A pointer to the inverse edge of each edge.
    • A pointer to the parent node of each node in a tree.
    • A pointer to the incoming tree edge of each node in a DFS.

    Let us know if you find more interesting usage of index reference!
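
To see why index-valued attributes need special care, consider what happens to an inverse-edge pointer when some edges are dropped. The pure-Python sketch below is not TorchDrug code. The helper `mask_edges` is hypothetical, and mapping dangling pointers to -1 is a choice made for this illustration only.

```python
def mask_edges(edge_list, inv_edge_index, keep):
    """Keep the edges at positions `keep` and remap the index-valued attribute."""
    keep = sorted(keep)
    old2new = {old: new for new, old in enumerate(keep)}
    new_edges = [edge_list[i] for i in keep]
    # A pointer to a dropped edge becomes -1 (an assumption of this sketch).
    new_inv = [old2new.get(inv_edge_index[i], -1) for i in keep]
    return new_edges, new_inv

edge_list = [(0, 1), (1, 0), (1, 2), (2, 1)]
inv_edge_index = [1, 0, 3, 2]   # edge i and edge inv_edge_index[i] are inverses

edges, inv = mask_edges(edge_list, inv_edge_index, keep=[2, 3])
print(edges, inv)
```

Copying the attribute values unchanged would leave the pointers 3 and 2 in a graph that now has only two edges. Remapping them through the old-to-new index table keeps the attribute meaningful after the operation.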

    New Functions

    Message passing over line graphs has become increasingly popular in recent years. This version provides data.Graph.line_graph to efficiently construct line graphs on GPUs. It supports either a single graph or a batch of graphs.
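
For readers unfamiliar with line graphs, here is a minimal pure-Python construction for a directed graph. It assumes one common convention: every edge of the original graph becomes a node, and edge i = (u, v) is linked to edge j = (v, w) when the head of i is the tail of j. The conventions used by data.Graph.line_graph (direction, handling of backtracking pairs) may differ.

```python
from collections import defaultdict

def line_graph(edge_list):
    """Return the edges (i, j) of the line graph of a directed edge list."""
    out_edges = defaultdict(list)            # source node -> ids of edges leaving it
    for j, (src, _dst) in enumerate(edge_list):
        out_edges[src].append(j)
    # Edge i = (u, v) connects to every edge j that starts at v.
    return [(i, j) for i, (_src, dst) in enumerate(edge_list) for j in out_edges[dst]]

edges = [(0, 1), (1, 2), (2, 0), (1, 0)]
lg = line_graph(edges)
print(lg)
```

Grouping edges by source node first avoids comparing every pair of edges, so the cost is proportional to the size of the output rather than to the square of the number of edges.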

    We are constantly focusing on better batching of irregular structures, and the variadic functions in TorchDrug are an efficient way to process batches of variadic-sized tensors without padding. This update introduces 3 new variadic functions.

    • variadic_meshgrid generates a meshgrid from two variadic tensors. Useful for implementing pairwise operations.
    • variadic_to_padded converts a variadic tensor to a padded tensor.
    • padded_to_variadic converts a padded tensor to a variadic tensor.
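
The relationship between the variadic and padded layouts can be sketched with plain lists. The helpers below (`to_padded`, `to_variadic`, `meshgrid_pairs`) are hypothetical stand-ins written for this illustration. The library functions operate on tensors, and their signatures may differ.

```python
def to_padded(values, sizes, pad=0):
    """Split a flat list into rows of the given sizes, padded to equal width."""
    rows, start, width = [], 0, max(sizes, default=0)
    for size in sizes:
        rows.append(values[start:start + size] + [pad] * (width - size))
        start += size
    return rows

def to_variadic(rows, sizes):
    """Inverse of to_padded: drop the padding and concatenate the rows."""
    return [x for row, size in zip(rows, sizes) for x in row[:size]]

def meshgrid_pairs(a, size_a, b, size_b):
    """Pair every element of the i-th set in a with every element of the i-th set in b."""
    pairs, sa, sb = [], 0, 0
    for na, nb in zip(size_a, size_b):
        pairs += [(x, y) for x in a[sa:sa + na] for y in b[sb:sb + nb]]
        sa, sb = sa + na, sb + nb
    return pairs

padded = to_padded([1, 2, 3, 4, 5, 6], [2, 1, 3])
print(padded)
print(to_variadic(padded, [2, 1, 3]))
print(meshgrid_pairs([1, 2, 3], [2, 1], [10, 20, 30], [1, 2]))
```

The variadic layout stores only real elements plus a list of sizes, so no memory or computation is spent on padding. The padded layout is convenient when an operation expects a rectangular input.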

    New Metrics

    New metrics include accuracy, matthews_corrcoef, pearsonr, spearmanr. All the metrics are the same as their counterparts in scipy, but they are implemented in PyTorch and support auto differentiation.
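
As a reminder of what the two correlation metrics measure, here is a pure-Python sketch of the standard definitions. It is not the TorchDrug implementation: it works on lists, supports no automatic differentiation, and the Spearman version assumes there are no tied values.

```python
import math

def pearsonr(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(x):
    """Rank of each element (0 = smallest); ties are not handled in this sketch."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    for rank, i in enumerate(order):
        result[i] = float(rank)
    return result

def spearmanr(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearsonr(ranks(x), ranks(y))

print(pearsonr([1, 2, 3], [2, 4, 6]))
print(spearmanr([1, 2, 3, 4], [1, 4, 9, 16]))
```

Pearson measures linear agreement, while Spearman only asks whether the ordering is preserved. That is why the second call reports perfect correlation even though the relationship is quadratic.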

    Improvements

    • Add data.Graph.to (#70, thanks to @cthoyt)
    • Extend tasks.SynthonCompletion for arbitrary atom features (#62)
    • Speed up lazy data loading (#58, thanks to @wconnell)
    • Speed up rspmm cuda kernels
    • Add docker support
    • Add more documentation for data.Graph and data.Molecule

    Bug Fixes

    • Fix computation of output dimension in several GNNs (#92, thanks to @kanojikajino)
    • Fix data.PackedGraph.__getitem__ when the batch is empty (#83, thanks to @jannisborn)
    • Fix patched modules for PyTorch>=1.6.0 (#77)
    • Fix make_configurable for torch.utils.data (#85)
    • Fix multi_slice_mask, variadic_max for multi-dimensional input
    • Fix variadic_topk for input containing infinite values

    Compatibility Changes

    TorchDrug now supports Python 3.7/3.8/3.9. Starting from this version, TorchDrug requires a minimal PyTorch version of 1.8.0 and a minimal RDKit version of 2020.09.

    Arguments node_feature and edge_feature are renamed to atom_feature and bond_feature in data.Molecule.from_smiles and data.Molecule.from_molecule. The old interface is still supported with deprecation warnings.

  • v0.1.2(Oct 23, 2021)

    0.1.2 Release Notes

    The recent 0.1.2 release of TorchDrug brings updates to Colab tutorials, data structures, functions and datasets, along with bug fixes. We are grateful to see growing interest and involvement from the community, especially on the retrosynthesis task. More contributions are welcome in the future!

    • Colab Tutorials
    • New Data Structures
    • New Functions
    • New Datasets
    • Bug Fixes

    Colab Tutorials

    To familiarize users with the logic and capacity of TorchDrug, we compile a full set of Colab tutorials, covering everything from basic usage to different drug discovery tasks. All the tutorials are fully interactive and may serve as boilerplate code for your own applications.

    • Basic Usage and Pipeline shows the manipulation of data structures like data.Graph and data.Molecule, as well as the training and evaluation pipelines for property prediction models.
    • Pretrained Molecular Representations demonstrates the steps for self-supervised pretraining of a molecular representation model and finetuning it on downstream tasks.
    • De novo Molecule Design illustrates the routine of training generative models for molecule generation and finetuning them with reinforcement learning for property optimization. Two popular models, GCPN and GraphAF, are covered in the tutorial.
    • Retrosynthesis shows how to use the state-of-the-art model, G2Gs, to predict a set of reactants for synthesizing a target molecule.
    • Knowledge Graph Reasoning goes through the steps of training and evaluating models for knowledge graph completion, including both knowledge graph embeddings and neural inductive logic programming.

    New Data Structures

    • A new data structure data.Dictionary that stores key-value mappings of PyTorch tensors on either CPUs or GPUs. It enjoys O(n) memory consumption and O(1) query time, and supports parallelism over batches of queries. This API provides a great opportunity for implementing sparse lookup tables or set operations in a PyTorchic style.
    • A new method data.Graph.match to efficiently retrieve all edges of specific patterns on either CPUs or GPUs. It scales linearly w.r.t. the number of patterns plus the number of retrieved edges, regardless of the size of the graph. Typical usage of this method includes querying the existence of edges, generating random walks or even extracting ego graphs.
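
To make "edges of specific patterns" concrete, the sketch below matches patterns against a small edge list in plain Python. It assumes that a pattern has the same layout as an edge and that -1 acts as a wildcard. Both are assumptions of this illustration, and the helper `match_edges` is hypothetical. It also scans every edge for every pattern, so it does not reproduce the scaling behaviour described above.

```python
def match_edges(edge_list, patterns, wildcard=-1):
    """For each pattern, return the positions of the edges that match it."""
    results = []
    for pattern in patterns:
        hits = [i for i, edge in enumerate(edge_list)
                if all(p == wildcard or p == e for p, e in zip(pattern, edge))]
        results.append(hits)
    return results

# Edges are (node_in, node_out, relation) triples (toy data).
edge_list = [(0, 1, 0), (1, 2, 0), (0, 2, 1), (2, 0, 0)]
patterns = [(0, -1, -1),    # all edges leaving node 0
            (-1, 2, -1),    # all edges entering node 2
            (2, 0, 0),      # does this exact edge exist?
            (3, -1, -1)]    # node 3 has no edges
matches = match_edges(edge_list, patterns)
print(matches)
```

The three use cases named above map onto these patterns: an exact pattern answers an existence query, while a pattern with a fixed source node lists the neighbors needed for a random-walk step or an ego graph.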

    New Functions

    Batching irregular structures, such as graphs, sets or sequences with different sizes, is a common demand in drug discovery. Instead of clumsy padding-based implementations, TorchDrug provides a family of functions that efficiently manipulate batches of variadic-sized tensors without padding. The update contains the following new variadic functions.

    • variadic_arange returns a 1-D tensor that contains integer intervals of variadic sizes.
    • variadic_softmax computes softmax over categories with variadic sizes.
    • variadic_sort sorts elements in sets with variadic sizes.
    • variadic_randperm returns random permutations for sets with variadic sizes, where the i-th permutation contains integers from 0 to size[i] - 1.
    • variadic_sample draws samples with replacement from sets with variadic sizes.
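
The padding-free layout behind these functions can be illustrated with a short plain-Python sketch. It assumes the batch is stored as one flat list of values plus a list of per-set sizes. The `ref_`-prefixed helpers are hypothetical reference versions written for this example only; they are not the TorchDrug functions, which work on tensors and run on GPUs.

```python
import math


def ref_variadic_arange(size):
    """Concatenate range(s) for every s in size."""
    return [i for s in size for i in range(s)]


def ref_variadic_softmax(values, size):
    """Apply softmax independently to each consecutive segment of values."""
    assert sum(size) == len(values), "sizes must cover the flat list exactly"
    out, start = [], 0
    for s in size:
        segment = values[start:start + s]
        if segment:
            peak = max(segment)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in segment]
            total = sum(exps)
            out.extend(e / total for e in exps)
        start += s
    return out


size = [2, 3, 1]                          # three sets of different sizes
values = [0.0, 0.0, 1.0, 2.0, 3.0, 5.0]   # flat storage, no padding

positions = ref_variadic_arange(size)
probs = ref_variadic_softmax(values, size)
print(positions)
print(probs)
```

Each segment of `probs` sums to 1 on its own, which is what a padded softmax needs an explicit mask to achieve.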

    New Datasets

    • PCQM4M: A large-scale molecule property prediction dataset, originally used in OGB-LSC (thanks to @OPAYA)

    Bug Fixes

    • Fix import of sascorer in plogp evaluation (#18, #31)
    • Fix atoms with stereo bonds in retrosynthesis (#42, #43)
    • Fix lazy construction for molecule datasets (#30, thanks to @DaShenZi721)
    • Fix ChEMBLFiltered dataset (#36)
    • Fix ZINC2m dataset (#33)
    • Fix USPTO50k dataset (#32)
    • Fix bugs in core.Configurable (#26)
    • Fix/improve documentation (#16, #28, #41)
    • Fix installation on macOS (#29)
Owner
MilaGraph
Research group led by Prof. Jian Tang at Mila-Quebec AI Institute (https://mila.quebec/) focusing on graph representation learning and graph neural networks.
Automated Machine Learning Pipeline for tabular data. Designed for predictive maintenance applications, failure identification, failure prediction, condition monitoring, etc.

Amplo 10 May 15, 2022
Contains an implementation (sklearn API) of the algorithm proposed in "GENDIS: GEnetic DIscovery of Shapelets" and code to reproduce all experiments.

GENDIS GENetic DIscovery of Shapelets In the time series classification domain, shapelets are small subseries that are discriminative for a certain cl

IDLab Services 90 Oct 28, 2022
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Vowpal Wabbit 8.1k Dec 30, 2022
Implementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
Python 3.6+ toolbox for submitting jobs to Slurm

Submit it! What is submitit? Submitit is a lightweight tool for submitting Python functions for computation within a Slurm cluster. It basically wraps

Facebook Incubator 768 Jan 3, 2023
A Python implementation of the Robotics Toolbox for MATLAB

Robotics Toolbox for Python A Python implementation of the Robotics Toolbox for MATLAB® GitHub repository Documentation Wiki (examples and details) Sy

Peter Corke 1.2k Jan 7, 2023
PyPOTS - A Python Toolbox for Data Mining on Partially-Observed Time Series

A python toolbox/library for data mining on partially-observed time series, supporting tasks of forecasting/imputation/classification/clustering on incomplete multivariate time series with missing values.

Wenjie Du 179 Dec 31, 2022
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Dec 22, 2022
CobraML: Completely Customizable A python ML library designed to give the end user full control

CobraML: Completely Customizable What is it? CobraML is a python library built on both numpy and numba. Unlike other ML libraries CobraML gives the us

Sriram Govindan 14 Dec 19, 2021
A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.

Light Gradient Boosting Machine LightGBM is a gradient boosting framework that uses tree based learning algorithms. It is designed to be distributed a

Microsoft 14.5k Jan 7, 2023
Machine learning template for projects based on sklearn library.

Janez Lapajne 17 Oct 28, 2022
PLUR is a collection of source code datasets suitable for graph-based machine learning.

PLUR (Programming-Language Understanding and Repair) is a collection of source code datasets suitable for graph-based machine learning. We provide scripts for downloading, processing, and loading the datasets. This is done by offering a unified API and data structures for all datasets.

Google Research 76 Nov 25, 2022
Predico Disease Prediction system based on symptoms provided by patient- using Python-Django & Machine Learning

Felix Daudi 1 Jan 6, 2022
Painless Machine Learning for python based on scikit-learn

PlainML Painless Machine Learning Library for python based on scikit-learn. Install pip install plainml Example from plainml import KnnModel, load_ir

null 1 Aug 6, 2022
Uber Open Source 1.6k Dec 31, 2022
SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

Cray Labs 139 Jan 1, 2023