TorchDrug is a PyTorch-based machine learning toolbox designed for drug discovery

MilaGraph

Last update: Jan 8, 2023

Overview

Docs | Tutorials | Benchmarks | Papers Implemented

TorchDrug is a PyTorch-based machine learning toolbox designed for several purposes.

Easy implementation of graph operations in a PyTorchic style with GPU support
Being friendly to practitioners with minimal knowledge about drug discovery
Rapid prototyping of machine learning research

Installation

TorchDrug is compatible with Python >= 3.5 and PyTorch >= 1.4.0.
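The version requirements above can be checked before installing. The snippet below is a minimal sketch: `parse_version` and `meets_minimum` are hypothetical helpers written for this example, not part of TorchDrug, and the PyTorch check only runs if PyTorch is already installed.

```python
import itertools
import sys


def parse_version(text):
    """Turn a string such as '1.13.0+cu117' into (1, 13, 0)."""
    core = text.split("+")[0]  # drop local build tags such as '+cu117'
    parts = []
    for piece in core.split(".")[:3]:
        # keep only the leading digits of each component
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare versions numerically, so that 1.13 counts as newer than 1.4."""
    return parse_version(installed) >= parse_version(minimum)


python_ok = sys.version_info >= (3, 5)
print("Python OK:", python_ok)

try:
    import torch
    print("PyTorch OK:", meets_minimum(torch.__version__, "1.4.0"))
except ImportError:
    print("PyTorch is not installed")
```

Comparing tuples rather than raw strings matters here, because the string "1.13.0" sorts before "1.4.0".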

From Conda

conda install -c milagraph -c conda-forge torchdrug

From Source

TorchDrug depends on rdkit, which is only available via conda. You can install rdkit with the following line.

conda install -c conda-forge rdkit

git clone https://github.com/DeepGraphLearning/torchdrug
cd torchdrug
pip install -r requirements.txt
python setup.py install

Quick Start

TorchDrug is designed for humans and focuses on graph-structured data. It enables easy implementation of graph operations in machine learning models. All the operations in TorchDrug are backed by the PyTorch framework, and support GPU acceleration and automatic differentiation.

from torchdrug import data

edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
graph = data.Graph(edge_list, num_node=6)
graph = graph.cuda()
# the subgraph induced by nodes 2, 3 & 4
subgraph = graph.subgraph([2, 3, 4])
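To make "the subgraph induced by nodes 2, 3 & 4" concrete, here is a plain-Python sketch of the idea. It is not TorchDrug's implementation: `induced_subgraph` is a hypothetical helper, and relabeling the kept nodes to 0..k-1 is an assumption about how a compact subgraph is indexed.

```python
def induced_subgraph(edge_list, nodes):
    """Keep edges whose endpoints are both in `nodes`, then relabel nodes to 0..k-1."""
    new_index = {node: i for i, node in enumerate(nodes)}
    return [
        [new_index[u], new_index[v]]
        for u, v in edge_list
        if u in new_index and v in new_index
    ]


edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
print(induced_subgraph(edge_list, [2, 3, 4]))  # [[0, 1], [1, 2]]
```

Only the edges 2-3 and 3-4 survive, because every other edge touches a node outside the selection.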

Molecules are also supported in TorchDrug. You can get the desired molecule properties without any domain knowledge.

mol = data.Molecule.from_smiles("CCOC(=O)N", node_feature="default", edge_feature="default")
print(mol.node_feature)
print(mol.atom_type)
print(mol.to_scaffold())

You may also register custom node, edge or graph attributes. They will be automatically processed during indexing operations.

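import torchdrug as td

with mol.edge():
    # an edge is a C-C bond when both of its endpoint atoms are carbon
    mol.is_CC_bond = (mol.atom_type[mol.edge_list[:, :2]] == td.CARBON).all(dim=-1)
sub_mol = mol.subgraph(mol.atom_type != td.NITROGEN)
print(sub_mol.is_CC_bond)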

TorchDrug provides a wide range of common datasets and building blocks for drug discovery. With minimal code, you can apply standard models to solve your own problem.

import torch
from torchdrug import datasets

dataset = datasets.Tox21()
dataset[0].visualize()
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

from torchdrug import models, tasks

model = models.GIN(dataset.node_feature_dim, hidden_dims=[256, 256, 256, 256])
task = tasks.PropertyPrediction(model, task=dataset.tasks)
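The 80/10/10 split above truncates the first two lengths with `int` and gives the remainder to the last split, so the three lengths always sum to the dataset size. A small sketch of that arithmetic, using a hypothetical helper `split_lengths` and a made-up dataset size:

```python
def split_lengths(n, fractions=(0.8, 0.1)):
    """Truncate each fractional length; the last split absorbs the remainder."""
    lengths = [int(f * n) for f in fractions]
    lengths.append(n - sum(lengths))
    return lengths


print(split_lengths(1003))  # [802, 100, 101]
```

Because of the truncation, the last split can be slightly larger than a strict 10% share.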

Training and inference can be accelerated with multiple CPUs or GPUs. Switching between them in TorchDrug takes a single line of code.

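import torch
from torchdrug import core

# the optimizer is passed to Engine right after the datasets
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)

# CPU
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=None)
# Single GPU
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0])
# Multiple GPUs
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0, 1, 2, 3])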

Contributing

Everyone is welcome to contribute to the development of TorchDrug. Please refer to the contributing guidelines for more details.

License

TorchDrug is released under the Apache-2.0 License.

Comments

[Bug] AttributeError: can't set attribute

In the retrosynthesis tutorial, this synthon completion code

synthon_optimizer = torch.optim.Adam(synthon_task.parameters(), lr=1e-3)
synthon_solver = core.Engine(synthon_task, synthon_train, synthon_valid,
                             synthon_test, synthon_optimizer,
                             gpus=[0], batch_size=128)
synthon_solver.train(num_epoch=1)
synthon_solver.evaluate("valid")
synthon_solver.save("g2gs_synthon_model.pth")

gives the error below:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [13], in <cell line: 11>()
      7 synthon_optimizer = torch.optim.Adam(synthon_task.parameters(), lr=1e-3)
      8 synthon_solver = core.Engine(synthon_task, synthon_train, synthon_valid,
      9                              synthon_test, synthon_optimizer,
     10                              gpus=[0], batch_size=128)
---> 11 synthon_solver.train(num_epoch=1)
     12 synthon_solver.evaluate("valid")
     13 synthon_solver.save("g2gs_synthon_model.pth")

File /usr/local/lib/python3.9/dist-packages/torchdrug/core/engine.py:155, in Engine.train(self, num_epoch, batch_per_epoch)
    152 if self.device.type == "cuda":
    153     batch = utils.cuda(batch, device=self.device)
--> 155 loss, metric = model(batch)
    156 if not loss.requires_grad:
    157     raise RuntimeError("Loss doesn't require grad. Did you define any loss in the task?")

File /usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:592, in SynthonCompletion.forward(self, batch)
    589 all_loss = torch.tensor(0, dtype=torch.float32, device=self.device)
    590 metric = {}
--> 592 pred, target = self.predict_and_target(batch, all_loss, metric)
    593 node_in_pred, node_out_pred, bond_pred, stop_pred = pred
    594 node_in_target, node_out_target, bond_target, stop_target, size = target

File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:984, in SynthonCompletion.predict_and_target(self, batch, all_loss, metric)
    981 with reactant.graph():
    982     reactant.reaction = batch["reaction"]
--> 984 graph1, node_in_target1, node_out_target1, bond_target1, stop_target1 = self.all_edge(reactant, synthon)
    985 graph2, node_in_target2, node_out_target2, bond_target2, stop_target2 = self.all_stop(reactant, synthon)
    987 graph = self._cat([graph1, graph2])

File /usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py:27, in _DecoratorContextManager.__call__.<locals>.decorate_context(*args, **kwargs)
     24 @functools.wraps(func)
     25 def decorate_context(*args, **kwargs):
     26     with self.clone():
---> 27         return func(*args, **kwargs)

File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:557, in SynthonCompletion.all_edge(self, reactant, synthon)
    555 assert (graph.num_edges % 2 == 0).all()
    556 # node / edge features may change because we mask some nodes / edges
--> 557 graph, feature_valid = self._update_molecule_feature(graph)
    559 return graph[feature_valid], node_in_target[feature_valid], node_out_target[feature_valid], \
    560        bond_target[feature_valid], stop_target[feature_valid]

File /usr/local/lib/python3.9/dist-packages/torchdrug/tasks/retrosynthesis.py:398, in SynthonCompletion._update_molecule_feature(self, graphs)
    395 bond_type[edge_mask] = new_graphs.bond_type.to(device=graphs.device)
    397 with graphs.node():
--> 398     graphs.node_feature = node_feature
    399 with graphs.edge():
    400     graphs.edge_feature = edge_feature

File /usr/local/lib/python3.9/dist-packages/torchdrug/data/graph.py:160, in Graph.__setattr__(self, key, value)
    158 if hasattr(self, "meta_dict"):
    159     self._check_attribute(key, value)
--> 160 super(Graph, self).__setattr__(key, value)

File /usr/local/lib/python3.9/dist-packages/torchdrug/core/core.py:84, in _MetaContainer.__setattr__(self, key, value)
     82     if types:
     83         self.meta_dict[key] = types.copy()
---> 84 self._setattr(key, value)

File /usr/local/lib/python3.9/dist-packages/torchdrug/core/core.py:93, in _MetaContainer._setattr(self, key, value)
     92 def _setattr(self, key, value):
---> 93     return super(_MetaContainer, self).__setattr__(key, value)

AttributeError: can't set attribute

opened by bhadreshpsavani 19

Customized target for retrosynthesis

Hi, thanks for sharing this repo!

I am wondering how I could input an arbitrary target/product for retrosynthesis analysis. What target format would the model require besides SMILES? The notebook performs prediction on the USPTO dataset; I am interested in knowing how I could apply this model to targets outside of USPTO.

Thanks!!
enhancement

opened by juliachen123 11

How to evaluate molecule generation models?

Hi torchdrug team, thank you for the awesome project! I am playing with molecule generation models, and am interested in trying to reproduce the benchmarks posted here: https://torchdrug.ai/docs/benchmark/generation.html

I am able to follow the tutorial for molecule generation: https://torchdrug.ai/docs/tutorials/generation.html

But I found that there was no mention of how we can evaluate models once they are fully trained. Is there any evaluator class or oracle that can be called to obtain the metrics as in your benchmark?

Additionally, do you have any advice on how to set the hyperparameters to fairly reproduce/compare to the GCPN or GraphAF papers?

opened by chaitjo 10

Redesign of the meter logger and integration of the Weights and Biases logger
Feature

This pull request consists of a redesign of the meter logger. There is a new abstract BaseLogger class in torchdrug.utils.loggers.base_logger. The update and log functions of the core.Meter class are now implemented as methods of this class and are called inside the core.Meter methods.

For a new logger, there are 2 abstract methods in BaseLogger: log and save_hyperparams. These two have to be defined for a new custom logger.

The ConsoleLogger is a child of the BaseLogger class and performs the logging as currently done in TorchDrug. The ConsoleLogger is always on, regardless of whether another logger is in use.

Similarly, the WandbLogger is also a child of BaseLogger and logs all the metrics to the user's W&B account. The wandb package is not a dependency and if a user tries to use the WandbLogger without installing wandb, they are prompted to install it.

The constructor of core.Engine is updated to take two more optional arguments:

metric_logger: This can be a str or an instance of the custom logger. The accepted str arguments currently are 'wandb' and 'console' (default value is 'console')

project: The name of the W&B project the user wants to log to (default value is None)

Example:

engine = core.Engine(..., metric_logger='wandb', project='PropertyPrediction')

or

from torchdrug.utils.logger.wandb_logger import WandbLogger
wandb_logger = WandbLogger(project="PropertyPrediction", name="Toxicity Prediction", save_dir="./ClinTox", log_interval=10)
engine = core.Engine(..., metric_logger=wandb_logger)
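For readers who want to write their own logger, the two abstract methods described above could be filled in roughly as follows. This is a self-contained sketch: the `BaseLogger` below is a stand-in defined for this example, and the method signatures (`log(record, step_id)`, `save_hyperparams(params)`) are assumptions, not the signatures used in the pull request.

```python
from abc import ABC, abstractmethod


class BaseLogger(ABC):
    """Stand-in for the PR's BaseLogger; the signatures here are assumptions."""

    @abstractmethod
    def log(self, record, step_id):
        ...

    @abstractmethod
    def save_hyperparams(self, params):
        ...


class MemoryLogger(BaseLogger):
    """Toy custom logger that keeps metrics and hyperparameters in memory."""

    def __init__(self):
        self.records = []
        self.hyperparams = {}

    def log(self, record, step_id):
        # store a copy so later changes to `record` do not alter the history
        self.records.append((step_id, dict(record)))

    def save_hyperparams(self, params):
        self.hyperparams.update(params)


logger = MemoryLogger()
logger.save_hyperparams({"lr": 1e-3, "batch_size": 128})
logger.log({"loss": 0.42}, step_id=10)
print(logger.records)
```

A subclass that leaves either abstract method undefined cannot be instantiated, which is what makes the two methods mandatory for a custom logger.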

A couple of runs for different tasks logged to W&B

https://wandb.ai/manan-goel/TorchDrug-Generation

https://wandb.ai/manan-goel/TorchDrug-Pretrain

https://wandb.ai/manan-goel/TorchDrug-PropertyPrediction

opened by manangoel99 7

`num_relation` mismatches in `message_and_aggregate()`

I would like to use my custom data (https://raw.githubusercontent.com/goga0001/graph/main/data.csv). I prepared the data as a CSV file and followed the implementation of the existing datasets:

import os
from torchdrug import data, utils
from torchdrug.core import Registry as R
from collections import defaultdict
from torch.utils import data as torch_data
from torchdrug import data
from torchdrug.utils import doc


@R.register("datasets.Flavonoid2")
@doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))
class Flavonoid2(data.MoleculeDataset):
    """
    Subset of Flavonoid compound database for virtual screening.

    Statistics:
        - #Molecule:  4806
        - #Regression task: 2

    Parameters:
        path (str): path to store the dataset
        verbose (int, optional): output verbose level
        **kwargs
    """

    csv_file = "/content/torchdrug/torchdrug/datasets/data.csv"
    target_fields = ["logP","qed"]

    def __init__(self, path, verbose=1, **kwargs):
        self.load_csv(self.csv_file, smiles_field="smiles", target_fields=self.target_fields,
                      verbose=verbose, **kwargs)

The molecules were constructed from SMILES, but I get an assertion error:

from torch import nn, optim
optimizer = optim.Adam(task.parameters(), lr = 1e-3)
solver = core.Engine(task, dataset, None, None, optimizer,
                     gpus=(0,), batch_size=128, log_interval=10)

solver.train(num_epoch=1)
solver.save("graphaf_flavonoid_1epoch.pkl")

AssertionError                            Traceback (most recent call last)
<ipython-input-23-a6a027c50b11> in <module>
      4                      gpus=(0,), batch_size=128, log_interval=10)
      5 
----> 6 solver.train(num_epoch=1)
      7 solver.save("graphaf_flavonoid_1epoch.pkl")

10 frames
/content/torchdrug/torchdrug/layers/conv.py in message_and_aggregate(self, graph, input)
    414 
    415     def message_and_aggregate(self, graph, input):
--> 416         assert graph.num_relation == self.num_relation
    417 
    418         node_in, node_out, relation = graph.edge_list.t()

AssertionError:

Thank you! Looking forward to your reply!

opened by goga0001 6

An error:TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases

There is an error when I try to run the following test code.

from torchdrug import data

edge_list = [[0, 1], [1, 2], [2, 3], [3, 4], [4, 5], [5, 0]]
graph = data.Graph(edge_list, num_node=6)
graph = graph.cuda()
# the subgraph induced by nodes 2, 3 & 4
subgraph = graph.subgraph([2, 3, 4])

The error is: TypeError: metaclass conflict: the metaclass of a derived class must be a (non-strict) subclass of the metaclasses of all its bases. How can I fix it?

compatibility

opened by JianBin-Liu 5

The error when run tutorial of retrosynthesis

Dear everyone,

I have installed torchdrug correctly and then followed the tutorial https://torchdrug.ai/docs/tutorials/retrosynthesis.html. When I run the code below:

from torchdrug import datasets

reaction_dataset = datasets.USPTO50k("D:/test/molecule-datasets/",
                                     node_feature="reaction_reaction_identification",
                                     kekulize=True)
synthon_dataset = datasets.USPTO50k("D:/test/molecule-dataset/", as_synthon=True,
                                    node_feature="synthon_completion",
                                    kekulize=True)

The following error occurs:

Loading D:/test/molecule-datasets/data_processed.csv: 100%|██████████| 50017/50017 [00:00<00:00, 92358.37it/s]
Constructing molecules from SMILES:   0%|          | 0/50016 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "F:/workdir/pycharm/Retrosynthesis/main.py", line 5, in <module>
    kekulize=True)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\core\core.py", line 282, in wrapper
    return init(self, *args, **kwargs)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\datasets\uspto50k.py", line 63, in __init__
    **kwargs)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\dataset.py", line 112, in load_csv
    self.load_smiles(smiles, targets, verbose=verbose, **kwargs)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\dataset.py", line 232, in load_smiles
    mol = data.Molecule.from_molecule(mol, **kwargs)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\data\molecule.py", line 183, in from_molecule
    func = R.get("features.atom.%s" % name)
  File "D:\soft\Anaconda3\envs\py37\lib\site-packages\torchdrug-0.1.0-py3.7.egg\torchdrug\core\core.py", line 208, in get
    raise KeyError("Can't find `%s` in `%s`" % (key, ".".join(keys[:i])))
KeyError: "Can't find `reaction_reaction_identification` in `features.atom`"

What is the problem? Could you help me solve it? Thanks.

documentation

opened by Drlittlelab 5

TorchDrug can't use Lr_Scheduler

Hey, I found a bug: once I load the related TorchDrug modules, I can't use torch.optim.lr_scheduler. Look at this picture, which comes from the TorchDrug Colab file (Property Prediction). I added an lr_scheduler for the optimizer, and it throws an error.

However, when I don't load any TorchDrug modules, I can use the optimizer normally.

opened by Mrz-zz 4

ValueError: Fail to parse the docstring of `Smol`. Inconsistent number of parameters in signature and docstring.

I am trying to build a customized dataset for the molecule generation task as follows.

@R.register("datasets.Smol")

@doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))

class Smol(data.MoleculeDataset):

  smiles_file = "/content/drive/MyDrive/molecule_design/resources/smiles_train.csv"
  target_fields = ["SPLIT"]

  def __init__(self, smiles_file, verbose=1, **kwargs):
    self.load_csv(self.smiles_file, smiles_field="smiles", target_fields=self.target_fields,lazy=True,
                      verbose=verbose, **kwargs)
    
  def split(self):
    indexes = defaultdict(list)
    for i, split in enumerate(self.targets["SPLIT"]):
        indexes[split].append(i)
    train_set = torch_data.Subset(self, indexes["train"])
    valid_set = torch_data.Subset(self, indexes["valid"])
    test_set = torch_data.Subset(self, indexes["test"])
    return train_set, valid_set, test_set

but get the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-ada51874abcc> in <module>()
      3 @doc.copy_args(data.MoleculeDataset.load_csv, ignore=("smiles_field", "target_fields"))
      4 
----> 5 class Smol(data.MoleculeDataset):
      6 
      7   smiles_file = "/content/drive/MyDrive/molecule_design/resources/smiles_train.csv"

/usr/local/lib/python3.7/dist-packages/torchdrug/utils/doc.py in wrapper(obj)
     90         if len(docs) != len(parameters):
     91             raise ValueError("Fail to parse the docstring of `%s`. "
---> 92                              "Inconsistent number of parameters in signature and docstring." % obj.__name__)
     93         new_params = []
     94         new_docs = []

ValueError: Fail to parse the docstring of `Smol`. Inconsistent number of parameters in signature and docstring.

Did I miss something?
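The traceback shows the check `len(docs) != len(parameters)`, which suggests that the decorator compares the parameters documented in the class docstring against the parameters in the signature. The sketch below is a simplified stand-in for that kind of check, not TorchDrug's actual parser; `documented_params`, `signature_params` and `check_docstring` are hypothetical helpers. Under that reading, a class without a `Parameters:` section in its docstring would fail the check.

```python
import inspect


def documented_params(cls):
    """Names listed under a 'Parameters:' heading in the class docstring."""
    doc = inspect.cleandoc(cls.__doc__ or "")
    names, in_section = [], False
    for line in doc.splitlines():
        stripped = line.strip()
        if stripped == "Parameters:":
            in_section = True
        elif in_section and stripped:
            # entries look like "name (type): description" or "**kwargs"
            names.append(stripped.split(" ")[0].split(":")[0])
        elif in_section:
            break  # a blank line ends the section
    return names


def signature_params(cls):
    """Parameter names of __init__, excluding self."""
    params = inspect.signature(cls.__init__).parameters
    return [
        "**" + name if p.kind is p.VAR_KEYWORD else name
        for name, p in params.items()
        if name != "self"
    ]


def check_docstring(cls):
    docs, params = documented_params(cls), signature_params(cls)
    if len(docs) != len(params):
        raise ValueError("Inconsistent number of parameters in signature and docstring.")
    return docs


class Documented:
    """
    Toy dataset.

    Parameters:
        path (str): path to store the dataset
        verbose (int, optional): output verbose level
        **kwargs
    """

    def __init__(self, path, verbose=1, **kwargs):
        pass


class Undocumented:
    def __init__(self, smiles_file, verbose=1, **kwargs):
        pass


print(check_docstring(Documented))
try:
    check_docstring(Undocumented)
except ValueError as error:
    print("Undocumented fails:", error)
```

The `Flavonoid2` class quoted earlier on this page carries such a `Parameters:` section, while `Smol` has no docstring at all, which is one visible difference between the two definitions.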

opened by CaiYitao 4

Question about the negative example of KnowledgeGraphCompletion Class

Hi,

In the _strict_negative method of KnowledgeGraphCompletion, suppose 'A-->B' and 'B-->C' (A, B and C are entities, --> is a relation) are samples in the training set (i.e. self.fact_graph), while 'A-->C' is a sample in the validation set. Then I think 'A-->C' will be regarded as a negative sample in the training stage. Is that a problem?

@torch.no_grad()
def _strict_negative(self, pos_h_index, pos_t_index, pos_r_index):
    batch_size = len(pos_h_index)
    any = -torch.ones_like(pos_h_index)

    pattern = torch.stack([pos_h_index, any, pos_r_index], dim=-1)
    pattern = pattern[:batch_size // 2]

    # ==================== Code I Talk About ======================
    edge_index, num_t_truth = self.fact_graph.match(pattern)
    t_truth_index = self.fact_graph.edge_list[edge_index, 1]
    pos_index = functional._size_to_index(num_t_truth)
    t_mask = torch.ones(len(pattern), self.num_entity, dtype=torch.bool, device=self.device)
    t_mask[pos_index, t_truth_index] = 0
    neg_t_candidate = t_mask.nonzero()[:, 1]
    num_t_candidate = t_mask.sum(dim=-1)
    neg_t_index = functional.variadic_sample(neg_t_candidate, num_t_candidate, self.num_negative)
    # =======================================================

    pattern = torch.stack([any, pos_t_index, pos_r_index], dim=-1)
    pattern = pattern[batch_size // 2:]
    edge_index, num_h_truth = self.fact_graph.match(pattern)
    h_truth_index = self.fact_graph.edge_list[edge_index, 0]
    pos_index = functional._size_to_index(num_h_truth)
    h_mask = torch.ones(len(pattern), self.num_entity, dtype=torch.bool, device=self.device)
    h_mask[pos_index, h_truth_index] = 0
    neg_h_candidate = h_mask.nonzero()[:, 1]
    num_h_candidate = h_mask.sum(dim=-1)
    neg_h_index = functional.variadic_sample(neg_h_candidate, num_h_candidate, self.num_negative)

    neg_index = torch.cat([neg_t_index, neg_h_index])

    return neg_index
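The scenario in the question can be reproduced with a small plain-Python sketch. It mimics only the masking idea of the code above (exclude tails that are known to be true in the fact graph) and is not the library's implementation; `strict_negative_tails` is a hypothetical helper, and triplets are written as (head, tail, relation) to match the column order used in the snippet.

```python
def strict_negative_tails(facts, head, relation, num_entity):
    """Candidate negative tails: every entity that is not a known true tail in `facts`."""
    true_tails = {t for h, t, r in facts if h == head and r == relation}
    return [e for e in range(num_entity) if e not in true_tails]


A, B, C = 0, 1, 2
REL = 0
train_facts = [(A, B, REL), (B, C, REL)]  # plays the role of the fact graph
valid_facts = [(A, C, REL)]               # held out during training

# Filtering against training facts only: C stays a candidate negative for (A, REL).
print(strict_negative_tails(train_facts, A, REL, num_entity=3))  # [0, 2]

# Filtering against training and validation facts would remove C as well.
print(strict_negative_tails(train_facts + valid_facts, A, REL, num_entity=3))  # [0]
```

With only the training facts available, the held-out tail C is indeed among the candidates, which is the behavior the question describes.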

opened by AlexHex7 4

Conflict with torch due to overwritten modules

I'm interested to understand why it is necessary to overwrite the default nn.Module of torch in patch.py:

https://github.com/DeepGraphLearning/torchdrug/blob/eeee19181572ef5b8a806b71bdd4d2d1a4e27f67/torchdrug/patch.py#L125

This seems to be a quite invasive thing since it alters the behavior of any torch.nn module after torchdrug has been imported.

For example, your implementation of register_buffer in patch.py lacks the keyword argument persistent which is present in native torch: https://github.com/pytorch/pytorch/blob/989b24855efe0a8287954040c89d679625dcabe1/torch/nn/modules/module.py#L277

I would greatly appreciate it if you could let me know how I can fall back to the native torch behavior after having imported torchdrug somewhere above in my code.
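The kind of breakage described here can be shown with a toy class. The sketch below is generic Python, not torch or torchdrug code: `Module` is a stand-in class, and keeping a handle to the original method is a general monkey-patching technique. Whether restoring the original is safe after importing torchdrug is not established here, since the library may rely on its patched behavior.

```python
class Module:
    """Toy stand-in for a library base class (not torch.nn.Module)."""

    def __init__(self):
        self.buffers = {}
        self.non_persistent = set()

    def register_buffer(self, name, value, persistent=True):
        self.buffers[name] = value
        if not persistent:
            self.non_persistent.add(name)


original_register_buffer = Module.register_buffer  # keep a handle before patching


def patched_register_buffer(self, name, value):
    # the replacement lacks the `persistent` keyword
    self.buffers[name] = value


Module.register_buffer = patched_register_buffer

module = Module()
try:
    module.register_buffer("mask", [1, 0], persistent=False)
except TypeError as error:
    print("patched version fails:", error)

Module.register_buffer = original_register_buffer  # restore the original method
module.register_buffer("mask", [1, 0], persistent=False)
print(module.non_persistent)  # {'mask'}
```

Any caller that passes the dropped keyword fails as soon as the patch is in place, even if that caller never uses the patching library directly.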
help wanted

opened by jannisborn 4

CPU vs GPU

I came up against a weird obstacle: after running the same code for the retrosynthesis prediction task on GPU and CPU (perhaps only the versions of certain libraries differed), I got significantly different results. On GPU the accuracy is much higher. Do you know the reason for this? As far as I understand, even if the results differ, the difference should be pretty small.

opened by DimGorr 0

How to use the generation model to generate specific molecules?

Hello,

I was wondering how I can use the generation model to generate specific molecules. For example, I have a small dataset of molecules I am interested in generating. Should I train the GraphAF model on the ZINC250k dataset and then use property optimization to generate novel molecules with the desired QED and logP properties, or should I train the GraphAF model on my small dataset (around 4k molecules)?

Thank you kindly,

Looking forward to your reply

opened by goga0001 0

num_relation mismatches in message_and_aggregate()

There was another issue on this that was closed but there wasn't really a resolution. The problem occurs when I use a custom dataset. The dataset loads correctly:

import torch

from torchdrug import datasets

dataset = datasets.flav("~/molecule-datasets/", kekulize=True,
                        atom_feature="symbol")

18:01:53   Downloading https://raw.githubusercontent.com/gdeol4/torchdrug/master/flav.csv to /root/molecule-datasets/flav.csv

Loading /root/molecule-datasets/flav.csv: 4807it [00:00, 70415.81it/s]            
Constructing molecules from SMILES: 100%|██████████| 4806/4806 [00:10<00:00, 473.69it/s]

However, when attempting to train a model, I encounter the assertion error:

solver.train(num_epoch=1)
solver.save("graphaf_data_1epoch.pkl")

File /notebooks/torchdrug/torchdrug/layers/conv.py:416, in RelationalGraphConv.message_and_aggregate(self, graph, input)
    415 def message_and_aggregate(self, graph, input):
--> 416     assert graph.num_relation == self.num_relation
    418     node_in, node_out, relation = graph.edge_list.t()
    419     node_out = node_out * self.num_relation + relation

AssertionError:
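One way to read the assertion is through line 419 of the traceback, where each (node, relation) pair is flattened into a single index with `node_out * self.num_relation + relation`. The plain-Python sketch below illustrates that arithmetic only; `flat_index` is a hypothetical helper and the numbers are made up to mirror a layer built for 2 relations receiving a relation id of 2.

```python
def flat_index(node_out, relation, num_relation):
    """Map a (node, relation) pair to a single bucket index."""
    return node_out * num_relation + relation


num_relation = 2
# Relation ids 0 and 1 give distinct buckets for node 0.
print(flat_index(0, 0, num_relation), flat_index(0, 1, num_relation))  # 0 1
# Relation id 2 on node 0 lands in the bucket of relation id 0 on node 1.
print(flat_index(0, 2, num_relation), flat_index(1, 0, num_relation))  # 2 2
```

If the graph carried more relation types than the layer expects, messages from different nodes would be mixed silently, which is presumably what the assertion guards against.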

There does seem to be a mismatch here:

dataset.num_bond_type

2

model.layers[0].num_relation

2

dataset[0]["graph"].num_relation

tensor(3)

My attempt at fixing it:

for data in dataset:
  data['graph'].num_relation = torch.tensor(2)

But the value remains unchanged

opened by gdeol4 0

The accuracy of retrosynthesis are different from the paper

Hello, thanks for sharing this library! The results of https://torchdrug.ai/docs/tutorials/retrosynthesis.html are different from those of G2Gs. When the reaction class is unknown, these are the results reported in the paper:

top-1 accuracy: 0.489
top-3 accuracy: 0.676
top-5 accuracy: 0.725
top-10 accuracy: 0.755

These are the results reported in https://torchdrug.ai/docs/tutorials/retrosynthesis.html:

top-1 accuracy: 0.47541
top-3 accuracy: 0.741803
top-5 accuracy: 0.827869
top-10 accuracy: 0.879098

I cannot understand why the results for k > 2 are higher than those reported in the literature. Thank you very much.
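For reference, top-k accuracy is commonly defined as the fraction of test samples whose true answer appears among the k highest-ranked predictions. The sketch below uses that common definition with made-up toy data; `top_k_accuracy` is a hypothetical helper, and the tutorial or the paper may compute the metric differently (for example in how predictions are matched or deduplicated).

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of samples whose target is among the first k ranked predictions."""
    hits = sum(target in preds[:k] for preds, target in zip(ranked_predictions, targets))
    return hits / len(targets)


# toy rankings for four samples, best prediction first
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"], ["a", "c", "b"]]
truth = ["a", "a", "a", "b"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```

Under this definition the value can only stay the same or grow as k increases, so differences between two reports at larger k come from the ranked candidates themselves rather than from the metric.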
duplicate

opened by z15544534 2

ImportError: No module named 'embedding'

Below is my code

import torch
from torchdrug import core, datasets, tasks, models
from torchdrug.models import RotatE

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

dataset = datasets.FB15k237("~/kg-datasets/")
train_set, valid_set, test_set = dataset.split()


model: RotatE = models.RotatE(num_entity=dataset.num_entity,
                      num_relation=dataset.num_relation,
                      embedding_dim=2048, max_score=9)

task = tasks.KnowledgeGraphCompletion(model, num_negative=256,
                                      adversarial_temperature=1)

optimizer = torch.optim.Adam(task.parameters(), lr=2e-5)
solver= core.Engine(task, train_set, valid_set, test_set, optimizer,
                     gpus=[0], batch_size=1024)
solver.train(num_epoch=100)
solver.evaluate("valid")

Below is the error:

Traceback (most recent call last):
  File "C:\Users\lenovo\PycharmProjects\pythonProject2\main.py", line 23, in <module>
    solver.train(num_epoch=100)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\core\engine.py", line 155, in train
    loss, metric = model(batch)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\tasks\reasoning.py", line 85, in forward
    pred = self.predict(batch, all_loss, metric)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\tasks\reasoning.py", line 160, in predict
    pred = self.model(self.fact_graph, h_index, t_index, r_index, all_loss=all_loss, metric=metric)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\nn\modules\module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\models\embedding.py", line 191, in forward
    score = functional.rotate_score(self.entity, self.relation * self.relation_scale,
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\layers\functional\embedding.py", line 266, in rotate_score
    score = RotatEFunction.apply(entity, relation, h_index, t_index, r_index)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\layers\functional\embedding.py", line 108, in forward
    forward = embedding.rotate_forward_cuda
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\torch.py", line 27, in __getattr__
    return getattr(self.module, key)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\decorator.py", line 102, in __get__
    result = self.func(obj)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torchdrug\utils\torch.py", line 31, in module
    return cpp_extension.load(self.name, self.sources, self.extra_cflags, self.extra_cuda_cflags,
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1079, in load
    return _jit_compile(
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1317, in _jit_compile
    return _import_module_from_library(name, build_directory, is_python_module)
  File "C:\Users\lenovo\.conda\envs\td2\lib\site-packages\torch\utils\cpp_extension.py", line 1699, in _import_module_from_library
    file, path, description = imp.find_module(module_name, [path])
  File "C:\Users\lenovo\.conda\evns\td2\lib\imp.py", line 296, in find_module
    raise ImportError(_ERR_MSG.format(name), name=name)
ImportError: No module named 'embedding'

I've done some research but couldn't figure out why. Can anyone help me here?

opened by iamme1234567 1

Error when trying to use the MPNN implementation on QM9

python = 3.9
torch = 1.13
torchdrug = 0.2.0.post1
torchscatter = 2.1.0
torch cluster = 1.6.0

Code to reproduce error:

import torch
import pickle
from torchdrug import datasets
from torchdrug import core, models, tasks

#dataset = datasets.QM9("~/molecule-datasets/",node_position=True)
#with open("QM9.pkl", "wb") as fout:
#    pickle.dump(dataset, fout)
#exit()
with open("QM9.pkl", "rb") as fin:
    dataset = pickle.load(fin)

lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

model = models.MPNN(input_dim=dataset.node_feature_dim, hidden_dim=256,
                    edge_input_dim=dataset.edge_feature_dim, num_layer=1,
                    num_gru_layer=1, num_mlp_layer=2, num_s2s_step=3,
                    short_cut=False, batch_norm=False, activation='relu',
                    concat_hidden=False)

task = tasks.PropertyPrediction(model, task=dataset.tasks)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0], batch_size=32)
solver.train(num_epoch=1)
#solver.evaluate("valid")

Error:

  File "/home/nhattrup/deep_learning/final_proj/example.py", line 35, in <module>
    solver.train(num_epoch=1)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/core/engine.py", line 155, in train
    loss, metric = model(batch)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/tasks/property_prediction.py", line 96, in forward
    pred = self.predict(batch, all_loss, metric)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/tasks/property_prediction.py", line 134, in predict
    output = self.model(graph, graph.node_feature.float(), all_loss=all_loss, metric=metric)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/models/mpnn.py", line 75, in forward
    x = self.layer(graph, layer_input)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 92, in forward
    update = self.message_and_aggregate(graph, input)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 61, in message_and_aggregate
    message = self.message(graph, input)
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torchdrug/layers/conv.py", line 650, in message
    message = torch.einsum("bed, bd -> be", transform, input[node_in])
  File "/home/nhattrup/.conda/envs/dl/lib/python3.9/site-packages/torch/functional.py", line 378, in einsum
    return _VF.einsum(equation, operands) # type: ignore[attr-defined]
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)

One thing I should note is that I ran into issues similar to those in issue #95, so my pickle file contains only the molecules that loaded properly (fewer than 100 failed to load, so it is essentially the entire dataset). Thanks for any help on this.

opened by NicholasHattrup 0

Releases(v0.2.0)

v0.2.0(Sep 19, 2022)
V0.2.0 is a major release with a new family member TorchProtein, a library for machine-learning-guided protein science. Aiming at simplifying the development of protein methods, TorchProtein encapsulates many complicated yet repetitive subroutines into functional modules, including widely-used datasets, flexible data processing operations, advanced encoding models, and diverse protein tasks.

Such comprehensive encapsulation enables users to develop protein machine learning solutions with one easy-to-use library. It avoids the embarrassment of gluing multiple libraries into a pipeline.

With TorchProtein, we can rapidly prototype machine learning solutions to various protein applications within 20 lines of code, and conduct ablation studies by substituting different parts of the solution with off-the-shelf modules. Furthermore, we can easily adapt these modules to our own needs, and make systematic analyses by comparing the new results to a benchmark provided in the library.

Additionally, TorchProtein is designed to be accessible to everyone. For inexperienced users, like beginners or biological researchers, TorchProtein provides user-friendly APIs to simplify the development of protein machine learning solutions. Meanwhile, for professional users, TorchProtein also preserves enough flexibility to satisfy their demands, supported by features like modular design of the library and on-the-fly graph construction.

Main Features

Simplify Data Processing

It is challenging to transform raw bioinformatic protein datasets into tensor formats for machine learning. To reduce tedious operations, TorchProtein provides us with a data structure data.Protein and its batched extension data.PackedProtein to automate the data processing step.

data.Protein and data.PackedProtein automatically gather protein data from various bio-sources and seamlessly switch between data formats like pdb files, RDKit objects and sequences. Please see the section data structures and operations for transforming from and to sequences and RDKit objects.

from torchdrug import data

# construct a data.Protein instance from a pdb file
pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")
print(protein)

# write a data.Protein instance back to a pdb file
new_pdb_file = ...
protein.to_pdb(new_pdb_file)

Protein(num_atom=445, num_bond=916, num_residue=57)

data.Protein and data.PackedProtein automatically pre-process all kinds of features of atoms, bonds and residues, by simply setting up several arguments.

from torchdrug import data

pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")

# feature
print(protein.residue_feature.shape)
print(protein.atom_feature.shape)
print(protein.bond_feature.shape)

torch.Size([57, 21])
torch.Size([445, 3])
torch.Size([916, 1])

data.Protein and data.PackedProtein automatically keep track of numerous attributes associated with atoms, bonds, residues and the whole protein.

For example, reference offers a way to register new attributes as node, edge or graph properties, so that the new attributes automatically go along with the node, edge or graph itself. More built-in attributes are listed in the section data structures and operations.

import torch

protein = ...
# register one attribute per node, per edge and for the whole graph
with protein.node():
    protein.node_id = torch.tensor([i for i in range(0, protein.num_node)])
with protein.edge():
    protein.edge_cost = torch.rand(protein.num_edge)
with protein.graph():
    protein.graph_feature = torch.randn(128)

Moreover, reference can be used to maintain the correspondence between two related objects. For example, the mapping atom2residue maintains the relationship between atoms and residues, and enables indexing on either of them.

protein = ...

# create mask indices for atoms in glutamine (GLN) residues
is_glutamine = protein.residue_type[protein.atom2residue] == protein.residue2id["GLN"]
mask_indices = is_glutamine.nonzero().squeeze(-1)
print(mask_indices)

# map the masked atoms back to the glutamine residue
residue_type = protein.residue_type[protein.atom2residue[mask_indices]]
print([protein.id2residue[r] for r in residue_type.tolist()])

tensor([ 26, 27, 28, 29, 30, 31, 32, 33, 34, 307, 308, 309, 310, 311, 312, 313, 314, 315, 384, 385, 386, 387, 388, 389, 390, 391, 392]) ['GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN', 'GLN']
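To make the role of atom2residue concrete, here is a minimal pure-Python sketch of the same lookup on a toy protein. It is not TorchProtein code: the toy residue list, the toy mapping and the helper `atoms_of_residue_type` are assumptions made up for illustration.

```python
# Toy protein with 3 residues and 7 atoms. All values are invented.
residue_type = ["GLY", "GLN", "ALA"]   # one entry per residue
atom2residue = [0, 0, 1, 1, 1, 2, 2]   # one entry per atom, pointing at its residue


def atoms_of_residue_type(name):
    """Return the indices of all atoms whose residue has the given type."""
    return [atom for atom, res in enumerate(atom2residue)
            if residue_type[res] == name]


mask_indices = atoms_of_residue_type("GLN")
print(mask_indices)
# Map the selected atoms back to their residue types.
print([residue_type[atom2residue[a]] for a in mask_indices])
```

The mapping is stored per atom, so selecting atoms by a residue-level property is a single gather over atom2residue.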

It is useful to augment protein data by modifying protein graphs or constructing new ones. With the protein operations and the graph construction layers provided in TorchProtein,

we can easily modify proteins on the fly by batching, slicing sequences, masking out side chains, etc. Please see the tutorials for more details on masking.

from torchdrug import data

pdb_file = ...
protein = data.Protein.from_pdb(pdb_file, atom_feature="position", bond_feature="length", residue_feature="symbol")

# batch
proteins = data.Protein.pack([protein, protein, protein])

# slice sequences
# use indexing to extract consecutive residues of a particular protein
two_residues = protein[[0,2]]
two_residues.visualize()

we can construct protein graphs on the fly with GPU acceleration, which offers users flexible choices rather than using fixed pre-processed graphs. Below is an example that builds a graph with only alpha carbon atoms; please check the tutorials for more cases, such as adding spatial / KNN / sequential edges.

from torchdrug import data, layers
from torchdrug.layers import geometry

protein = ...
# transfer from CPU to GPU
protein = protein.cuda()
print(protein)

# build a graph with only alpha carbon (CA) atoms
node_layers = [geometry.AlphaCarbonNode()]
graph_construction_model = layers.GraphConstruction(node_layers=node_layers)

original_protein = data.Protein.pack([protein])
CA_protein = graph_construction_model(original_protein)
print("Graph before:", original_protein)
print("Graph after:", CA_protein)

Protein(num_atom=445, num_bond=916, num_residue=57, device='cuda:0')
Graph before: PackedProtein(batch_size=1, num_atoms=[2639], num_bonds=[5368], num_residues=[350])
Graph after: PackedProtein(batch_size=1, num_atoms=[350], num_bonds=[0], num_residues=[350])
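The output above shows one atom per residue and zero bonds after the alpha carbon layer. A small pure-Python sketch suggests why the bond count drops to zero: alpha carbons are not bonded directly to each other, so no bond survives once all other atoms are removed. The function `alpha_carbon_only` and the toy backbone are hypothetical and only mimic the idea, not the library's implementation.

```python
def alpha_carbon_only(atom_names, bonds):
    """Keep atoms named "CA" and only the bonds whose two ends both survive."""
    keep = [i for i, name in enumerate(atom_names) if name == "CA"]
    new_id = {old: new for new, old in enumerate(keep)}  # compact the node ids
    new_bonds = [(new_id[a], new_id[b]) for a, b in bonds
                 if a in new_id and b in new_id]
    return keep, new_bonds


# Backbone of a toy dipeptide: N-CA-C(=O)-N-CA-C(=O)
atom_names = ["N", "CA", "C", "O", "N", "CA", "C", "O"]
bonds = [(0, 1), (1, 2), (2, 3), (2, 4), (4, 5), (5, 6), (6, 7)]

kept_atoms, kept_bonds = alpha_carbon_only(atom_names, bonds)
print(kept_atoms)
print(kept_bonds)
```

Because the resulting graph has no edges, edge layers such as spatial, KNN or sequential edges are typically added afterwards.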

Easy to Prototype Solutions

With TorchProtein, common protein tasks, such as sequence-based protein property prediction, can be completed within 20 lines of code. Below is an example; more examples of popular protein tasks and models can be found in Protein Tasks, Models and Tutorials.

import torch
from torchdrug import datasets, transforms, models, tasks, core

truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
protein_view_transform = transforms.ProteinView(view="residue")
transform = transforms.Compose([truncate_transform, protein_view_transform])

dataset = datasets.BetaLactamase("~/protein-datasets/", residue_only=True, transform=transform)
train_set, valid_set, test_set = dataset.split()

model = models.ProteinCNN(input_dim=21, hidden_dims=[1024, 1024], kernel_size=5, padding=2, readout="max")
task = tasks.PropertyPrediction(model, task=dataset.tasks, criterion="mse", metric=("mae", "rmse", "spearmanr"), normalization=False, num_mlp_layer=2)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0], batch_size=64)
solver.train(num_epoch=10)
solver.evaluate("valid")

mean absolute error [scaled_effect1]: 0.249482
root mean squared error [scaled_effect1]: 0.304326
spearmanr [scaled_effect1]: 0.44572

Compatible with Existing Molecular Models in TorchDrug

TorchProtein follows the scientific fact that proteins are macromolecules. The core data structures data.Protein and data.PackedProtein inherit from data.Molecule and data.PackedMolecule respectively. Therefore, we can apply any existing molecule model in TorchDrug to proteins.

import torch
from torchdrug import layers, datasets, transforms, models, tasks, core
from torchdrug.layers import geometry

truncate_transform = transforms.TruncateProtein(max_length=200, random=False)
protein_view_transform = transforms.ProteinView(view="residue")
transform = transforms.Compose([truncate_transform, protein_view_transform])

dataset = datasets.EnzymeCommission("~/protein-datasets/", transform=transform)
train_set, valid_set, test_set = dataset.split()

model = models.GIN(input_dim=21, hidden_dims=[256, 256, 256, 256], batch_norm=True, short_cut=True, concat_hidden=True)
graph_construction_model = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[geometry.SpatialEdge(radius=10.0, min_distance=5),
                 geometry.KNNEdge(k=10, min_distance=5),
                 geometry.SequentialEdge(max_distance=2)],
    edge_feature="residue_type"
)
task = tasks.MultipleBinaryClassification(model, graph_construction_model=graph_construction_model, num_mlp_layer=3, task=list(range(len(dataset.tasks))), criterion="bce", metric=("auprc@micro", "f1_max"))

optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0], batch_size=4)
solver.train(num_epoch=10)
solver.evaluate("valid")

auprc@micro: 0.187884
f1_max: 0.231008
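The example above adds spatial edges between alpha carbons. As a rough illustration of what a radius-based edge layer computes, the pure-Python sketch below connects points that lie closer than a radius while skipping pairs that are close in the sequence. The function `spatial_edges`, its parameter names and the reading of min_distance as a sequence separation are assumptions of this sketch, not a description of geometry.SpatialEdge.

```python
import math


def spatial_edges(coords, radius, min_seq_distance=0):
    """Directed edges (i, j) between points closer than `radius`.

    Pairs whose sequence positions differ by less than `min_seq_distance`
    are skipped, which also excludes self loops when the value is >= 1.
    """
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i == j or abs(i - j) < min_seq_distance:
                continue
            if math.dist(a, b) < radius:
                edges.append((i, j))
    return edges


# Four toy residues on a line; the last one is far from the rest.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(spatial_edges(coords, radius=2.5, min_seq_distance=2))
```

This brute-force version is quadratic in the number of nodes and is meant only to show the selection rule.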

In the Protein-Ligand Interaction (PLI) prediction task, we can utilize a molecular encoder module to extract the representations of molecules. Please check tutorial 2 for more details.

import torch
from torchdrug import models, tasks, core

dataset = ...  # a protein-ligand interaction dataset
train_set, valid_set, test_set = ...

# protein encoder
model = models.ProteinCNN(input_dim=21, hidden_dims=[1024, 1024], kernel_size=5, padding=2, readout="max")
# molecule encoder
model2 = models.GIN(input_dim=66, hidden_dims=[256, 256, 256, 256], batch_norm=True, short_cut=True, concat_hidden=True)
task = tasks.InteractionPrediction(model, model2=model2, task=dataset.tasks, criterion="mse", metric=("mae", "rmse", "spearmanr"), normalization=False, num_mlp_layer=2)

optimizer = torch.optim.Adam(task.parameters(), lr=1e-4)
solver = core.Engine(task, train_set, valid_set, test_set, optimizer, gpus=[0], batch_size=16)
solver.train(num_epoch=5)
solver.evaluate("valid")

mean absolute error [scaled_effect1]: 0.249482
root mean squared error [scaled_effect1]: 0.304326
spearmanr [scaled_effect1]: 0.44572

Support From the Developer (@DeepGraphLearning/torchdrug-maintainers)

There is always an active support team to answer questions and provide help. Feedback on user experience and contributions to development are welcome.

New Modules

Data Structures and Operations

data.Protein

Representative attributes:

data.Protein.edge_list: list of edges and each edge is represented by a tuple (node_in, node_out, bond_type)

data.Protein.atom_type: atom types

data.Protein.bond_type: bond types

data.Protein.residue_type: residue types

data.Protein.view: default view for this protein. Can be “atom” or “residue”

data.Protein.atom_name: atom names in each residue

data.Protein.atom2residue: atom id to residue id mapping

data.Protein.is_hetero_atom: hetero atom indicator

data.Protein.occupancy: protein occupancy

data.Protein.b_factor: temperature factors

data.Protein.residue_number: residue numbers

data.Protein.insertion_code: insertion codes

data.Protein.chain_id: chain ids

Representative methods:

data.Protein.from_molecule: create a protein from an RDKit object.

data.Protein.from_sequence: create a protein from a sequence.

data.Protein.from_sequence_fast: a faster version of creating a protein from a sequence.

data.Protein.from_pdb: create a protein from a PDB file.

data.Protein.to_molecule: return an RDKit object of this protein.

data.Protein.to_sequence: return a sequence of this protein.

data.Protein.to_pdb: write this protein to a pdb file.

data.Protein.split: split this protein graph into multiple disconnected protein graphs.

data.Protein.pack: batch a list of data.Protein into data.PackedProtein.

data.Protein.repeat: repeat this protein.

data.Protein.residue2atom: map residue id to atom ids.

data.Protein.residue_mask: return a masked protein based on the specified residues.

data.Protein.subresidue: return a subgraph based on the specified residues.

data.Protein.residue2graph: residue id to protein id mapping.

data.Protein.node_mask: return a masked protein based on the specified nodes.

data.Protein.edge_mask: return a masked protein based on the specified edges.

data.Protein.compact: remove isolated nodes and compact node ids.

data.PackedProtein

Representative attributes:

data.PackedProtein.edge_list: list of edges and each edge is represented by a tuple (node_in, node_out, bond_type)

data.PackedProtein.atom_type: atom types

data.PackedProtein.bond_type: bond types

data.PackedProtein.residue_type: residue types

data.PackedProtein.view: default view for this protein. Can be “atom” or “residue”

data.PackedProtein.num_nodes: number of nodes in each protein graph

data.PackedProtein.num_edges: number of edges in each protein graph

data.PackedProtein.num_residues: number of residues in each protein graph

data.PackedProtein.offsets: node id offsets in different proteins

Representative methods:

data.PackedProtein.node_mask: return a masked packed protein based on the specified nodes.

data.PackedProtein.edge_mask: return a masked packed protein based on the specified edges.

data.PackedProtein.residue_mask: return a masked packed protein based on the specified residues.

data.PackedProtein.graph_mask: return a masked packed protein based on the specified protein graphs.

data.PackedProtein.from_molecule: create a protein from a list of RDKit objects.

data.PackedProtein.from_sequence: create a protein from a list of sequences.

data.PackedProtein.from_sequence_fast: a faster version of creating a protein from a list of sequences.

data.PackedProtein.from_pdb: create a protein from a list of PDB files.

data.PackedProtein.to_molecule: return a list of RDKit objects of this packed protein.

data.PackedProtein.to_sequence: return a list of sequences of this packed protein.

data.PackedProtein.to_pdb: write this packed protein to a list of pdb files.

data.PackedProtein.merge: merge multiple packed proteins into a single packed protein.

data.PackedProtein.repeat: repeat this packed protein.

data.PackedProtein.repeat_interleave: repeat this packed protein, behaving similarly to torch.repeat_interleave.

data.PackedProtein.residue2graph: residue id to graph id mapping.

Models

GearNet: Geometry Aware Relational Graph Neural Network.

ESM: Evolutionary Scale Modeling (ESM).

ProteinCNN: protein shallow CNN.

ProteinResNet: protein ResNet.

ProteinLSTM: protein LSTM.

ProteinBERT: protein BERT.

Statistic: the statistic feature engineering for protein sequence.

Physicochemical: the physicochemical feature engineering for protein sequence.

Protein Tasks

Sequence-based Protein Property Prediction:

tasks.PropertyPrediction predicts some property of each protein, such as Beta-lactamase activity, stability and solubility for proteins.

tasks.NodePropertyPrediction predicts some property of each residue in proteins, such as the secondary structure (coil, strand or helix) of each residue.

tasks.ContactPrediction predicts whether any pair of residues contact or not in the folded structure.

tasks.InteractionPrediction predicts the binding affinity of two interacting proteins or of a protein and a ligand, i.e. performing PPI affinity prediction or PLI affinity prediction.

Structure-based Protein Property Prediction:

tasks.MultipleBinaryClassification predicts whether a protein owns several specific functions or not with binary labels.

Pre-trained Protein Structure Representations:

Self-Supervised Protein Structure Pre-training: acquires informative protein representations from massive unlabeled protein structures, using tasks such as tasks.EdgePrediction, tasks.AttributeMasking, tasks.ContextPrediction, tasks.DistancePrediction, tasks.AnglePrediction and tasks.DihedralPrediction.

Fine-tuning on Downstream Task: fine-tunes the pre-trained protein encoder on downstream tasks, such as any property prediction task mentioned above.

Protein Datasets

Protein Property Prediction Datasets

BetaLactamase: protein sequences with activity labels

Fluorescence: protein sequences with fitness labels

Stability: protein sequences with stability labels

Solubility: protein sequences with solubility labels

BinaryLocalization: protein sequences with membrane-bound or soluble labels

SubcellularLocalization: protein sequences with natural cell location labels

EnzymeCommission: protein sequences and 3D structures with EC number labels for catalysis in biochemical reactions

GeneOntology: protein sequences and 3D structures with GO term labels, including molecular function (MF), biological process (BP) and cellular component (CC)

AlphaFoldDB: protein sequences and 3D structures predicted by AlphaFold

Protein Structure Prediction Datasets

Fold: protein sequences and 3D structures with fold labels determined by the global structural topology

SecondaryStructure: protein sequences and 3D structures with secondary structure labels determined by the local structures

ProteinNet: protein sequences and 3D structures for the contact prediction task

Protein-Protein Interaction Prediction Datasets

HumanPPI: protein sequences with binary interaction labels for human proteins

YeastPPI: protein sequences with binary interaction labels for yeast proteins

PPIAffinity: protein sequences with binding affinity values measured by $p_{K_d}$

Protein Ligand Interaction Prediction Datasets

BindingDB: protein sequences and molecule graphs with binding affinity between pairs of protein and ligand

PDBBind: protein sequences and molecule graphs with binding affinity between pairs of protein and ligand

Data Transform Modules

TruncateProtein: truncate overly long protein sequences to a fixed length

ProteinView: convert proteins to a specific view
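The truncation transform is described above only by its purpose, so here is a minimal pure-Python sketch of one plausible reading. The function `truncate_sequence` and its parameters are hypothetical, and the choice to keep the leading residues when no random window is requested is an assumption of this sketch rather than documented behaviour of TruncateProtein.

```python
import random


def truncate_sequence(sequence, max_length, use_random=False, rng=random):
    """Return at most `max_length` consecutive residues of `sequence`.

    Short sequences are returned unchanged. With `use_random=True` the window
    start is drawn uniformly; otherwise the window starts at position 0.
    """
    if len(sequence) <= max_length:
        return sequence
    start = rng.randint(0, len(sequence) - max_length) if use_random else 0
    return sequence[start:start + max_length]


print(truncate_sequence("MKTAYIAKQR", max_length=5))
print(len(truncate_sequence("MKTAYIAKQR", max_length=5, use_random=True)))
```

A fixed maximum length keeps batch memory bounded, while a random window exposes different parts of long proteins across epochs.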

Graph Construction Layers

SubsequenceNode: take a protein subsequence of a specific length

SubspaceNode: extract a subgraph by only keeping neighboring nodes in a spatial ball for each centered node

RandomEdgeMask: mask out some edges randomly from the protein graph

Tutorials

To help users gain a comprehensive understanding of TorchProtein, we recommend some user-friendly tutorials covering its basic usage and examples of various protein-related tasks. These tutorials may also serve as boilerplate code for users to develop their own applications.

Tutorial 1 - Protein Data Structure for basic usage of TorchProtein, like how to represent proteins and what operations are feasible.

Tutorial 2 - Sequence-based Protein Property Prediction for 5 types of protein sequence understanding tasks provided in TorchProtein.

Tutorial 3 - Structure-based Protein Property Prediction for property prediction tasks based on protein structure representations.

Tutorial 4 - Pre-trained Protein Structure Representations for self-supervised pre-training of protein structure encoders and their fine-tuning on downstream tasks.

Bug Fixes

Fix an error in the decorator @utils.cached (#118)

Fix an index error in data.Graph.split() (#115)

Fix setting attributes node_feature, edge_feature and graph_feature (#116)

Fix incorrect node feature shape for the synthon dataset USPTO50k (#116)

Fix a compatibility issue when adding node/edge/graph reference and changing node/edge to atom/bond (#116, #117)

v0.1.3(Jun 4, 2022)
The TorchDrug 0.1.3 release introduces new features like W&B integration and index reference. It also provides new functions and metrics for common development needs. Note that 0.1.3 has some compatibility changes, so be careful when you update TorchDrug from an older version.

W&B Integration

Index Reference

New Functions

New Metrics

Improvements

Bug Fixes

Compatibility Changes

W&B Integration

Tracking experiment progress is one of the most important demands from ML researchers and developers. For TorchDrug users, we provide a native integration of the W&B platform. By adding only one argument to core.Engine, TorchDrug will automatically copy every hyperparameter and training log to your W&B database (thanks to @manangoel99).

solver = core.Engine(task, train_set, valid_set, test_set, optimizer, logger="wandb")

Now you can track your training and validation performance in your browser, and compare them across different experiments.

Index Reference

Maintaining node and edge attributes could be painful when one applies a lot of transformations to a graph. TorchDrug aims to eliminate such tedious steps by registering custom attributes. This update extends the capacity of custom attributes to index reference. That means, we allow attributes to refer to indexes of nodes, edges or graphs, and they will be automatically maintained in any graph operation.

To use index reference, simply add a context manager when we define the attributes.

with graph.edge(), graph.edge_reference():
    graph.inv_edge_index = torch.tensor(inv_edge_index)

For more details on index reference, please take a look at our notes. Typical use cases include

A pointer to the inverse edge of each edge.

A pointer to the parent node of each node in a tree.

A pointer to the incoming tree edge of each node in a DFS.

Let us know if you find more interesting usage of index reference!
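To illustrate why an index-valued attribute needs special treatment, the pure-Python sketch below masks a few edges and remaps an inverse-edge pointer so that it still points at the right edge afterwards. The helper `mask_edges_with_reference` is hypothetical, and marking dangling pointers with -1 is a choice of this sketch, not a statement about how TorchDrug handles them.

```python
def mask_edges_with_reference(edges, inv_edge_index, keep):
    """Keep the edges at positions `keep` and renumber the reference.

    `inv_edge_index[i]` is the position of the inverse of edge i. After
    masking, positions shift, so every pointer is translated to the new
    numbering. A pointer to a removed edge becomes -1.
    """
    keep = sorted(keep)
    new_index = {old: new for new, old in enumerate(keep)}
    new_edges = [edges[old] for old in keep]
    new_inv = [new_index.get(inv_edge_index[old], -1) for old in keep]
    return new_edges, new_inv


edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
inv_edge_index = [1, 0, 3, 2]  # edge 0 <-> edge 1, edge 2 <-> edge 3

new_edges, new_inv = mask_edges_with_reference(edges, inv_edge_index, keep=[0, 2, 3])
print(new_edges)
print(new_inv)
```

An ordinary edge attribute would only be sliced by the mask; a reference additionally has its values rewritten, which is the bookkeeping that index reference automates.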

New Functions

Message passing over line graphs has become increasingly popular in recent years. This version provides data.Graph.line_graph to efficiently construct line graphs on GPUs. It supports both a single graph and a batch of graphs.
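For readers unfamiliar with line graphs, the following pure-Python sketch builds one from a directed edge list using the textbook definition: every edge becomes a node, and edge i is connected to edge j when i ends at the node where j starts. The function `line_graph` here is a stand-alone illustration, and its direction convention is an assumption that may differ from data.Graph.line_graph.

```python
def line_graph(edge_list):
    """Return line-graph edges (i, j) where edge i ends where edge j starts."""
    starts = {}
    for j, (src, _dst) in enumerate(edge_list):
        starts.setdefault(src, []).append(j)
    return [(i, j)
            for i, (_src, dst) in enumerate(edge_list)
            for j in starts.get(dst, [])]


# A directed triangle 0 -> 1 -> 2 -> 0
edge_list = [(0, 1), (1, 2), (2, 0)]
print(line_graph(edge_list))
```

Grouping edges by their start node first avoids comparing every pair of edges.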

We are constantly focusing on better batching of irregular structures, and the variadic functions in TorchDrug are an efficient way to process batches of variadic-sized tensors without padding. This update introduces 3 new variadic functions.

variadic_meshgrid generates a meshgrid from two variadic tensors. Useful for implementing pairwise operations.

variadic_to_padded converts a variadic tensor to a padded tensor.

padded_to_variadic converts a padded tensor to a variadic tensor.
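The relation between the variadic and the padded layout can be sketched with plain lists. The two functions below reuse the names from the list above for readability, but they are simplified stand-ins written for this note: the signatures, the returned mask and the default padding value are assumptions and should not be read as the TorchDrug API.

```python
def variadic_to_padded(values, sizes, pad=0):
    """Split a flat list into rows of the given sizes and pad to the longest row."""
    width = max(sizes, default=0)
    padded, mask, start = [], [], 0
    for size in sizes:
        row = values[start:start + size]
        padded.append(row + [pad] * (width - size))
        mask.append([True] * size + [False] * (width - size))
        start += size
    return padded, mask


def padded_to_variadic(padded, sizes):
    """Inverse operation: drop the padding and concatenate the rows."""
    return [x for row, size in zip(padded, sizes) for x in row[:size]]


values = [1, 2, 3, 4, 5, 6]
sizes = [2, 1, 3]  # three sets stored back to back
padded, mask = variadic_to_padded(values, sizes)
print(padded)
print(padded_to_variadic(padded, sizes))
```

The variadic layout stores only the real elements plus one size per set, which is where the memory saving over padding comes from.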

New Metrics

New metrics include accuracy, matthews_corrcoef, pearsonr, spearmanr. All the metrics are the same as their counterparts in scipy, but they are implemented in PyTorch and support auto differentiation.
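As a reminder of what spearmanr measures, the sketch below computes it in pure Python as the Pearson correlation of the ranks, with tied values sharing their average rank. These helper functions are written for this note only and are not the PyTorch implementations shipped with TorchDrug.

```python
import math


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearsonr(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearmanr(x, y):
    """Spearman correlation as the Pearson correlation of the ranks."""
    return pearsonr(rank(x), rank(y))


print(spearmanr([0.1, 0.4, 0.9, 2.5], [1.0, 8.0, 27.0, 64.0]))
```

Any monotonically increasing relation yields a value of 1 up to rounding, because only the ordering of the values matters.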

Improvements

Add data.Graph.to (#70, thanks to @cthoyt)

Extend tasks.SynthonCompletion for arbitrary atom features (#62)

Speed up lazy data loading (#58, thanks to @wconnell)

Speed up rspmm cuda kernels

Add docker support

Add more documentation for data.Graph and data.Molecule

Bug Fixes

Fix computation of output dimension in several GNNs (#92, thanks to @kanojikajino)

Fix data.PackedGraph.__getitem__ when the batch is empty (#83, thanks to @jannisborn)

Fix patched modules for PyTorch>=1.6.0 (#77)

Fix make_configurable for torch.utils.data (#85)

Fix multi_slice_mask, variadic_max for multi-dimensional input

Fix variadic_topk for input containing infinite values

Compatibility Changes

TorchDrug now supports Python 3.7/3.8/3.9. Starting from this version, TorchDrug requires a minimal PyTorch version of 1.8.0 and a minimal RDKit version of 2020.09.

Arguments node_feature and edge_feature are renamed to atom_feature and bond_feature in data.Molecule.from_smiles and data.Molecule.from_molecule. The old interface is still supported with deprecation warnings.
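A common way to keep an old keyword working while steering users to the new one is sketched below. The function `from_smiles_compat` is a hypothetical stand-in that only resolves the arguments; it does not reproduce how TorchDrug implements the renaming.

```python
import warnings


def from_smiles_compat(smiles, atom_feature="default", bond_feature="default", **kwargs):
    """Accept the old keyword names, warn, and forward them to the new ones."""
    renamed = {"node_feature": "atom_feature", "edge_feature": "bond_feature"}
    resolved = {"atom_feature": atom_feature, "bond_feature": bond_feature}
    for old, new in renamed.items():
        if old in kwargs:
            warnings.warn(f"`{old}` is deprecated, use `{new}` instead",
                          DeprecationWarning, stacklevel=2)
            resolved[new] = kwargs.pop(old)
    if kwargs:
        raise TypeError(f"unexpected arguments: {sorted(kwargs)}")
    return smiles, resolved


# New-style call: no warning is emitted.
print(from_smiles_compat("CCOC(=O)N", atom_feature="default"))
```

Calling it with node_feature="default" returns the same result and additionally emits a DeprecationWarning.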
v0.1.2(Oct 23, 2021)
0.1.2 Release Notes

The recent 0.1.2 release of TorchDrug is an update on Colab tutorials, data structures, functions, datasets and bug fixes. We are grateful to see growing interest and involvement from the community, especially on the retrosynthesis task. We welcome more in the future!

Colab Tutorials

New Data Structures

New Functions

New Datasets

Bug Fixes

Colab Tutorials

To familiarize users with the logic and capacity of TorchDrug, we compile a full set of Colab tutorials, covering from basic usage to different drug discovery tasks. All the tutorials are fully interactive and may serve as boilerplate code for your own applications.

Basic Usage and Pipeline shows the manipulation of data structures like data.Graph and data.Molecule, as well as the training and evaluation pipelines for property prediction models.

Pretrained Molecular Representations demonstrates the steps for self-supervised pretraining of a molecular representation model and finetuning it on downstream tasks.

De novo Molecule Design illustrates the routine of training generative models for molecule generation and finetuning them with reinforcement learning for property optimization. Two popular models, GCPN and GraphAF, are covered in the tutorial.

Retrosynthesis shows how to use the state-of-the-art model, G2Gs, to predict a set of reactants for synthesizing a target molecule.

Knowledge Graph Reasoning goes through the steps of training and evaluating models for knowledge graph completion, including both knowledge graph embeddings and neural inductive logic programming.

New Data Structures

A new data structure data.Dictionary stores key-value mappings of PyTorch tensors on either CPUs or GPUs. It enjoys O(n) memory consumption and O(1) query time, and supports parallelism over a batch of queries. This API provides a great opportunity for implementing sparse lookup tables or set operations in a PyTorchic style.

A new method data.Graph.match efficiently retrieves all edges matching specific patterns on either CPUs or GPUs. It scales linearly w.r.t. the number of patterns plus the number of retrieved edges, regardless of the size of the graph. Typical uses of this method include querying the existence of edges, generating random walks or even extracting ego graphs.
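The idea of pattern-based edge retrieval can be sketched in pure Python with hash indexes, so that answering a pattern costs time proportional to the number of edges returned once the index exists. The function names, the use of -1 as a wildcard and the lazily built indexes are assumptions of this sketch and do not describe the internals of data.Graph.match.

```python
from collections import defaultdict


def build_index(edge_list, fields):
    """Group edge positions by the values they hold in the given fields."""
    index = defaultdict(list)
    for i, edge in enumerate(edge_list):
        index[tuple(edge[f] for f in fields)].append(i)
    return index


def match(edge_list, patterns, wildcard=-1):
    """For each pattern, return positions of edges equal on all non-wildcard fields."""
    indexes = {}
    results = []
    for pattern in patterns:
        fields = tuple(f for f, p in enumerate(pattern) if p != wildcard)
        if fields not in indexes:  # build each index once and reuse it
            indexes[fields] = build_index(edge_list, fields)
        key = tuple(pattern[f] for f in fields)
        results.append(indexes[fields].get(key, []))
    return results


edge_list = [(0, 1), (0, 2), (1, 2), (2, 0)]
patterns = [(0, -1), (-1, 2), (2, 0), (3, -1)]
print(match(edge_list, patterns))
```

The first pattern asks for all edges leaving node 0, the second for all edges entering node 2, the third tests whether the edge (2, 0) exists, and the last has no match.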

New Functions

Batching irregular structures, such as graphs, sets or sequences with different sizes, is a common demand in drug discovery. Instead of a clumsy padding-based implementation, TorchDrug provides a family of functions that efficiently manipulate batches of variadic-sized tensors without padding. The update contains the following new variadic functions.

variadic_arange returns a 1-D tensor that contains integer intervals of variadic sizes.

variadic_softmax computes softmax over categories with variadic sizes.

variadic_sort sorts elements in sets with variadic sizes.

variadic_randperm returns random permutations for sets with variadic sizes, where the i-th permutation contains integers from 0 to size[i] - 1.

variadic_sample draws samples with replacement from sets with variadic sizes.
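Two of the functions listed above can be mimicked with plain lists to show what "variadic" means in practice: all sets are stored back to back in one flat list, and a list of sizes marks where each set ends. The functions below borrow the names for readability but are simplified pure-Python stand-ins with assumed signatures, not the TorchDrug implementations.

```python
import math


def variadic_arange(sizes):
    """Concatenate range(size) for every size."""
    return [i for size in sizes for i in range(size)]


def variadic_softmax(values, sizes):
    """Softmax computed independently inside each consecutive segment."""
    out, start = [], 0
    for size in sizes:
        segment = values[start:start + size]
        start += size
        if not segment:
            continue  # an empty set contributes nothing
        peak = max(segment)  # subtract the maximum for numerical stability
        exps = [math.exp(v - peak) for v in segment]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out


sizes = [2, 3]
print(variadic_arange(sizes))
print(variadic_softmax([0.0, 0.0, 1.0, 1.0, 1.0], sizes))
```

Each segment of the softmax output sums to one, regardless of how many elements the other segments contain.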

New Datasets

PCQM4M: A large-scale molecule property prediction dataset, originally used in OGB-LSC (thanks to @OPAYA)

Bug Fixes

Fix import of sascorer in plogp evaluation (#18, #31)

Fix atoms with stereo bonds in retrosynthesis (#42, #43)

Fix lazy construction for molecule datasets (#30, thanks to @DaShenZi721)

Fix ChEMBLFiltered dataset (#36)

Fix ZINC2m dataset (#33)

Fix USPTO50k dataset (#32)

Fix bugs in core.Configurable (#26)

Fix/improve documentation (#16, #28, #41)

Fix installation on macOS (#29)
