
HuggingMolecules

Overview

We envision models that are pre-trained on a vast range of domain-relevant tasks to become key for molecule property prediction. This repository aims to give easy access to state-of-the-art pre-trained models.

Quick tour

To quickly fine-tune a model on a dataset using the pytorch lightning package, follow the example below, based on the MAT model and the freesolv dataset:

from huggingmolecules import MatModel, MatFeaturizer

# The following import works only from the source code directory:
from experiments.src import TrainingModule, get_data_loaders

from torch.nn import MSELoss
from torch.optim import Adam

from pytorch_lightning import Trainer
from pytorch_lightning.metrics import MeanSquaredError

# Build and load the pre-trained model and the appropriate featurizer:
model = MatModel.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')

# Build the pytorch lightning training module:
pl_module = TrainingModule(model,
                           loss_fn=MSELoss(),
                           metric_cls=MeanSquaredError,
                           optimizer=Adam(model.parameters()))

# Build the data loader for the freesolv dataset:
train_dataloader, _, _ = get_data_loaders(featurizer,
                                          batch_size=32,
                                          task_name='ADME',
                                          dataset_name='hydrationfreeenergy_freesolv')

# Build the pytorch lightning trainer and fine-tune the module on the train dataset:
trainer = Trainer(max_epochs=100)
trainer.fit(pl_module, train_dataloader=train_dataloader)

# Make the prediction for the batch of SMILES strings:
batch = featurizer(['C/C=C/C', '[C]=O'])
output = pl_module.model(batch)
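
The output above is the raw output of the model's final layer. Below is a minimal sketch for inspecting the predicted values, assuming the model returns a plain torch tensor (of shape (batch_size, 1) in this regression setup):

import torch

# Switch to evaluation mode and disable gradient tracking for inference:
pl_module.model.eval()
with torch.no_grad():
    output = pl_module.model(batch)

# Convert the predictions to a plain Python list for inspection:
print(output.cpu().tolist())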

Installation

Create your conda environment and install the rdkit package:

conda create -n huggingmolecules python=3.8.5
conda activate huggingmolecules
conda install -c conda-forge rdkit==2020.09.1

Then install huggingmolecules from the cloned directory:

conda activate huggingmolecules
pip install -e ./src
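
To sanity-check the installation, you can build a pre-trained model and featurizer from the Python interpreter (mat_masking_20M is the pre-trained weights name used in the quick tour above; the weights are downloaded on first use):

from huggingmolecules import MatModel, MatFeaturizer

# Downloads and caches the pre-trained weights on first use:
model = MatModel.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')
print(type(model).__name__, type(featurizer).__name__)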

Project Structure

The project consists of two main modules: src/ and experiments/:

  • The src/ module contains abstract interfaces for pre-trained models along with their implementations based on the pytorch library. This module makes it easy to configure, download and run existing models out of the box.
  • The experiments/ module makes use of the abstract interfaces defined in the src/ module and implements scripts based on the pytorch lightning package for running various experiments. This module makes training, benchmarking and hyper-parameter tuning of models straightforward and easily extensible.

Supported model architectures

Huggingmolecules currently provides the following model architectures:

For ease of benchmarking, we also include wrappers in the experiments/ module for three other model architectures:

The src/ module

The implementation of each model in the src/ module is divided into three submodules: configuration, featurization and models. The relation between these submodules is shown in the following examples, based on the MAT model:

Configuration examples

from huggingmolecules import MatConfig

# Build the config with default parameter values,
# except the 'd_model' parameter, which is set to 1200:
config = MatConfig(d_model=1200)

# Build the pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')

# Build the pre-defined config with 'init_type' parameter set to 'normal':
config = MatConfig.from_pretrained('mat_masking_20M', init_type='normal')

# Save the pre-defined config with the previous modification:
config.save_to_cache('mat_masking_20M_normal.json')

# Restore the previously saved config:
config = MatConfig.from_pretrained('mat_masking_20M_normal.json')

Featurization examples

from huggingmolecules import MatConfig, MatFeaturizer

# Build the featurizer with pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer(config)

# Build the featurizer in one line:
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')

# Encode (featurize) the batch of two SMILES strings: 
batch = featurizer(['C/C=C/C', '[C]=O'])

Models examples

from huggingmolecules import MatConfig, MatFeaturizer, MatModel

# Build the model with the pre-defined config:
config = MatConfig.from_pretrained('mat_masking_20M')
model = MatModel(config)

# Load the pre-trained weights 
# (which do not include the last layer of the model)
model.load_weights('mat_masking_20M')

# Build the model and load the pre-trained weights in one line:
model = MatModel.from_pretrained('mat_masking_20M')

# Encode (featurize) the batch of two SMILES strings: 
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')
batch = featurizer(['C/C=C/C', '[C]=O'])

# Feed the model with the encoded batch:
output = model(batch)

# Save the weights of the model (usually after the fine-tuning process):
model.save_weights('tuned_mat_masking_20M.pt')

# Load the previously saved weights
# (which now include all layers of the model):
model.load_weights('tuned_mat_masking_20M.pt')

# Load the previously saved weights, but without 
# the last layer of the model ('generator' in the case of the 'MatModel')
model.load_weights('tuned_mat_masking_20M.pt', excluded=['generator'])

# Build the model and load the previously saved weights:
config = MatConfig.from_pretrained('mat_masking_20M')
model = MatModel.from_pretrained('tuned_mat_masking_20M.pt',
                                 excluded=['generator'],
                                 config=config)
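
Putting the pieces together, a single fine-tuning step can also be written in plain pytorch, without the pytorch lightning wrapper. This is only an illustrative sketch: the target values below are made up, and the output shape (batch_size, 1) is assumed to mirror the regression setup from the quick tour:

import torch
from torch.nn import MSELoss
from torch.optim import Adam

model = MatModel.from_pretrained('mat_masking_20M')
featurizer = MatFeaturizer.from_pretrained('mat_masking_20M')

# Encode a toy batch and pair it with hypothetical regression targets:
batch = featurizer(['C/C=C/C', '[C]=O'])
targets = torch.tensor([[1.0], [2.0]])

optimizer = Adam(model.parameters(), lr=1e-4)
loss_fn = MSELoss()

# One optimization step:
model.train()
optimizer.zero_grad()
output = model(batch)            # assumed shape: (batch_size, 1)
loss = loss_fn(output, targets)
loss.backward()
optimizer.step()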

Running tests

To run the base tests for the src/ module, type:

pytest src/ --ignore=src/tests/downloading/

To additionally run the tests for the downloading module (which will download all models to your local machine and may therefore be slow), type:

pytest src/tests/downloading

The experiments/ module

Requirements

In addition to the dependencies defined in the src/ module, the experiments/ module comes with a few others. To install them, run:

pip install -r experiments/requirements.txt

The following packages are crucial for the functioning of the experiments/ module:

Neptune.ai

In addition, we recommend installing the neptune.ai package:

  1. Sign up to neptune.ai at https://neptune.ai/.

  2. Get your Neptune API token (see getting-started for help).

  3. Export your Neptune API token to the NEPTUNE_API_TOKEN environment variable (see the example after this list).

  4. Install neptune-client: pip install neptune-client.

  5. Enable neptune.ai in the experiments/configs/setup.gin file.

  6. Update neptune.project_name parameters in experiments/configs/bases/*.gin files.
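
For step 3, the token can be exported in the shell before running any scripts (the value below is a placeholder):

export NEPTUNE_API_TOKEN="<your-neptune-api-token>"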

Running scripts

We recommend running the experiments scripts from the source code directory. For the moment, three scripts are implemented:

  • experiments/scripts/train.py - for training with the pytorch lightning package
  • experiments/scripts/tune_hyper.py - for hyper-parameters tuning with the optuna package
  • experiments/scripts/benchmark.py - for benchmarking based on the hyper-parameters tuning (grid-search)

In general, running a script can be done with the following syntax:

python -m experiments.scripts.<script_name> \
       -d <dataset_name> \
       -m <model_name> \
       -b <bindings>

The script <script_name>.py then runs with the function/method parameter values defined in the following gin-config files:

  1. experiments/configs/bases/<script_name>.gin
  2. experiments/configs/datasets/<dataset_name>.gin
  3. experiments/configs/models/<model_name>.gin

If the binding flag -b is used, then the bindings defined in <bindings> override the corresponding bindings defined in the above gin-config files.

So, for instance, to fine-tune the MAT model (pre-trained on the masking_20M task) on the freesolv dataset using GPU 1, simply run:

python -m experiments.scripts.train \
       -d freesolv \
       -m mat \
       -b model.pretrained_name=\"mat_masking_20M\"#train.gpus=[1]

or equivalently:

python -m experiments.scripts.train \
       -d freesolv \
       -m mat \
       --model.pretrained_name mat_masking_20M \
       --train.gpus [1]

Local dataset

To use a local dataset, create an appropriate gin-config file in the experiments/configs/datasets directory and specify the data.data_path parameter within. For details see the get_data_split implementation.
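
For illustration, a hypothetical experiments/configs/datasets/my_dataset.gin could contain a single binding (the file name and data path below are assumptions; data.data_path is the parameter mentioned above):

# experiments/configs/datasets/my_dataset.gin
data.data_path = '/path/to/my_dataset.csv'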

Benchmarking

For the moment there is one benchmark available. It works as follows:

  • experiments/scripts/benchmark.py: on the given dataset we fine-tune the given model with 10 learning rates and 6 seeded data splits (60 fine-tunings in total). We then choose the learning rate that minimizes the validation metric (e.g. RMSE) averaged over the 6 data splits. The reported result is the test metric averaged over the 6 data splits for the chosen learning rate (see the selection sketch below).
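
A minimal sketch of this selection rule, assuming the results of the 60 fine-tunings are collected as results[lr][seed] = (validation_metric, test_metric) (this data layout is an assumption made for illustration, not the repository's actual structure):

import statistics

def benchmark_result(results):
    # Average the validation metric over the seeded splits for every learning rate:
    avg_valid = {lr: statistics.mean(v for v, _ in runs.values())
                 for lr, runs in results.items()}
    # Pick the learning rate that minimizes the averaged validation metric ...
    best_lr = min(avg_valid, key=avg_valid.get)
    # ... and report the test metric averaged over the splits for that learning rate:
    return statistics.mean(t for _, t in results[best_lr].values())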

Running a benchmark is essentially the same as running any other script from the experiments/ module. So, for instance, to benchmark the vanilla MAT model (without pre-training) on the Caco-2 dataset using GPU 0, simply run:

python -m experiments.scripts.benchmark \
       -d caco2 \
       -m mat \
       --model.pretrained_name None \
       --train.gpus [0]

However, the above script only performs the 60 fine-tunings; it won't compute the final benchmark result. To do that, we need to run:

python -m experiments.scripts.benchmark --results_only \
       -d caco2 \
       -m mat

The above script won't perform any fine-tuning, but will only compute the benchmark result. If neptune was enabled in experiments/configs/setup.gin, all data necessary to compute the result will be fetched from the neptune server.

Benchmark results

We performed the benchmark described in the Benchmarking section (experiments/scripts/benchmark.py) for various model architectures and pre-training tasks.

Summary

We report the mean rank (with its standard deviation) of the tested models across all datasets (both regression and classification ones). For detailed results, see the Regression and Classification sections.

model        | mean rank | rank std
MAT 200k     | 5.6       | 3.5
MAT 2M       | 5.3       | 3.4
MAT 20M      | 4.1       | 2.2
GROVER Base  | 3.8       | 2.7
GROVER Large | 3.6       | 2.4
ChemBERTa    | 7.4       | 2.8
MolBERT      | 5.9       | 2.9
D-MPNN       | 6.3       | 2.3
D-MPNN 2d    | 6.4       | 2.0
D-MPNN mc    | 5.3       | 2.1
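
For clarity, below is a sketch of how per-dataset ranks can be aggregated into a mean rank per model; the exact aggregation and tie handling used for the table above may differ, and the ranking direction depends on the metric (lower is better for RMSE/MAE, higher is better for ROC AUC):

from typing import Dict, List

# scores[dataset][model] = metric value for that model on that dataset.
def mean_ranks(scores: Dict[str, Dict[str, float]],
               lower_is_better: bool = True) -> Dict[str, float]:
    ranks: Dict[str, List[int]] = {}
    for dataset_scores in scores.values():
        # Order the models from best to worst on this dataset:
        ordered = sorted(dataset_scores,
                         key=dataset_scores.get,
                         reverse=not lower_is_better)
        for position, model in enumerate(ordered, start=1):
            ranks.setdefault(model, []).append(position)
    # Average each model's rank over all datasets:
    return {model: sum(r) / len(r) for model, r in ranks.items()}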

Regression

We used MAE as the metric for QM7 and RMSE for the remaining datasets.

model           | FreeSolv      | Caco-2        | Clearance     | QM7             | Mean rank
MAT 200k        | 0.913 ± 0.196 | 0.405 ± 0.030 | 0.649 ± 0.341 | 87.578 ± 15.375 | 5.25
MAT 2M          | 0.898 ± 0.165 | 0.471 ± 0.070 | 0.655 ± 0.327 | 81.557 ± 5.088  | 6.75
MAT 20M         | 0.854 ± 0.197 | 0.432 ± 0.034 | 0.640 ± 0.335 | 81.797 ± 4.176  | 5.0
Grover Base     | 0.917 ± 0.195 | 0.419 ± 0.029 | 0.629 ± 0.335 | 62.266 ± 3.578  | 3.25
Grover Large    | 0.950 ± 0.202 | 0.414 ± 0.041 | 0.627 ± 0.340 | 64.941 ± 3.616  | 2.5
ChemBERTa       | 1.218 ± 0.245 | 0.430 ± 0.013 | 0.647 ± 0.314 | 177.242 ± 1.819 | 8.0
MolBERT         | 1.027 ± 0.244 | 0.483 ± 0.056 | 0.633 ± 0.332 | 177.117 ± 1.799 | 8.0
Chemprop        | 1.061 ± 0.168 | 0.446 ± 0.064 | 0.628 ± 0.339 | 74.831 ± 4.792  | 5.5
Chemprop 2d [1] | 1.038 ± 0.235 | 0.454 ± 0.049 | 0.628 ± 0.336 | 77.912 ± 10.231 | 6.0
Chemprop mc [2] | 0.995 ± 0.136 | 0.438 ± 0.053 | 0.627 ± 0.337 | 75.575 ± 4.683  | 4.25

[1] chemprop with the additional rdkit_2d_normalized features generator
[2] chemprop with the additional morgan_count features generator

Classification

We used ROC AUC as the metric.

model        | HIA           | Bioavailability | PPBR          | Tox21 (NR-AR) | BBBP          | Mean rank
MAT 200k     | 0.943 ± 0.015 | 0.660 ± 0.052   | 0.896 ± 0.027 | 0.775 ± 0.035 | 0.709 ± 0.022 | 5.8
MAT 2M       | 0.941 ± 0.013 | 0.712 ± 0.076   | 0.905 ± 0.019 | 0.779 ± 0.056 | 0.713 ± 0.022 | 4.2
MAT 20M      | 0.935 ± 0.017 | 0.732 ± 0.082   | 0.891 ± 0.019 | 0.779 ± 0.056 | 0.735 ± 0.006 | 3.4
Grover Base  | 0.931 ± 0.021 | 0.750 ± 0.037   | 0.901 ± 0.036 | 0.750 ± 0.085 | 0.735 ± 0.006 | 4.0
Grover Large | 0.932 ± 0.023 | 0.747 ± 0.062   | 0.901 ± 0.033 | 0.757 ± 0.057 | 0.757 ± 0.057 | 4.2
ChemBERTa    | 0.923 ± 0.032 | 0.666 ± 0.041   | 0.869 ± 0.032 | 0.779 ± 0.044 | 0.717 ± 0.009 | 7.0
MolBERT      | 0.942 ± 0.011 | 0.737 ± 0.085   | 0.889 ± 0.039 | 0.761 ± 0.058 | 0.742 ± 0.020 | 4.6
Chemprop     | 0.924 ± 0.069 | 0.724 ± 0.064   | 0.847 ± 0.052 | 0.766 ± 0.040 | 0.726 ± 0.008 | 7.0
Chemprop 2d  | 0.923 ± 0.015 | 0.712 ± 0.067   | 0.874 ± 0.030 | 0.775 ± 0.041 | 0.724 ± 0.006 | 6.8
Chemprop mc  | 0.924 ± 0.082 | 0.740 ± 0.060   | 0.869 ± 0.033 | 0.772 ± 0.041 | 0.722 ± 0.008 | 6.2

Comments
  • Same prediction for every molecules with pretrained and finetuned model.

    Hello, when using quick_tour.py, the results are always the same for different molecules. I am confused about this. The result is also the same for the fine-tuned model.

    opened by Changgun-Choi 8
  • ValueError: ('hydrationfreeenergy_freesolv', 'does not match to available values. Please double check.')

    Hi,

    I tried to run quick_tour.py under the huggingmolecules directory, but it gave the following error message:

    (huggingmolecules) zhen@precision:~/huggingmolecules$ python quick_tour.py
    WARNING:root:Argument blacklist is deprecated. Please use denylist.
    WARNING:root:Argument blacklist is deprecated. Please use denylist.
    WARNING:root:Argument blacklist is deprecated. Please use denylist.
    WARNING:root:Argument blacklist is deprecated. Please use denylist.
    WARNING:root:Argument blacklist is deprecated. Please use denylist.
    ['lipophilicity_astrazeneca', 'solubility_aqsoldb', 'caco2_wang', 'hia_hou', 'pgp_broccatelli', 'bioavailability_ma', 'vdss_lombardo', 'cyp2c19_veith', 'cyp2d6_veith', 'cyp3a4_veith', 'cyp1a2_veith', 'cyp2c9_veith', 'cyp2c9_substrate_carbonmangels', 'cyp2d6_substrate_carbonmangels', 'cyp3a4_substrate_carbonmangels', 'bbb_martins', 'ppbr_az', 'half_life_obach', 'clearance_hepatocyte_az', 'clearance_microsome_az']
    Traceback (most recent call last):
      File "quick_tour.py", line 28, in <module>
        train_dataloader, _, _ = get_data_loaders(featurizer,
      File "/home/zhen/huggingmolecules/experiments/src/training/training_utils.py", line 169, in get_data_loaders
        split = get_data_split(task_name=task_name, dataset_name=dataset_name)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/config.py", line 1069, in gin_wrapper
        utils.augment_exception_message_and_reraise(e, err_str)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
        raise proxy.with_traceback(exception.__traceback__) from None
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/config.py", line 1046, in gin_wrapper
        return fn(*new_args, **new_kwargs)
      File "/home/zhen/huggingmolecules/experiments/src/training/training_utils.py", line 211, in get_data_split
        split = _get_data_split_from_tdc(task_name, dataset_name, assay_name,
      File "/home/zhen/huggingmolecules/experiments/src/training/training_utils.py", line 231, in _get_data_split_from_tdc
        data = task(name=dataset_name, label_name=assay_name)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/single_pred/dataloader.py", line 38, in __init__
        super().__init__(name, path, label_name, print_stats,
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/single_pred/single_pred_dataset.py", line 17, in __init__
        entity1, y, entity1_idx = property_dataset_load(name, path, label_name, dataset_names)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/utils.py", line 141, in property_dataset_load
        name = download_wrapper(name, path, dataset_names)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/utils.py", line 44, in download_wrapper
        name = fuzzy_search(name, dataset_names)
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/utils.py", line 37, in fuzzy_search
        s =  get_closet_match(dataset_names, name)[0]
      File "/home/zhen/miniconda3/envs/huggingmolecules/lib/python3.8/site-packages/tdc/utils.py", line 734, in get_closet_match
        raise ValueError(test_token,
    ValueError: ('hydrationfreeenergy_freesolv', 'does not match to available values. Please double check.')
      In call to configurable 'data' (<function get_data_split at 0x7fc43fa588b0>)
    

    Could you please take a look?

    BTW, I didn't see the 'hydrationfreeenergy_freesolv' dataset when running the quick_tour.py file. Do you know where the program stores the downloaded dataset? Thank you very much!

    opened by LiuCMU 3
  • ModuleNotFoundError: No module named 'pytorch_lightning.metrics.functional.classification'

    Hi, when executing quick_tour.py, it throws me the following error in line 9:

    Traceback (most recent call last):
      File "quick_tour.py", line 9, in <module>
        from experiments.src import TrainingModule, get_data_loaders
      File "/home/jack/huggingmolecules/experiments/src/__init__.py", line 1, in <module>
        from experiments.src.training.training_lightning_module import TrainingModule
      File "/home/jack/huggingmolecules/experiments/src/training/__init__.py", line 1, in <module>
        from .training_train_model import train_model
      File "/home/jack/huggingmolecules/experiments/src/training/training_train_model.py", line 6, in <module>
        from .training_lightning_module import TrainingModule
      File "/home/jack/huggingmolecules/experiments/src/training/training_lightning_module.py", line 8, in <module>
        from experiments.src.training.training_metrics import BatchWeightedLoss
      File "/home/jack/huggingmolecules/experiments/src/training/training_metrics.py", line 7, in <module>
        from pytorch_lightning.metrics.functional.classification import auroc
    ModuleNotFoundError: No module named 'pytorch_lightning.metrics.functional.classification'
    

    I tried at my local desktop and a remote server, but both gave the same error. Attached is my environment yaml file. Do you have any thoughts about the issue? Thank you! huggingmolecules.txt

    opened by LiuCMU 3
  • no setup.py

    What's the reason for not installing it from python with pip yet?

    Update: oh i see setup.py is in src/ - if you move the setup.py in the root folder then people can do pip install git+git_repo_url

    opened by ioneliabuzatu 2
  • How to finetune grover?

    Do I see this correctly: your grover.forward returns the logits of the node-view fully connected neural network $p_{i, \text{node-view}}$ and the logits of the edge-view fully connected neural network $p_{i, \text{edge-view}}$ (appendix A.1 in https://arxiv.org/pdf/2007.02835.pdf) in https://github.com/gmum/huggingmolecules/blob/main/src/huggingmolecules/models/models_grover.py#L118

    Thus, to finetune grover one trains on (for a single example i) $L(p_{i, \text{edge-view}}, y_i) + L(p_{i, \text{node-view}}, y_i) + |p_{i, \text{edge-view}} - p_{i, \text{node-view}}|_2$?

    Then to make predictions one uses $\sigma(p_{i, \text{edge-view}}) + \sigma(p_{i, \text{node-view}})$?

    opened by wendlerc 2
  • How to perform classification instead of default regression?

    The quick tour shows only regression, and when I substitute in a classification loss function and metric, the model is still behaving like a regressor. Is there something with which to specify a 'classification' task?

    Thank you!

    opened by pjspol 1
  • R-MAT RDKit matrix shapes

    The following command

    python -m experiments.scripts.train -d freesolv -m rmat --model.pretrained_name rmat_4M_rdkit --train.gpus [0]

    yields

    RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x3272 and 3072x1)

    Given that 3272 - 3072 = 200 and there are 200 RDKit features, I’m assuming that I need to specify something about extra features but I don’t know what. Any help would be much appreciated!

    opened by pjspol 1
  • Loading fine-tuned model checkpoint (last.ckpt) for prediction

    Hi,

    Thanks for providing code for multiple projects. I am having trouble loading the fine-tuned model checkpoint to make a prediction. I used 2 different ways to do that and both times I got an error. Any help is appreciated.

    trainer = Trainer()
    path_ckpt = 'experiments_results/Tune_MatModel_ADME_freesolv/trial_1/Tune_MatModel_ADME_freesolv/last.ckpt'
    
    checkpoint = torch.load(path_ckpt, map_location=torch.device('cpu'))
    pl_model = TrainingModule.load_state_dict(checkpoint['state_dict'])
    
    # Version 1:
    results = trainer.test(ckpt_path=path_ckpt, test_dataloaders=test_loader)
    print(results)
    
    #Version 2
    trainer2 = Trainer()
    results2 = trainer2.test(model=pl_model, test_dataloaders=test_loader)
    

    TrainingModule is my PyTorch-lightning model. The error I am getting for pl_model = TrainingModule.load_state_dict(checkpoint['state_dict']) is as below:

    File "<stdin>", line 1, in <module>
    TypeError: load_state_dict() missing 1 required positional argument: 'state_dict'
    

    And for results = trainer.test(ckpt_path=path_ckpt, test_dataloaders=test_loader) is as below:

    model.load_state_dict(ckpt['state_dict'])
    AttributeError: 'NoneType' object has no attribute 'load_state_dict'
    

    I did some search on these errors and it seems the errors appear due to hyper parameters not being saved explicitly. Is that right?

    Best, Sarkhan

    opened by sbadirli 1
  • Loading Tox21 through the dataloader.

    Hey guys,

    I was trying to load the Tox21 dataset through the get_data_loaders() function with the following call:

    Tox21_dataloader, _, _ = get_data_loaders(featurizer, batch_size=32, task_name='Tox', dataset_name ='Tox21')

    Said call however, returned the following error:

    ValueError: Please select a label name. You can use tdc.utils.retrieve_label_name_list('tox21') to retrieve all available label names. In call to configurable 'data' (<function get_data_split at 0x7fb3f49b3e50>)

    When adding the assay_name = 'nr-ar' parameter to the call, as hinted by tox21-nr-ar.gin, like this:

    Tox21_dataloader, _, _ = get_data_loaders(featurizer, batch_size=32, task_name='Tox', dataset_name ='Tox21', assay_name = 'NR-AR')

    I got the following error:

    TypeError: get_data_loaders() got an unexpected keyword argument 'assay_name'

    When I unpacked the get_data_loaders() function arguments using inspect.getfullargspec(get_data_loaders) with the inspect python package, I got:

    FullArgSpec(args=['featurizer'], varargs=None, varkw=None, defaults=None, kwonlyargs=['batch_size', 'num_workers', 'cache_encodings', 'task_name', 'dataset_name'], kwonlydefaults={'num_workers': 0, 'cache_encodings': False, 'task_name': None, 'dataset_name': None}, annotations={'return': typing.Tuple[torch.utils.data.dataloader.DataLoader, torch.utils.data.dataloader.DataLoader, torch.utils.data.dataloader.DataLoader], 'featurizer': <class 'src.huggingmolecules.featurization.featurization_api.PretrainedFeaturizerMixin'>, 'batch_size': <class 'int'>, 'num_workers': <class 'int'>, 'cache_encodings': <class 'bool'>, 'task_name': <class 'str'>, 'dataset_name': <class 'str'>})

    Which doesn't seem to have an option to specify additional parameters in the call.

    Maybe I'm overthinking this. Could I get a few pointers on how to successfully load the NR-AR Tox21 dataset with a base get_data_loaders() call?

    Regards,

    César Miguel

    opened by cmvcordova 1
  • Clean-up experiments&example and keep only essentials things for benchmark

    • No Optuna
    • Clean verbose code
    • Comments
    • Minimal (no more code than necessary for this single objective)
    • Include also "Quick Tour" as an example
    opened by kudkudak 0
  • Add top-level setup.py

    With this, we can now pip install directly from the repo dir.

    See https://github.com/gmum/huggingmolecules/issues/35.

    We can potentially package up experiments as an extra installable (i.e., via huggingmolecules[experiments]). I will do in a follow-up if necessary.

    opened by chajath 2
  • Make the package pip-friendly and publish to PyPI

    I see that the main rationale for not allowing pip install was rdkit https://github.com/gmum/huggingmolecules/issues/26

    Now that rdkit is available on PyPI for pip install, is it the time to reconsider packaging and publishing huggingmolecules? Seems like a straightforward change, and I'm willing to contribute if you are ok with the idea.

    opened by chajath 0
  • Datasets

    When I run the command python -m experiments.scripts.train -d bioavailability -m mat --model.pretrained_name mat_masking_20M --train.gpus 1 --train.num_epochs 100, the error below is raised. But I can run successfully on the freesolv dataset. The errors appear when I use bioavailability and PPBR.


    [21:05:19] UFFTYPER: Unrecognized atom type: Au6 (7)
    [21:05:19] UFFTYPER: Unrecognized atom type: Au6 (7)
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3,4,5]
    
      | Name    | Type              | Params
    ----------------------------------------------
    0 | model   | MatModel          | 42.1 M
    1 | loss_fn | BCEWithLogitsLoss | 0     
    ----------------------------------------------
    42.1 M    Trainable params
    0         Non-trainable params
    42.1 M    Total params
    168.231   Total estimated model params size (MB)
    Validation sanity check:  50%|██████████████████████████████████████████████████████████████████████████████████████████████████                                                                                                  | 1/2 [00:00<00:00,  3.84it/s]
    WARNING:root:AUROC requires both negative and positive samples. Returning None
    
    Traceback (most recent call last):
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/home/zhuyun/huggingmolecules/experiments/scripts/train.py", line 13, in <module>
        train_model()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/config.py", line 1605, in gin_wrapper
        utils.augment_exception_message_and_reraise(e, err_str)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/utils.py", line 41, in augment_exception_message_and_reraise
        raise proxy.with_traceback(exception.__traceback__) from None
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/gin/config.py", line 1582, in gin_wrapper
        return fn(*new_args, **new_kwargs)
      File "/home/zhuyun/huggingmolecules/experiments/src/training/training_train_model.py", line 65, in train_model
        trainer.fit(pl_module,
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
        self._run(model)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
        self.dispatch()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
        self.accelerator.start_training(self)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
        self.training_type_plugin.start_training(trainer)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
        self._results = trainer.run_stage()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
        return self.run_train()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
        self.run_sanity_check(self.lightning_module)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
        self.run_evaluation()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 988, in run_evaluation
        self.evaluation_loop.evaluation_epoch_end(outputs)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 200, in evaluation_epoch_end
        self.trainer.logger_connector.evaluation_epoch_end()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 251, in evaluation_epoch_end
        self.cached_results.has_batch_loop_finished = True
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 404, in has_batch_loop_finished
        self.update_logger_connector()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 345, in update_logger_connector
        epoch_log_metrics = self.get_epoch_log_metrics()
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 417, in get_epoch_log_metrics
        return self.run_epoch_by_func_name("get_epoch_log_metrics")
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 410, in run_epoch_by_func_name
        results = [func() for func in results]
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 410, in <listcomp>
        results = [func() for func in results]
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 130, in get_epoch_log_metrics
        return self.get_epoch_from_func_name("get_epoch_log_metrics")
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 123, in get_epoch_from_func_name
        self.run_epoch_func(results, opt_metrics, func_name, *args, **kwargs)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 111, in run_epoch_func
        metrics_to_log = func(*args, add_dataloader_idx=self.has_several_dataloaders, **kwargs)
      File "/home/zhuyun/anaconda3/envs/huggingmolecules/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 289, in get_epoch_log_metrics
        result[dl_key] = self[k].compute().detach()
    AttributeError: 'NoneType' object has no attribute 'detach'
      In call to configurable 'train' (<function train_model at 0x7f15a2a6eca0>)
    
    opened by ZhuYun97 2
Owner: GMUM (Group of Machine Learning Research, Jagiellonian University)