GT4SD (Generative Toolkit for Scientific Discovery)
The GT4SD (Generative Toolkit for Scientific Discovery) is an open-source platform to accelerate hypothesis generation in the scientific discovery process. It provides a library for making state-of-the-art generative AI models easier to use.
Installation
pip
You can install gt4sd directly from GitHub:
pip install git+https://github.com/GT4SD/gt4sd-core
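After installing, a quick way to confirm that the package is visible to the current Python environment is to look it up without importing it. This is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of gt4sd.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    # find_spec returns None when a top-level package is not installed
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Prints True once the pip command above has completed successfully
    print(is_installed("gt4sd"))
```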
Development setup & installation
If you would like to contribute to the package, we recommend the following development setup: Clone the gt4sd-core repository:
git clone git@github.com:GT4SD/gt4sd-core.git
cd gt4sd-core
conda env create -f conda.yml
conda activate gt4sd
pip install -e .
Learn more in CONTRIBUTING.md
Supported packages
Beyond implementing various generative modeling inference and training pipelines, GT4SD is designed to provide a high-level API that implements a harmonized interface for several existing packages:
- GuacaMol: inference pipelines for the baseline models.
- MOSES: inference pipelines for the baseline models.
- TAPE: encoder modules compatible with the protein language models.
- PaccMann: inference pipelines for all algorithms of the PaccMann family as well as training pipelines for the generative VAEs.
- transformers: training and inference pipelines for generative models from the HuggingFace Models.
Using GT4SD
Running inference pipelines
Running an algorithm is as easy as typing:
from gt4sd.algorithms.conditional_generation.paccmann_rl.core import (
    PaccMannRLProteinBasedGenerator, PaccMannRL
)
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
# algorithm configuration with default parameters
configuration = PaccMannRLProteinBasedGenerator()
# instantiate the algorithm for sampling
algorithm = PaccMannRL(configuration=configuration, target=target)
items = list(algorithm.sample(10))
print(items)
Or you can use the ApplicationsRegistry to run an algorithm instance using a serialized representation of the algorithm:
from gt4sd.algorithms.registry import ApplicationsRegistry
target = 'MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT'
algorithm = ApplicationsRegistry.get_application_instance(
    target=target,
    algorithm_type='conditional_generation',
    domain='materials',
    algorithm_name='PaccMannRL',
    algorithm_application='PaccMannRLProteinBasedGenerator',
    generated_length=32,
    # include additional configuration parameters as **kwargs
)
items = list(algorithm.sample(10))
print(items)
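Because the registry call takes only keyword arguments, the same description can be stored as text and restored later. The sketch below assumes JSON as the storage format, which is an illustrative choice rather than a format prescribed by GT4SD; the helper names load_algorithm_kwargs and build_algorithm are hypothetical. The values mirror the call above.

```python
import json

# Serialized description of the algorithm (JSON is an assumption for illustration)
serialized = json.dumps(
    {
        "target": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT",
        "algorithm_type": "conditional_generation",
        "domain": "materials",
        "algorithm_name": "PaccMannRL",
        "algorithm_application": "PaccMannRLProteinBasedGenerator",
        "generated_length": 32,
    }
)


def load_algorithm_kwargs(text: str) -> dict:
    """Restore the keyword arguments from their serialized form."""
    return json.loads(text)


def build_algorithm(kwargs: dict):
    """Instantiate the algorithm through the registry (requires gt4sd)."""
    # Imported lazily so the JSON handling above works without gt4sd installed
    from gt4sd.algorithms.registry import ApplicationsRegistry

    return ApplicationsRegistry.get_application_instance(**kwargs)


if __name__ == "__main__":
    algorithm = build_algorithm(load_algorithm_kwargs(serialized))
    print(list(algorithm.sample(10)))
```

Keeping the description as plain data makes it easy to store alongside experiment results and to rerun the same configuration later.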
Running training pipelines via the CLI command
GT4SD provides a trainer client based on the gt4sd-trainer CLI command. The trainer currently supports training pipelines for language modeling (language-modeling-trainer), PaccMann (paccmann-vae-trainer) and Granular (granular-trainer, multimodal compositional autoencoders).
$ gt4sd-trainer --help
usage: gt4sd-trainer [-h] --training_pipeline_name TRAINING_PIPELINE_NAME
[--configuration_file CONFIGURATION_FILE]
optional arguments:
-h, --help show this help message and exit
--training_pipeline_name TRAINING_PIPELINE_NAME
Training type of the converted model, supported types:
granular-trainer, language-modeling-trainer, paccmann-
vae-trainer. (default: None)
--configuration_file CONFIGURATION_FILE
Configuration file for the trainining. It can be used
to completely by-pass pipeline specific arguments.
(default: None)
To launch a training you have two options.
You can either specify the training pipeline and the path of a configuration file that contains the needed training parameters:
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --configuration_file ${CONFIGURATION_FILE}
Or you can provide the needed parameters directly as arguments:
gt4sd-trainer --training_pipeline_name language-modeling-trainer --type mlm --model_name_or_path mlm --training_file /path/to/train_file.jsonl --validation_file /path/to/valid_file.jsonl
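When launching trainings from a script, both options can be assembled as an argument list before handing it to a process runner. The following is a minimal sketch, not part of GT4SD: the function build_trainer_command is hypothetical, the configuration file path is a placeholder, and only the flags shown in the commands above are used.

```python
import shlex
from typing import Dict, List, Optional


def build_trainer_command(
    training_pipeline_name: str,
    configuration_file: Optional[str] = None,
    pipeline_arguments: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Assemble a gt4sd-trainer invocation as an argument list."""
    command = ["gt4sd-trainer", "--training_pipeline_name", training_pipeline_name]
    # Option 1: point the trainer at a configuration file
    if configuration_file is not None:
        command += ["--configuration_file", configuration_file]
    # Option 2: pass pipeline-specific parameters as individual flags
    for name, value in (pipeline_arguments or {}).items():
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    # Placeholder path; replace it with a real configuration file
    from_file = build_trainer_command(
        "language-modeling-trainer",
        configuration_file="/path/to/configuration_file",
    )
    from_arguments = build_trainer_command(
        "language-modeling-trainer",
        pipeline_arguments={
            "type": "mlm",
            "model_name_or_path": "mlm",
            "training_file": "/path/to/train_file.jsonl",
            "validation_file": "/path/to/valid_file.jsonl",
        },
    )
    # shlex.join quotes each argument so the line can be pasted into a shell
    print(shlex.join(from_file))
    print(shlex.join(from_arguments))
```

The resulting lists can be passed to subprocess.run as they are, which avoids shell quoting problems with paths that contain spaces.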
To get more information on the arguments of a specific training pipeline, simply type:
gt4sd-trainer --training_pipeline_name ${TRAINING_PIPELINE_NAME} --help
References
If you use gt4sd in your projects, please consider citing the following:
@software{GT4SD,
author = {GT4SD Team},
month = {2},
title = {{GT4SD (Generative Toolkit for Scientific Discovery)}},
url = {https://github.com/GT4SD/gt4sd-core},
version = {main},
year = {2022}
}
License
The gt4sd codebase is under the MIT license. For individual model usage, please refer to the model licenses found in the original packages.