Associated Repository for "Translation between Molecules and Natural Language"

Last update: Dec 15, 2022

Overview

MolT5: Translation between Molecules and Natural Language

Associated repository for "Translation between Molecules and Natural Language".

Table of Contents

HuggingFace model checkpoints
T5X-based model checkpoints
Pretraining (MolT5-based models)
Finetuning (MolT5-based models)
Datasets
Citation

HuggingFace model checkpoints

All of our HuggingFace checkpoints are located here.

Pretrained MolT5-based checkpoints include:

molt5-small (~77 million parameters)
molt5-base (~250 million parameters)
molt5-large (~800 million parameters)

You can also easily find our fine-tuned caption2smiles and smiles2caption models. For example, molt5-large-smiles2caption is a molt5-large model that has been further fine-tuned for the task of molecule captioning (i.e., smiles2caption).

Example usage for molecule captioning (i.e., smiles2caption):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-smiles2caption", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-smiles2caption')

input_text = 'C1=CC2=C(C(=C1)[O-])NC(=CC2=O)C(=O)O'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Example usage for molecule generation (i.e., caption2smiles):

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("laituan245/molt5-large-caption2smiles", model_max_length=512)
model = T5ForConditionalGeneration.from_pretrained('laituan245/molt5-large-caption2smiles')

input_text = 'The molecule is a monomethoxybenzene that is 2-methoxyphenol substituted by a hydroxymethyl group at position 4. It has a role as a plant metabolite. It is a member of guaiacols and a member of benzyl alcohols.'
input_ids = tokenizer(input_text, return_tensors="pt").input_ids

outputs = model.generate(input_ids, num_beams=5, max_length=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

T5X-based model checkpoints

Pretraining (MolT5-based models)

We used the open-sourced t5x framework for pretraining MolT5-based models.

Before pretraining MolT5-based models, please first go over this document. Our pretraining task is a mixture of c4_v220_span_corruption and our own task, zinc_span_corruption. The mixture is called zinc_and_c4_mix. The code snippet below illustrates how to define zinc_and_c4_mix (e.g., you can add this snippet to tasks.py). Our Gin config files for pretraining are located in configs/pretrain. Data files can be downloaded from here.

...
import tensorflow.compat.v2 as tf
...
seqio.TaskRegistry.add(
    'zinc_span_corruption',
    source=seqio.TFExampleDataSource(
        split_to_filepattern={
            'test': # Path to zinc_smiles_test.tfrecords,
            'validation': # Path to zinc_smiles_val.tfrecords,
            'train': # Path to zinc_smiles_train.tfrecords,
        },
        feature_description={
            'text': tf.io.FixedLenFeature([], dtype=tf.string),
        }),
    preprocessors=[
        functools.partial(
            preprocessors.rekey, key_map={
                'inputs': None,
                'targets': 'text'
            }),
        seqio.preprocessors.tokenize,
        preprocessors.span_corruption,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[])

seqio.MixtureRegistry.add('zinc_and_c4_mix', [('zinc_span_corruption', 1),
                                              ('c4_v220_span_corruption', 1)])
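
To make the pretraining objective more concrete, here is a toy sketch of span corruption applied to a SMILES string. It is a simplification and not the t5 library's implementation: it works on single characters rather than SentencePiece token ids, and the spans are chosen by hand rather than sampled randomly. The function name span_corrupt, the sentinel strings, and the example molecule are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel token.

    Returns (inputs, targets): inputs keep the uncorrupted tokens with one
    sentinel per removed span; targets list each sentinel followed by the
    tokens that were removed.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        # Spans must be non-empty, in range, and non-overlapping.
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError(f"invalid span: {(start, end)}")
        sentinel = f"<extra_id_{i}>"  # hypothetical sentinel naming
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


# Illustrative molecule (acetic acid), split into characters.
inputs, targets = span_corrupt(list("CC(=O)O"), [(1, 3), (5, 6)])
print("".join(inputs))   # C<extra_id_0>=O<extra_id_1>O
print("".join(targets))  # <extra_id_0>C(<extra_id_1>)
```

The model is trained to produce the targets sequence from the inputs sequence, so it has to reconstruct the missing pieces of the molecule from their context.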

Finetuning (MolT5-based models)

We also used the t5x framework for finetuning MolT5-based models. Please first go over this document. Our Gin config files for finetuning are located in configs/finetune. For each Gin file, you need to set the INITIAL_CHECKPOINT_PATH variable (please use one of the checkpoints mentioned in this section). Note that there are two new tasks, named caption2smiles and smiles2caption. The code snippet below illustrates how to define them. Data files can be downloaded from here.

...
# Metrics
_TASK_EVAL_METRICS_FNS = [
    metrics.bleu,
    metrics.rouge,
    metrics.sequence_accuracy
]

# Data Source
DATA_SOURCE = seqio.TFExampleDataSource(
    split_to_filepattern={
        'train': # Path to chebi_20_train.tfrecords,
        'validation': # Path to chebi_20_dev.tfrecords,
        'test': # Path to chebi_20_test.tfrecords
    },
    feature_description={
        'caption': tf.io.FixedLenFeature([], dtype=tf.string),
        'smiles': tf.io.FixedLenFeature([], dtype=tf.string),
        'cid': tf.io.FixedLenFeature([], dtype=tf.string),
    }
)

# Molecular Captioning (smiles2caption)
seqio.TaskRegistry.add(
    'smiles2caption',
    source=DATA_SOURCE,
    preprocessors=[
        functools.partial(
            preprocessors.rekey,
            key_map={
                'inputs': 'smiles',
                'targets': 'caption'
            }),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=_TASK_EVAL_METRICS_FNS,
)

# Molecule Generation (caption2smiles)
seqio.TaskRegistry.add(
    'caption2smiles',
    source=DATA_SOURCE,
    preprocessors=[
        functools.partial(
            preprocessors.rekey,
            key_map={
                'inputs': 'caption',
                'targets': 'smiles'
            }),
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=_TASK_EVAL_METRICS_FNS,
)
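
The two task definitions share the same data source, metrics, and output features, and differ only in their name and in the key_map passed to rekey. The plain-Python sketch below mimics that step on a single record to show which field becomes the model input and which becomes the target. The helper rekey_example and the record values are hypothetical placeholders, not data from ChEBI-20, and the real preprocessor operates on a tf.data.Dataset rather than a dict.

```python
from typing import Dict, Optional


def rekey_example(
    example: Dict[str, str], key_map: Dict[str, Optional[str]]
) -> Dict[str, str]:
    """Build a new record whose keys are renamed according to key_map.

    A source key of None yields an empty string; this is assumed to match
    how the pretraining task leaves 'inputs' empty.
    """
    return {new: (example[old] if old else "") for new, old in key_map.items()}


# Placeholder record with the three features declared in DATA_SOURCE.
record = {"caption": "placeholder caption", "smiles": "CCO", "cid": "0"}

smiles2caption = rekey_example(record, {"inputs": "smiles", "targets": "caption"})
caption2smiles = rekey_example(record, {"inputs": "caption", "targets": "smiles"})

print(smiles2caption)  # {'inputs': 'CCO', 'targets': 'placeholder caption'}
print(caption2smiles)  # {'inputs': 'placeholder caption', 'targets': 'CCO'}
```

Note that the cid feature is dropped by this step, since it does not appear in either key_map.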

Datasets

ChEBI-20 (txt format)
ZINC (tfrecords format)
ChEBI-20 (tfrecords format)

Citation

If you find our work useful, please cite:

@article{edwards2022translation,
  title={Translation between Molecules and Natural Language},
  author={Edwards, Carl and Lai, Tuan and Ros, Kevin and Honke, Garrett and Ji, Heng},
  journal={arXiv preprint arXiv:2204.11817},
  year={2022}
}

Comments

Questions About Using Text2Mol Evaluation Metric

Hi, thanks for sharing the code.

I am trying to use your proposed Text2Mol Evaluation Metric. However, I encountered the following errors when I run the code.

Traceback (most recent call last):
  File "text_text2mol_metric.py", line 60, in <module>
    cids_to_smiles = pickle.load(f)
_pickle.UnpicklingError: invalid load key, 'v'

When I replaced 'cid_to_smiles.pkl' with the downloaded 3GB version, another issue arose: Chem.MolFromSmiles(cids_to_smiles[cid]) returns None.

I don't know whether the problem comes from the version of Python or of other packages. Could you please help me fix it? Thank you in advance.

opened by Frankie123421 10
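
One plausible cause of invalid load key, 'v' (an assumption, the thread does not confirm it) is that the cid_to_smiles.pkl on disk is a Git LFS pointer, a small text file beginning with the word "version", rather than the actual pickle. A minimal sketch for checking this before unpickling follows. The function names are hypothetical.

```python
import pickle

# Git LFS pointer files are small text files whose first line starts with this.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/"


def looks_like_lfs_pointer(head: bytes) -> bool:
    """Return True if the leading bytes look like a Git LFS pointer file."""
    return head.startswith(LFS_POINTER_PREFIX)


def load_pickle_checked(path: str):
    """Unpickle a file, failing with a clearer message for LFS pointers.

    Only use this on files you trust: unpickling can execute arbitrary code.
    """
    with open(path, "rb") as f:
        head = f.read(len(LFS_POINTER_PREFIX))
        if looks_like_lfs_pointer(head):
            raise RuntimeError(
                f"{path} is a Git LFS pointer, not the real file; "
                "the actual data still needs to be downloaded."
            )
        f.seek(0)  # rewind so pickle sees the whole file
        return pickle.load(f)
```

The second symptom is a separate matter: RDKit's Chem.MolFromSmiles returns None for strings it cannot parse, so checking the result for None before using it would at least locate the offending entries.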
Download cid_to_smiles.pkl

Great job! I can't download the cid_to_smiles.pkl file because of the 1GB free limit on GitHub LFS. Could you provide another download link, such as Google Drive? Thanks a lot!

opened by ddz16 6
Environment setting and re-train with other drug description

Hi, I am very impressed by your work, and I would like to retrain the pretrained model on descriptions of other drugs. However, I cannot find installation instructions. Could you please update the repo so that I can set up my environment?

Thank you!

opened by lichman0405 1

An error occurred when running mol_translation_metrics.py

I was just trying to run the example command python mol_translation_metrics.py --input_file caption2smiles_example.txt.

The error log

0 processed.
Traceback (most recent call last):
  File "C:\Users\laitu\Downloads\MolT5\evaluation\mol_translation_metrics.py", line 66, in <module>
    mscore = meteor_score([gt], out)
  File "C:\Users\laitu\Anaconda3\lib\site-packages\nltk\translate\meteor_score.py", line 397, in meteor_score
    return max(
  File "C:\Users\laitu\Anaconda3\lib\site-packages\nltk\translate\meteor_score.py", line 398, in <genexpr>
    single_meteor_score(
  File "C:\Users\laitu\Anaconda3\lib\site-packages\nltk\translate\meteor_score.py", line 326, in single_meteor_score
    enum_hypothesis, enum_reference = _generate_enums(
  File "C:\Users\laitu\Anaconda3\lib\site-packages\nltk\translate\meteor_score.py", line 33, in _generate_enums
    raise TypeError(
TypeError: "hypothesis" expects pre-tokenized hypothesis (Iterable[str]): C1=CC=C2C(=C1)C(=O)C3=C(C2=O)C=C(C=C3)C(=O)[O-]

bug

opened by laituan245 1
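
The TypeError says that meteor_score expects token sequences rather than raw strings. Below is a minimal sketch of one way to satisfy that, assuming plain whitespace tokenization is acceptable (the thread does not say which tokenizer the authors settled on). The helper names are hypothetical, and NLTK's METEOR additionally needs the WordNet corpus to be available.

```python
from typing import List


def tokenize(text: str) -> List[str]:
    """Whitespace tokenization (an assumption; any List[str] tokenizer works)."""
    return text.split()


def meteor(ground_truth: str, output: str) -> float:
    """Score one output against one reference using pre-tokenized inputs."""
    # Imported lazily so tokenize() can be used without NLTK installed.
    from nltk.translate.meteor_score import meteor_score

    return meteor_score([tokenize(ground_truth)], tokenize(output))
```

Note that a SMILES string contains no whitespace, so under this scheme it becomes a single token. Whether that is a meaningful unit for METEOR is a design choice this sketch does not settle.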
