FS-Mol: A Few-Shot Learning Dataset of Molecules

This repository contains data and code for FS-Mol: A Few-Shot Learning Dataset of Molecules.

Installation

  1. Clone or download this repository

  2. Install dependencies

    cd FS-Mol
    
    conda env create -f environment.yml
    conda activate fsmol
    

The code for the Molecule Attention Transformer baseline is added as a submodule of this repository. Hence, in order to be able to run MAT, one has to clone our repository via git clone --recurse-submodules. Alternatively, one can first clone our repository normally, and then set up submodules via git submodule update --init. If the MAT submodule is not set up, all the other parts of our repository should continue to work.

Data

The dataset is available as a download, FS-Mol Data, split into train, valid and test folders. Additionally, the file datasets/fsmol-0.1.json specifies a default list of tasks to be used for each data fold. Note that the complete dataset contains many more tasks: to use all available training tasks, pass the training script argument --task_list_file datasets/entire_train_set.json. The task lists will be used to version FS-Mol in future iterations as more data becomes available via ChEMBL.
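
As a rough illustration of how such a task list could be inspected, the sketch below loads a task list file and counts the tasks per fold. It assumes the file is a JSON object mapping fold names to lists of task identifiers; that layout, and the helper names load_task_list and summarize_task_list, are assumptions for illustration and not part of the repository's API.

```python
import json
from typing import Dict, List


def load_task_list(path: str) -> Dict[str, List[str]]:
    """Read a task list file and return a mapping fold name -> task names.

    Assumption: the file is a JSON object whose keys are fold names and
    whose values are lists of task identifiers.
    """
    with open(path, "r", encoding="utf-8") as f:
        task_list = json.load(f)
    if not isinstance(task_list, dict):
        raise ValueError("expected a JSON object mapping folds to task lists")
    return {fold: list(tasks) for fold, tasks in task_list.items()}


def summarize_task_list(task_list: Dict[str, List[str]]) -> Dict[str, int]:
    """Count the tasks assigned to each fold."""
    return {fold: len(tasks) for fold, tasks in task_list.items()}
```

Calling summarize_task_list(load_task_list("datasets/fsmol-0.1.json")) would then report how many tasks each fold contains, provided the file follows the assumed layout.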

Tasks are stored as individual compressed JSONLines files, with each line holding the information for a single datapoint of the task. Each datapoint is stored as a JSON dictionary, following a fixed structure:

{
    "SMILES": "SMILES_STRING",
    "Property": "ACTIVITY BOOL LABEL",
    "Assay_ID": "CHEMBL ID",
    "RegressionProperty": "ACTIVITY VALUE",
    "LogRegressionProperty": "LOG ACTIVITY VALUE",
    "Relation": "ASSUMED RELATION OF MEASURED VALUE TO TRUE VALUE",
    "AssayType": "TYPE OF ASSAY",
    "fingerprints": [...],
    "descriptors": [...],
    "graph": {
        "adjacency_lists": [
           [... SINGLE BONDS AS PAIRS ...],
           [... DOUBLE BONDS AS PAIRS ...],
           [... TRIPLE BONDS AS PAIRS ...]
        ],
        "node_types": [...ATOM TYPES...],
        "node_features": [...NODE FEATURES...]
    }
}
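
The structure above can be read with the Python standard library alone. The following is a minimal sketch, assuming the task files are gzip-compressed with one JSON object per line; the compression format and the helper names read_task_file and count_bond_pairs are assumptions for illustration, and FSMolDataset remains the supported way to access the data.

```python
import gzip
import json
from typing import Any, Dict, List


def read_task_file(path: str) -> List[Dict[str, Any]]:
    """Read one task file, returning one dictionary per datapoint.

    Assumption: the file is gzip-compressed with one JSON object per line.
    """
    datapoints = []
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                datapoints.append(json.loads(line))
    return datapoints


def count_bond_pairs(datapoint: Dict[str, Any]) -> List[int]:
    """Return how many atom-index pairs are stored per bond type.

    The entries of "adjacency_lists" are ordered as single, double and
    triple bonds, following the structure shown above.
    """
    return [len(pairs) for pairs in datapoint["graph"]["adjacency_lists"]]
```

For example, count_bond_pairs(read_task_file(path)[0]) gives the number of stored single, double and triple bond pairs for the first molecule of a task.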

FSMolDataset

The fs_mol.data.FSMolDataset class provides programmatic access in Python to the train/valid/test tasks of the few-shot dataset. An instance is created from the data directory by FSMolDataset.from_directory(/path/to/dataset). More details and examples of how to use FSMolDataset are available in notebooks/dataset.ipynb.

Evaluating a new Model

We provide an implementation of the FS-Mol evaluation methodology in fs_mol.utils.eval_utils.eval_model(). This is a framework-agnostic Python function, and we demonstrate in detail how to use it to evaluate a new model in notebooks/evaluation.ipynb.

Note that our baseline test scripts (fs_mol/baseline_test.py, fs_mol/maml_test.py, fs_mol/mat_test.py, fs_mol/multitask_test.py and fs_mol/protonet_test.py) use this function as well, and can serve as examples of how to integrate per-task fine-tuning in TensorFlow (maml_test.py), fine-tuning in PyTorch (mat_test.py) and single-task training for scikit-learn models (baseline_test.py). These scripts also support the --task_list_file parameter to choose different sets of test tasks, as required.

Baseline Model Implementations

We provide implementations of three key few-shot learning methods: Multitask learning, Model-Agnostic Meta-Learning, and Prototypical Networks. We also provide evaluation of the Single-Task baselines and of the Molecule Attention Transformer (MAT).

All results and associated plots are found in the baselines/ directory.

These baseline methods can be run on the FS-Mol dataset as follows:

kNNs and Random Forests -- Single Task Baselines

Our kNN and RF baselines are obtained by grid search over an industry-standard parameter set, detailed in the script baseline_test.py.

The baseline single-task evaluation can be run as follows, with a choice of kNN or randomForest model:

python fs_mol/baseline_test.py /path/to/data --model {kNN, randomForest}

Molecule Attention Transformer

The Molecule Attention Transformer (MAT) can be evaluated as:

python fs_mol/mat_test.py /path/to/pretrained-mat /path/to/data

GNN-MAML pre-training and evaluation

The GNN-MAML model consists of an 8-layer GNN, with node-embedding dimension 128 and "Edge-MLP" message passing, operating on the molecular graph representations of the dataset. The model was trained with a support set size of 16 according to the MAML procedure (Finn 2017). The hyperparameters used in the model checkpoint are the default settings of maml_train.py.

The current defaults were used to train the final version of GNN-MAML listed under Available Model Checkpoints. Pre-training is run as:

python fs_mol/maml_train.py /path/to/data 

Evaluation is run as:

python fs_mol/maml_test.py /path/to/data --trained_model /path/to/gnn-maml-checkpoint

GNN-MT pre-training and evaluation

The GNN-MT model consists of a 10-layer GNN, with node-embedding dimension 128 and principal neighbourhood aggregation (PNA) message passing, operating on the molecular graph representations of the dataset. The hyperparameters used in the model checkpoint are the default settings of multitask_train.py. This method has similarities to the approach taken for the task-only training in Hu 2019. Pre-training is run as:

python fs_mol/multitask_train.py /path/to/data 

Evaluation is run as:

python fs_mol/multitask_test.py /path/to/gnn-mt-checkpoint /path/to/data

Prototypical Networks (PN) pre-training and evaluation

The prototypical networks method (Snell 2017) extracts representations of support set datapoints and uses these to classify positive and negative examples. We use the Mahalanobis distance as the metric for the distance between query points and class prototypes. Pre-training is run as:

python fs_mol/protonet_train.py /path/to/data 

Evaluation is run as:

python fs_mol/protonet_test.py /path/to/pn-checkpoint /path/to/data
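
To make the prototype-based classification described in this section concrete, here is a minimal pure-Python sketch. It is not the repository's implementation: it simplifies the Mahalanobis distance to a diagonal covariance (a variance-scaled Euclidean distance), whereas a full implementation would typically estimate a regularised full covariance matrix per class. The function names and the eps regulariser are assumptions made for illustration.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def mean_and_variance(points: List[Vector], eps: float = 1e-3) -> Tuple[List[float], List[float]]:
    """Per-dimension mean (the class prototype) and regularised variance."""
    n, dim = len(points), len(points[0])
    mean = [sum(p[d] for p in points) / n for d in range(dim)]
    # eps keeps the variance strictly positive for tiny support sets.
    var = [sum((p[d] - mean[d]) ** 2 for p in points) / n + eps for d in range(dim)]
    return mean, var


def mahalanobis_sq(x: Vector, mean: Vector, var: Vector) -> float:
    """Squared Mahalanobis distance under a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))


def classify(query: Vector, support: Dict[int, List[Vector]]) -> int:
    """Assign the query embedding to the class with the closest prototype."""
    stats = {label: mean_and_variance(pts) for label, pts in support.items()}
    return min(stats, key=lambda label: mahalanobis_sq(query, *stats[label]))
```

Here support maps each class label (for example 0 for inactive, 1 for active) to the embeddings of its support set datapoints, and classify returns the label of the nearest prototype for a query embedding.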

Available Model Checkpoints

We provide pre-trained models for GNN-MAML, GNN-MT and PN; these are downloadable from figshare.

| Model Name | Description | Checkpoint File |
|---|---|---|
| GNN-MAML | Support set size 16. 8-layer GNN. Edge MLP message passing. | MAML-Support16_best_validation.pkl |
| GNN-MT | 10-layer GNN. PNA message passing. | multitask_best_model.pt |
| PN | 10-layer GNN, PNA message passing. ECFP+GNN, Mahalanobis distance metric. | PN-Support64_best_validation.pt |

Specifying, Training and Evaluating New Model Implementations

Few-shot and single-task models can be defined flexibly, as demonstrated by the range of train and test scripts in fs_mol.

We give a detailed example of how to use the abstract class AbstractTorchFSMolModel in notebooks/integrating_torch_models.ipynb to integrate a new general PyTorch model, and note that the evaluation procedure described above is demonstrated on sklearn models in fs_mol/baseline_test.py and on a TensorFlow-based GNN model in fs_mol/maml_test.py.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Comments
  • chore(*): update env to stop breaking MAML train, fix notebook datase…

    Tiny changes -- easier to add azure to env than remove all references. Also there was an issue raised by a user re: path to the dataset in one of the notebooks.

    opened by megstanley 3
  • Dataset filtering details

    In your FS-Mol paper, it is said that only assays with 32 to 5000 compounds are kept, and the remaining training dataset then contains 4938 assays. However, if I try to filter out those from your provided dataset loading code, I'm left with ~24k assays.

    import numpy as np
    from fs_mol.data import FSMolDataset, DataFold

    # FS_MOL_DATASET_PATH is the path to the downloaded dataset directory.
    dataset = FSMolDataset.from_directory(FS_MOL_DATASET_PATH)
    train_task_iterable = dataset.get_task_reading_iterable(DataFold.TRAIN)
    # Collect the number of datapoints in each training task.
    assay_sizes = np.array([len(t.samples) for t in train_task_iterable])
    # Count the tasks with between 32 and 5000 datapoints (inclusive).
    print(np.sum(np.logical_and(assay_sizes >= 32, assay_sizes <= 5000)))
    # prints 23832

    Is there something obvious that I'm missing?

    opened by gregorkrz 2
  • Possible Error in Dataset Notebook

    Hello,

    In your example notebook for datasets (https://github.com/microsoft/FS-Mol/blob/main/notebooks/dataset.ipynb) the dataset path seems to be wrong. You have put FS_MOL_DATASET_PATH = os.path.join(os.environ['HOME'], "Datasets", "FS-Mol"), but for it to work I had to change this to FS_MOL_DATASET_PATH = os.path.join(os.environ['HOME'], "Datasets", "FS-Mol", "datasets") (i.e. use the "/datasets" dir instead of the base dir in the repo). Perhaps you could update this if it is a mistake, or clarify it if it is not a mistake?

    opened by AustinT 1
  • Adding the baselines results csvs and data

    Adding baselines summary csvs that are used by notebooks/visualize to plot everything. Update this PR'ed branch with final PN results when available.

    Also small changes to plotting utils to allow consistent task highlighting/less naff colours etc.

    also moved the target_info csvs

    opened by megstanley 1
  • Adding plotting notebook

    Adding plotting notebook, updating the utils.

    The file paths in here need to be changed when the dataset/results csvs are moved in to the repo

    NOTE: obvious TODO I will do before merging -- move this to the notebooks/ directory

    opened by megstanley 1
  • Dataset documentation

    Changes to README.md to point to a notebook explaining how the dataset works, extensions to docstrings for fs_mol/data.

    Part 1 of docs, to be followed with Part 2: training a fresh model and evaluating.

    opened by megstanley 1
  • fix(data/maml.py): type mismatch between tf2gnn input signature and data

    Type casts to match the input to the tf2gnn specified input types here: https://github.com/microsoft/tf2-gnn/blob/182eb6b337cecf1f0d6dce237a4a8ff4e5599e67/tf2_gnn/layers/gnn.py#L220

    opened by megstanley 0
  • Unify GNN/graph readout between GNNMultitask and Protonet models

    This is a bit unwieldy because it touches a lot of components, but in particular it does three related things:

    • Unify all code related to "take graph, provide graph representation" into shared classes, which are used in both GNNMultitask as well as in the ProtoNet case. This includes not only model code, but also configuration objects and command line argument-handling.
    • Fix the oddity in which our models had to transform data to torch tensors (because inputs sometimes were np arrays); they now all take torch.Tensor objects, and the data loading infrastructure is fixed to provide these.
    • Add two (small) new features: loading of a pre-trained GNN in ProtoNets (which is enabled by the shared code, as it takes the GNNMultitask-pretrained model), and normalization on GNN outputs (off by default).
    opened by mmjb 0
  • feat: pipeline pushing GitHub updates to MSR private fork

    This automatically syncs updates on the public repo with the internal repo (by merging) on all branches; if merge fails, the pipeline will fail and require manual intervention.

    opened by mmjb 0
  • Mean metric function

    Fixing the aggregation/taking mean over metrics so that within task means are taken before means over multiple tasks are taken.

    I think I got all the instances where this occurs. Let me know if I missed one.

    This is on top of the multiway validation PR for now.

    opened by megstanley 0
  • fix(multitask_train): excess arg to train loop

    Fix to prevent multitask_train from breaking due to an unused argument in the train_loop() call.

    Do we want to occasionally save intermediate best validation models?

    opened by megstanley 0
  • Evaluation methodology for regression task

    Hi,

    Thanks for making this dataset available! I'm wondering if there are any scripts for evaluating models on the task of predicting continuous values, i.e. RegressionProperty, or maybe a reference that uses FS-Mol for this task? In the utils/metrics directory I only see binary evaluation tools.

    Thanks!

    opened by PeterEckmann1 0
Owner
Microsoft