Making self-supervised learning work on molecules by using their 3D geometry to pre-train GNNs. Implemented in DGL and PyTorch Geometric.

Overview

3D Infomax improves GNNs for Molecular Property Prediction

Video | Paper

We pre-train GNNs to understand the geometry of molecules given only their 2D molecular graphs, which they can then use to make better molecular property predictions. Below is a three-step guide for how to use the code and how to reproduce our results. If you have questions, don't hesitate to open an issue or ask me via [email protected] or social media. I am happy to hear from you!

This repository additionally adapts different self-supervised learning methods, such as "Bootstrap your own Latent", "Barlow Twins", and "VICReg", to graphs.
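
At its core, the pre-training contrasts a molecule's 2D graph embedding with an embedding of its 3D geometry: the two views of the same molecule form the positive pair, the other molecules in the batch serve as negatives. Below is a minimal, hedged sketch of such an NTXent-style loss; the function name and temperature are illustrative only, and the repo's own loss implementations are the authoritative reference.

    import torch
    import torch.nn.functional as F

    def ntxent_loss(z_2d, z_3d, temperature=0.1):
        """Contrast each molecule's 2D embedding with its own 3D embedding (positive pair)
        against the other molecules in the batch (negatives). Both inputs: [batch, dim]."""
        z_2d = F.normalize(z_2d, dim=1)
        z_3d = F.normalize(z_3d, dim=1)
        logits = z_2d @ z_3d.t() / temperature                    # cosine similarities, [batch, batch]
        targets = torch.arange(z_2d.size(0), device=z_2d.device)  # positives sit on the diagonal
        return F.cross_entropy(logits, targets)

    # illustrative call with random embeddings in place of real GNN outputs
    loss = ntxent_loss(torch.randn(8, 64), torch.randn(8, 64))
    print(loss.item())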

Step 1: Setup Environment

We will set up the environment using Anaconda. Clone the current repo:

git clone https://github.com/HannesStark/3DInfomax

Create a new environment with all required packages using environment.yml (this can take a while). While in the project directory, run:

conda env create

Activate the environment:

conda activate graphssl

Step 2: 3D Pre-train a model

Let's pre-train a GNN with 50,000 molecules and their structures from the QM9 dataset (you can also skip to Step 3 and use the pre-trained model weights provided in this repo). For other datasets, see the Data section below.

python train.py --config=configs_clean/pre-train_QM9.yml

This will first create the processed data from dataset/QM9/qm9.csv with the 3D information in qm9_eV.npz. Then your model starts pre-training, and all logs are saved in the runs folder, which will also contain the pre-trained model as best_checkpoint.pt that can later be loaded for fine-tuning.
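
If you want to sanity-check the raw 3D data before pre-training, qm9_eV.npz is a plain NumPy archive. A minimal sketch for peeking inside it; the array names it contains are not assumed here, we only list them:

    import numpy as np

    # List what the archive stores instead of assuming specific keys.
    archive = np.load('dataset/QM9/qm9_eV.npz', allow_pickle=True)
    for key in archive.files:
        print(key, archive[key].shape)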

You can start tensorboard and navigate to localhost:6006 in your browser to monitor the training process:

tensorboard --logdir=runs --port=6006

Explanation:

The config files in configs_clean provide additional examples and blueprints to train different models. The files always contain a model_type (the 2D network that should be pre-trained) and a model3d_type (the 3D network), where you can specify the parameters of these networks. To find out more about all the other parameters in the config file, have a look at their descriptions by running python train.py --help.
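
As a quick way to see how a config pairs the two networks, the .yml files can be loaded directly. A small sketch, assuming PyYAML is available (the training code itself needs it to parse these configs):

    import yaml  # PyYAML; assumed to be installed in the conda environment

    # Peek at which 2D and 3D networks a config pairs together.
    # model_type / model3d_type are the keys described above.
    with open('configs_clean/pre-train_QM9.yml') as f:
        config = yaml.safe_load(f)
    print('2D network (gets pre-trained):', config.get('model_type'))
    print('3D network:', config.get('model3d_type'))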

Step 3: Fine-tune a model

During pre-training, a directory containing the pre-trained model is created inside the runs directory. We provide an example of such a directory with already pre-trained weights, runs/PNA_qmugs_NTXentMultiplePositives_620000_123_25-08_09-19-52, which we can fine-tune for predicting QM9's homo property as follows:

python train.py --config=configs_clean/tune_QM9_homo.yml

You can monitor the fine-tuning process on TensorBoard as well. In the end, the results are printed to the console and also saved to the file evaluation_test.txt in the runs directory that was created for fine-tuning.

The model which we are fine-tuning from is specified in configs_clean/tune_QM9_homo.yml via the parameter:

pretrain_checkpoint: runs/PNA_qmugs_NTXentMultiplePositives_620000_123_25-08_09-19-52/best_checkpoint_35epochs.pt
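
If you want to inspect such a checkpoint yourself, it is an ordinary PyTorch file. A hedged sketch; the keys stored inside the checkpoint dictionary are not assumed, we only print them:

    import torch

    # Load on the CPU so no GPU is needed just for inspection.
    ckpt_path = 'runs/PNA_qmugs_NTXentMultiplePositives_620000_123_25-08_09-19-52/best_checkpoint_35epochs.pt'
    checkpoint = torch.load(ckpt_path, map_location='cpu')

    # The exact contents depend on how training saved it, so list the keys
    # instead of assuming names like 'model_state_dict'.
    if isinstance(checkpoint, dict):
        print(list(checkpoint.keys()))
    else:
        print(type(checkpoint))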

Multiple seeds:

Here is a second fine-tuning example, where we predict non-quantum properties from the OGB datasets and train multiple seeds (we always use the seeds 1, 2, 3, 4, 5, 6 in our experiments):

python train.py --config=configs_clean/tune_freesolv.yml

After all runs are done, the averaged results are saved in the runs directory of each seed in the file multiple_seed_test_statistics.txt.
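
Conceptually, the statistics in that file are just the mean and standard deviation over the per-seed test metrics. A small sketch with made-up numbers; the real values come from each seed's evaluation_test.txt:

    import statistics

    # Hypothetical per-seed test metric (e.g. RMSE on FreeSolv) for seeds 1-6;
    # the real numbers are produced by the runs above.
    per_seed_rmse = [2.31, 2.28, 2.40, 2.35, 2.25, 2.33]
    print(f'{statistics.mean(per_seed_rmse):.3f} +/- {statistics.stdev(per_seed_rmse):.3f}')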

Data

You can pre-train or fine-tune on different datasets by specifying the dataset: parameter in a .yml file such as dataset: drugs to use GEOM-Drugs.

The QM9 dataset and the OGB datasets are already provided with this repository. The QMugs and GEOM-Drugs datasets need to be downloaded and placed in the correct location.

GEOM-Drugs: Download GEOM-Drugs here (the rdkit_folder.tar.gz file), unzip it, and place it into dataset/GEOM.

QMugs: Download QMugs here (the structures.tar and summary.csv files), unpack structures.tar, and place the resulting structures folder and the summary.csv file into a new folder QMugs that you have to create NEXT TO the repository root, not inside it (sorry for this).
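
To summarize where the downloaded data should end up, the layout looks roughly like this (directory names taken from the steps above; the files inside the archives are whatever they contain):

    parent_directory/
    ├── 3DInfomax/                # repository root
    │   └── dataset/
    │       └── GEOM/
    │           └── rdkit_folder/ # unpacked rdkit_folder.tar.gz
    └── QMugs/                    # created NEXT TO the repository root, not inside it
        ├── structures/           # unpacked structures.tar
        └── summary.csv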

Comments
  • having trouble training for GEOM-Mol + trained models

    File "/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear return torch._C._nn.linear(input, weight, bias) RuntimeError: CUDA out of memory. Tried to allocate 410.00 MiB (GPU 0; 11.17 GiB total capacity; 9.92 GiB already allocated; 336.44 MiB free; 10.30 GiB reserved in total by PyTorch) Any idea?

    Also, would it be possible for you to put up trained models for both QM9 and GEOM-Drugs?

    opened by rohanvarm 5
  • Fine-tuning a model with `BACEGeomol` dataset - error when collating

    Hello, I am trying to fine-tune the pre-trained model you provided here, runs/PNA_qmugs_NTXentMultiplePositives_620000_123_25-08_09-19-52/best_checkpoint_35epochs.pt, with a CSV of the BACE dataset. When I use the BACEGeomol class to load and process the data, I get the following error when trying to collate the data with torch_geometric.

    Traceback (most recent call last):
      File "train.py", line 702, in <module>
        train(args)
      File "train.py", line 286, in train
        return train_geomol(args, device, metrics_dict)
      File "train.py", line 313, in train_geomol
        train = dataset(split='train', device=device)
      File "/home/ubuntu/code/3DInfomax/datasets/bace_geomol_feat.py", line 59, in __init__
        super(BACEGeomol, self).__init__(root, transform, pre_transform)
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/in_memory_dataset.py", line 57, in __init__
        super().__init__(root, transform, pre_transform, pre_filter)
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 88, in __init__
        self._process()
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/dataset.py", line 171, in _process
        self.process()
      File "/home/ubuntu/code/3DInfomax/datasets/bace_geomol_feat.py", line 127, in process
        data, slices = self.collate(data_list)
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/in_memory_dataset.py", line 116, in collate
        add_batch=False,
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 85, in collate
        increment)
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 179, in _collate
        key, [v[key] for v in values], data_list, stores, increment)
      File "/home/ubuntu/anaconda3/envs/3DInfomax/lib/python3.7/site-packages/torch_geometric/data/collate.py", line 179, in <listcomp>
        key, [v[key] for v in values], data_list, stores, increment)
    KeyError: 0
    

    My questions are:

    1. I see there are a lot of different dataset classes. Can I use the BACEGeomol dataset class to fine-tune that model, or should I be using a different dataset class? I'm also not sure because I see different functions to featurize molecules in the different classes.
    2. Have you seen this error before, and do you know what might be causing the bug?

    Thank you!

    opened by davidfarinajr 4
  • Having trouble pre-training with example code

    Hi, after installing all the required packages, I followed Step 2 in the README to run the following code:

    python train.py --config=configs_clean/pre-train_QM9.yml

    However, I got the following error:

    Traceback (most recent call last):
      File "train.py", line 699, in <module>
        train(args)
      File "train.py", line 270, in train
        return train_qm9(args, device, metrics_dict)
      File "train.py", line 562, in train_qm9
        dist_embedding=args.dist_embedding, num_radial=args.num_radial)
      File "/data2/3DInfomax/datasets/qm9_dataset.py", line 187, in __init__
        self.dist_embedder = dist_emb(num_radial=6).to(device)
      File "/data2/3DInfomax/commons/spherical_encoding.py", line 183, in __init__
        self.reset_parameters()
      File "/data2/3DInfomax/commons/spherical_encoding.py", line 186, in reset_parameters
        torch.arange(1, self.freq.numel() + 1, out=self.freq).mul_(PI)
    RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

    I didn't modify the code. Any idea what causes this error?

    opened by Data-reindeer 4
  • ResolvePackageNotFound in Windows

    Hello, when using Anaconda on Windows 10, I ran into the following error: ResolvePackageNotFound for these packages:

    • dgl-cuda 10.2
    • torchaudio
    • pytorch-geometry
    • torchvision

    I wonder whether this means I really need a computer with an NVIDIA card so that I can install CUDA and cuDNN before installing the packages above? Many thanks.

    opened by MianWang11111 2
  • Pretrained 3d model

    Hello, after reading the paper and part of the code (which may be hard for me to fully understand), my understanding is that you have a 3D model but it is not pre-trained, right? If I misunderstood, could you tell me how I should use the 3D pre-trained model? Thanks for your help!

    opened by ivandon15 1
  • Where is the "bbbpscaffold123.pkl"? And what is the "bbbpscaffold123.pkl"?

    I found the dataset file "BBBP.csv". Where is the "bbbpscaffold123.pkl", and what is it? Thank you very much for helping me.

    opened by MianWang11111 1
  • Linked video freezes

    Hi, this doesn't concern the repo, but the linked talk (YouTube video) freezes after ~22 minutes. Is that the complete recording, or is there a fix? Thanks.

    opened by nfzd 1
  • DglPCQM4MDataset ImportError for inference

    Hello!

    Hope you are well! When I run inference on the included example, I get the following error:

    ImportError: cannot import name 'DglPCQM4MDataset' from 'ogb.lsc' 
    

    ogb is installed and is version 1.3.3. Any thoughts as to why this error would occur? Thanks.

    opened by jadolfbr 2