AugLiChem - The augmentation library for chemical systems.

AugLiChem

Welcome to AugLiChem, the augmentation library for chemical systems! This package supports augmentation for both crystalline and molecular systems, and provides automatic downloading of our benchmark datasets along with easy-to-use model implementations. In-depth documentation on how to use AugLiChem, apply transformations, and train models is available on our website.

Installation

AugLiChem is a Python 3.8+ package.
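A quick way to confirm that the active interpreter meets this requirement is a version check. The sketch below is only illustrative; the helper name meets_minimum_python is hypothetical and is not part of AugLiChem.

```python
import sys

# Minimum version stated in this README.
MIN_PYTHON = (3, 8)


def meets_minimum_python(version_info=None, minimum=MIN_PYTHON):
    """Return True if the (major, minor, ...) version is at least `minimum`.

    Hypothetical helper for illustration; defaults to the running interpreter.
    """
    if version_info is None:
        version_info = sys.version_info
    # Compare only major and minor, e.g. (3, 9) >= (3, 8).
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("Python OK" if meets_minimum_python() else "Python 3.8+ is required")
```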

Linux

It is recommended to use an environment manager such as conda to install AugLiChem. Instructions can be found here. If using conda, creating a new environment is ideal and can be done simply by running the following command:

conda create -n auglichem python=3.8

Then activate the new environment with:

conda activate auglichem

AugLiChem is built primarily on PyTorch, which should be installed separately according to your system specifications. After activating your conda environment, install PyTorch by following the instructions found here.

torch_geometric needs to be installed with conda install pyg -c pyg -c conda-forge.

Once you have pytorch and torch_geometric installed, installing AugLiChem can be done using PyPI:

pip install auglichem
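After the install finishes, it can help to confirm that the packages named above are visible to the active environment. The following is a minimal sketch using only the standard library; the helper missing_modules is hypothetical, and the module names are the import names assumed for the packages mentioned in this section. Note that it only checks that each package can be located, not that it loads without errors.

```python
from importlib.util import find_spec

# Import names assumed for the packages this section installs.
REQUIRED_MODULES = ("torch", "torch_geometric", "auglichem")


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be located in this environment.

    Hypothetical helper: it locates packages without importing them, so a
    package that is present but fails while loading is not reported here.
    """
    missing = []
    for name in names:
        try:
            found = find_spec(name) is not None
        except (ImportError, ValueError):
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    print("All packages found" if not absent else f"Missing: {', '.join(absent)}")
```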

MacOS ARM64 Architecture

A more involved install is required to run on the new M1 chips since some of the packages do not have official support yet. We are working on a more elegant solution given the current limitations.
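To decide whether these instructions apply to a given machine, the platform can be inspected from Python. This is a small sketch with a hypothetical helper name; it assumes a native ARM64 Python build, since an interpreter running under Rosetta reports x86_64 instead.

```python
import platform


def is_macos_arm64(system=None, machine=None):
    """Return True when running on macOS with an ARM64 (Apple Silicon) Python.

    Hypothetical helper; the arguments exist only to make the check testable.
    """
    system = platform.system() if system is None else system
    machine = platform.machine() if machine is None else machine
    return system == "Darwin" and machine == "arm64"


if __name__ == "__main__":
    if is_macos_arm64():
        print("Follow the MacOS ARM64 instructions")
    else:
        print("The standard instructions should apply")
```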

First, download this repo.

If you do not have it yet, conda for the ARM64 architecture needs to be installed. This can be done with Miniforge (which contains a conda installer) by following the guide here.

Once you have a Miniforge installation compatible with the ARM64 architecture, a new environment with rdkit can be installed. If you do not specify python=3.8, it will default to python=3.9.6 as of the time of writing.

conda create -n auglichem python=3.8 rdkit

Now activate the environment:

conda activate auglichem

From here, individual packages can be installed:

conda install -c pytorch pytorch

conda install -c fastchan torchvision

conda install scipy

conda install cython

conda install scikit-learn

pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cpu.html

pip install torch-sparse -f https://data.pyg.org/whl/torch-1.10.0+cpu.html

pip install torch-geometric

Before installing the package, you must go into setup.py in the main directory and comment out rdkit-pypi and tensorboard from the install_requires list since they are already installed. Not commenting these packages out will result in an error during installation.
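The manual edit described above can also be scripted. The sketch below is an assumption-laden illustration, not part of AugLiChem: the helper comment_out_requirements is hypothetical, and it assumes each requirement in install_requires sits on its own line as a quoted string. Check the resulting setup.py by eye before installing.

```python
import re

# Requirement names this README says to comment out.
PREINSTALLED = ("rdkit-pypi", "tensorboard")


def comment_out_requirements(setup_text, names=PREINSTALLED):
    """Comment out lines whose quoted requirement starts with one of `names`.

    Hypothetical helper. Assumes one quoted requirement per line; lines that
    are already commented out are left untouched.
    """
    name_group = "|".join(re.escape(name) for name in names)
    # Leading whitespace, then a quoted requirement beginning with a listed name.
    pattern = re.compile(r"^(\s*)(['\"](?:" + name_group + r")\b.*)$")
    lines = []
    for line in setup_text.splitlines():
        match = pattern.match(line)
        lines.append(f"{match.group(1)}# {match.group(2)}" if match else line)
    trailing = "\n" if setup_text.endswith("\n") else ""
    return "\n".join(lines) + trailing
```

To apply it, read setup.py with pathlib.Path("setup.py").read_text(), pass the text through the function, and write the result back with write_text(). The word-boundary check keeps similarly named packages (for example, a name that merely begins with tensorboard) from being commented out by accident.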

Finally, run:

pip install .

Usage guides for both the molecular and crystal sides of the package are provided in the examples/ directory. Install jupyter with conda install jupyter before working with the examples. After installing the package as described above, the example notebooks can be downloaded separately and run locally.

Authors

Rishikesh Magar*, Yuyang Wang*, Cooper Lorsung*, Hariharan Ramasubramanian, Chen Liang, Peiyuan Li, Amir Barati Farimani

*Equal contribution

Paper

Our paper can be found here

Citation

If you use AugLiChem in your work, please cite:

@misc{magar2021auglichem,
      title={AugLiChem: Data Augmentation Library of Chemical Structures for Machine Learning},
      author={Rishikesh Magar and Yuyang Wang and Cooper Lorsung and Chen Liang and Hariharan Ramasubramanian and Peiyuan Li and Amir Barati Farimani},
      year={2021},
      eprint={2111.15112},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

License

AugLiChem is MIT licensed, as found in the LICENSE file. Please note that some of the dependencies AugLiChem uses may be licensed under different terms.

Comments
  • Importing Different Dataset

    Hi,

    First of all, thanks for creating this tool.

    I am using the OMDB dataset for bandgap prediction with the Schnetpack and ALIGNN models. The OMDB dataset is similar to QM9, but it includes much larger molecules, with 82 atoms per unit cell on average:

    https://omdb.mathub.io/dataset

    The dataset includes bandgap values and an xyz file containing all structures.

    The problem is that the dataset is not as big as QM9. It only includes 12500 molecules, which limits how low the MAE can get. Is there any way to import this dataset into AugLiChem and augment it to train Schnetpack?

    Best regards,

    opened by MiracAydin1 7
  • Supplementary Information

    Hello authors,

    I have read your article, "AugLiChem: Data Augmentation Library of Chemical Structures for Machine Learning." On page 12, the publication states, "More comprehensive findings of RF and SVM are available in the Supplementary Information (Tables S1 and S2)." I have thoroughly reviewed all of the information and materials, but I am unable to locate any further resources. Could you please provide me with the supplementary materials?

    I look forward to hearing from you as soon as possible. Thank you!

    Best regards,

    opened by liuyunwu 4
  • Importing Custom XYZ dataset

    Dear All,

    I would like to import a custom XYZ dataset into AugLiChem. Importing a CIF dataset works fine, but when I tried to import an XYZ dataset for molecules, I found that MoleculeDatasetWrapper only supports the SMILES format and raises an error for XYZ files. Also, as expected, CrystalDatasetWrapper does not support XYZ.

    Is there any way to import XYZ files and augment them?

    Best regards,

    opened by MiracAydin1 3
  • Installation on W10 with conda

    I have been trying to install AugLiChem on Windows 10 following the installation instructions and, surprisingly, everything went smoothly. Two things, though:

    1. pip install auglichem seems to exit cleanly, but from auglichem.molecule import Compose, RandomAtomMask, RandomBondDelete, MotifRemoval did not work because of a torch problem (torch_geometric seems to be the cause).

    Here is the cell response:

    FileNotFoundError                         Traceback (most recent call last)
    Cell In [2], line 1
    ----> 1 from auglichem.molecule import Compose, RandomAtomMask, RandomBondDelete, MotifRemoval
          2 from auglichem.molecule.data import MoleculeDatasetWrapper
          3 from auglichem.molecule.models import GCN, AttentiveFP, GINE, DeepGCN

    File ~\anaconda3\envs\auglichem\lib\site-packages\auglichem\molecule\__init__.py:1
    ----> 1 from auglichem.molecule._transforms import RandomAtomMask, RandomBondDelete, MotifRemoval
          2 from auglichem.molecule._compositions import Compose, OneOf

    File ~\anaconda3\envs\auglichem\lib\site-packages\auglichem\molecule\_transforms.py:15
         12 from rdkit.Chem.BRICS import BRICSDecompose
         14 import torch
    ---> 15 import torch_geometric
         16 from torch_geometric.data import Data as PyG_Data
         18 from auglichem.utils import ATOM_LIST, CHIRALITY_LIST, BOND_LIST, BONDDIR_LIST

    File ~\anaconda3\envs\auglichem\lib\site-packages\torch_geometric\__init__.py:4
          1 from types import ModuleType
          2 from importlib import import_module
    ----> 4 import torch_geometric.data
          5 import torch_geometric.loader
          6 import torch_geometric.transforms

    File ~\anaconda3\envs\auglichem\lib\site-packages\torch_geometric\data\__init__.py:1
    ----> 1 from .data import Data
          2 from .hetero_data import HeteroData
          3 from .temporal import TemporalData

    File ~\anaconda3\envs\auglichem\lib\site-packages\torch_geometric\data\data.py:9
          7 import torch
          8 from torch import Tensor
    ----> 9 from torch_sparse import SparseTensor
         11 from torch_geometric.data.storage import (BaseStorage, EdgeStorage,
         12 GlobalStorage, NodeStorage)
         13 from torch_geometric.deprecation import deprecated

    File ~\anaconda3\envs\auglichem\lib\site-packages\torch_sparse\__init__.py:19
         17 spec = cuda_spec or cpu_spec
         18 if spec is not None:
    ---> 19 torch.ops.load_library(spec.origin)
         20 else: # pragma: no cover
         21 raise ImportError(f"Could not find module '{library}_cpu' in "
         22 f"{osp.dirname(__file__)}")

    File ~\anaconda3\envs\auglichem\lib\site-packages\torch\_ops.py:110, in _Ops.load_library(self, path)
        105 path = torch._utils_internal.resolve_library_path(path)
        106 with dl_open_guard():
        107 # Import the shared library into the process, thus running its
        108 # static (global) initialization code in order to register custom
        109 # operators with the JIT.
    --> 110 ctypes.CDLL(path)
        111 self.loaded_libraries.add(path)

    File ~\anaconda3\envs\auglichem\lib\ctypes\__init__.py:382, in CDLL.__init__(self, name, mode, handle, use_errno, use_last_error, winmode)
        379 self._FuncPtr = _FuncPtr
        381 if handle is None:
    --> 382 self._handle = _dlopen(self._name, mode)
        383 else:
        384 self._handle = handle

    FileNotFoundError: Could not find module 'C:\Users\Andrea.zaliani\anaconda3\envs\auglichem\Lib\site-packages\torch_sparse\_convert_cuda.pyd' (or one of its dependencies). Try using the full path with constructor syntax.

    2. With git clone, one does not receive the ./data_downloads folder needed to run the notebooks. Where can one find the required CSV files?

    Best, and thanks,
    Andrea

    opened by agiani99 2
  • Dev cooper

    Added seed setting for reproducibility in dataset splitting and AFP. Added unit testing for scaffold and random split, as well as AFP initialization. Added multi-task training capability to the molecule dataset, and added a training script that supports multi-task classification.

    opened by CoopLo 2
  • Dev cooper

    Crystal data set loading runs without error using the updated structure. Almost no testing or debugging has been done yet. Automatic data downloading is not yet supported because the data is hosted on a private box folder.

    opened by CoopLo 2
  • Mean and standard deviation of scaled error

    Dear authors, I am reading your paper "Improving Molecular Contrastive Learning via Faulty Negative Mitigation and Decomposed Fragment Contrast". Figure 2b on page 16 of the paper shows a scaled error, but I do not know how to draw this graph from the experimental data. Could you tell me in detail which tools you used and which formula you used to scale the data?

    I am eagerly looking forward to your reply. Thank you very much! Best, Yunwu

    opened by liuyunwu 1
  • Augmentation of fingerprints/SMILES

    Hi,

    I read the paper, and it describes the FP-break and FP-concat techniques for augmenting fingerprints. Is there an example that demonstrates how I can use this package for FP-break/FP-concat?

    opened by ansariyusuf 1
  • Dev cooper

    Added example of model saving/loading. Added checksum verification for molecule datasets where it is available. Added explicit augmentation and file saving for crystal data, and added flag with warning to allow for call-time augmentation. Tests now check data downloading as well.

    opened by CoopLo 1
  • Dev cooper

    Updated names to be Molecule/CrystalDataset and Molecule/CrystalDatasetWrapper. Added basic example notebooks. The full pipeline works in limited cases (pre-downloaded Lanthanides and a few molecular datasets). The codebase still needs renaming/refactoring so that the crystal and molecular sides are consistent, and it still needs docstrings.

    opened by CoopLo 1
  • Dev cooper

    Updated molecule side. It now runs both with and without augmentations, and rudimentary unit tests have been added. The unit tests cover the basics of downloading data automatically and atom/bond masking, both through standalone functions and through the MolData object. Additional testing is necessary for the data loaders and models.

    opened by CoopLo 1
Owner
BaratiLab