Python package for machine learning for healthcare using a OMOP common data model

Overview

omop-learn

What is omop-learn?

This library was developed in order to facilitate rapid prototyping in Python of predictive machine-learning models using longitudinal medical data from an OMOP CDM-standard database. omop-learn supports the easy definition of predictive clinical tasks, featurizations of OMOP data, and cohorts of relevance. We further provide methods using sparse tensor implementations to rapidly manipulate the collected features in the rawest form possible, allowing for dynamic transformations of the data.

Two machine-learning models are included with the library. First, a windowed linear model, which uses various backwards-facing windows to aggregate features over different timescales, then feeds these features into a regularized logistic regression model. This model was inspired by the work of Razavian et. al. '15, and despite its simplicity is often competitive with state-of-the-art algorithms. We also include SARD (Self-Attention with Reverse Distillation), a novel deep-learning algorithm that uses self-attention to allow medical events to contextualize themselves using other events in a patient's timeline. SARD also makes use of reverse distillation, a training technique we introduce that effectively initializes a deep model using a high-performing linear proxy, in this case the windowed linear model described above -- for the details of this method and the SARD architecture, please see our paper Kodialam et al. AAAI '21.

Documentation

For a more detailed summary of omop-learn's data collection pipeline, and for documentation of functions, please see the full documentation for this repo, which also describes the process of creating one's own cohorts, predictive tasks, and features.

Dependencies

The following libraries are necessary to run omop-learn:

  • numpy
  • sqlalchemy
  • pandas
  • torch
  • sklearn
  • matplotlib
  • ipywidgets
  • IPython.display
  • gensim.models
  • scipy.sparse
  • sparse

Note that sparse is the PyData Sparse library, documented here

Running omop-learn

We provide several example notebooks, which all use an example task of predicting mortality over a six-month window for patients over the age of 70.

  • End of Life Linear Model Example.ipynb and End of Life Deep Model Example.ipynb run the windowed linear and deep SARD models respectively -- note that your machine must be able to access a GPU in order to run the deep models.
  • End of Life Linear Model Example (With Nontemporal Features).ipynb demonstrates how to add nontemporal features.
  • End of Life Linear Model Ancestors Example.ipynb demonstrates how to add feature ancestors.
  • End of Life Linear Model Example More Prediction Times.ipynb uses a larger dataset with predictions from any date within a time range.

To run the models, first set up the file config.py with connection information for your Postgres server containing an OMOP CDM database. Then, simply run through the cells of the notebook in order. Further documentation of the exact steps taken to define a task, collect data, and run a predictive model are embedded within the notebooks.

Contributors and Acknowledgements

Omop-learn was written by Rohan Kodialam and Jake Marcus, with additional contributions by Rebecca Boiarsky, Ike Lage, and Shannon Hwang.

This package was developed as part of a collaboration with Independence Blue Cross and would not have been possible without the advice and support of Aaron Smith-McLallen, Ravi Chawla, Kyle Armstrong, Luogang Wei, and Jim Denyer.

Comments
  • Questions - OMOP-LEARN

    Questions - OMOP-LEARN

    hello,

    I went through the documentation and have a couple of questions

    Q1) does omop-learn only support binary classification task as of now?

    Q2) is there any youtube tutorial which walks us through a simple example? I know there is a python notebook for eol.

    opened by Ak784 3
  • CDM schema name is hard-coded in cohort generation SQL scripts

    CDM schema name is hard-coded in cohort generation SQL scripts

    Description of issue

    The name of the cdm schema will be different in each environment but it is hard coded as 'cdm' in the cohort generation SQL scripts. One example is here.

    Proposed solution

    Add a parameter to the SQL files for the name of the cdm schema and replace all occurrences of the cdm schema name with the parameter.

    opened by ablack3 2
  • No setup.cfg to be able to install into another project

    No setup.cfg to be able to install into another project

    python3 -m pip install git+https://github.com/clinicalml/omop-learn.git does not work without a setup.cfg file. This would be advantageous to be able to install the project into other git repos.

    opened by crogers923 1
  • new cohort sql file for v5.3.1. support and a cdm version field in the config file

    new cohort sql file for v5.3.1. support and a cdm version field in the config file

    Hi everyone,

    My name is Egill, a postdoc from Erasmus MC in Rotterdam and I was testing your package against some synthetic data. I will eventually also test it on real data.

    I needed to make some slight changes to the sql code so it would run on v5.3.1 of the cdm. I think the most common version in use is v5.3.1 so it would be good to support that. The only relevant difference from v5.3.1 to v6 is that the death_dates are not in the person table but in a death table. So I added a new cohort sql file to support that. Then I added a cdm version field in the config.

    Then when running the notebooks I just added something like:

    if config.CDM_VERSION == 'v5.3.1': 
        cohort_script_path = config.SQL_PATH_COHORTS + '/gen_EOL_cohort_v531.sql'  
    else:  
        cohort_script_path = config.SQL_PATH_COHORTS + '/gen_EOL_cohort.sql'  
    

    I didn't include the notebooks in the pull request since then I would overwrite the figures which are there, and since the data I used was synthetic my figures were not so good to have as an example.

    Cheers, Egill

    opened by egillax 1
  • Add ORM Example Code

    Add ORM Example Code

    Adding examples of ORM based code for an EOL cohort, as well as feature generation. Other minor changes based on running on IBC systems, including bug fixes.

    opened by neilmdixit 0
  • Not all databases support COPY TO syntax

    Not all databases support COPY TO syntax

    Redshift does not seem to support queries of the form

    copy (select * from schema.table)
    to sdtout
    with csv HEADER
    

    which are used in the FeatureGenerator

    It seems like cross-database COPY TO syntax might be problematic. Is the intention for this code to run on several database platforms or just postgres?

    enhancement 
    opened by ablack3 1
Owner
Sontag Lab
Machine learning algorithms and applications to health care.
Sontag Lab
A data preprocessing package for time series data. Design for machine learning and deep learning.

A data preprocessing package for time series data. Design for machine learning and deep learning.

Allen Chiang 152 Jan 7, 2023
machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service

This is a machine learning model deployment project of Iris classification model in a minimal UI using flask web framework and deployed it in Azure cloud using Azure app service. We initially made this project as a requirement for an internship at Indian Servers. We are now making it open to contribution.

Krishna Priyatham Potluri 73 Dec 1, 2022
SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow

SmartSim makes it easier to use common Machine Learning (ML) libraries like PyTorch and TensorFlow, in High Performance Computing (HPC) simulations and workloads.

Cray Labs 139 Jan 1, 2023
A handy tool for common machine learning models' hyper-parameter tuning.

Common machine learning models' hyperparameter tuning This repo is for a collection of hyper-parameter tuning for "common" machine learning models, in

Kevin Hu 2 Jan 27, 2022
Data science, Data manipulation and Machine learning package.

duality Data science, Data manipulation and Machine learning package. Use permitted according to the terms of use and conditions set by the attached l

David Kundih 3 Oct 19, 2022
High performance, easy-to-use, and scalable machine learning (ML) package, including linear model (LR), factorization machines (FM), and field-aware factorization machines (FFM) for Python and CLI interface.

What is xLearn? xLearn is a high performance, easy-to-use, and scalable machine learning package that contains linear model (LR), factorization machin

Chao Ma 3k Jan 8, 2023
A Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.

Master status: Development status: Package information: TPOT stands for Tree-based Pipeline Optimization Tool. Consider TPOT your Data Science Assista

Epistasis Lab at UPenn 8.9k Jan 9, 2023
Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Python Extreme Learning Machine (ELM) Python Extreme Learning Machine (ELM) is a machine learning technique used for classification/regression tasks.

Augusto Almeida 84 Nov 25, 2022
Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques

Vowpal Wabbit is a machine learning system which pushes the frontier of machine learning with techniques such as online, hashing, allreduce, reductions, learning2search, active, and interactive learning.

Vowpal Wabbit 8.1k Dec 30, 2022
CD) in machine learning projectsImplementing continuous integration & delivery (CI/CD) in machine learning projects

CML with cloud compute This repository contains a sample project using CML with Terraform (via the cml-runner function) to launch an AWS EC2 instance

Iterative 19 Oct 3, 2022
Python package for stacking (machine learning technique)

vecstack Python package for stacking (stacked generalization) featuring lightweight functional API and fully compatible scikit-learn API Convenient wa

Igor Ivanov 671 Dec 25, 2022
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning

imbalanced-learn imbalanced-learn is a python package offering a number of re-sampling techniques commonly used in datasets showing strong between-cla

null 6.2k Jan 1, 2023
ELI5 is a Python package which helps to debug machine learning classifiers and explain their predictions

A library for debugging/inspecting machine learning classifiers and explaining their predictions

null 154 Dec 17, 2022
Data Version Control or DVC is an open-source tool for data science and machine learning projects

Continuous Machine Learning project integration with DVC Data Version Control or DVC is an open-source tool for data science and machine learning proj

Azaria Gebremichael 2 Jul 29, 2021
A simple machine learning package to cluster keywords in higher-level groups.

Simple Keyword Clusterer A simple machine learning package to cluster keywords in higher-level groups. Example: "Senior Frontend Engineer" --> "Fronte

Andrea D'Agostino 10 Dec 18, 2022
Model Agnostic Confidence Estimator (MACEST) - A Python library for calibrating Machine Learning models' confidence scores

Model Agnostic Confidence Estimator (MACEST) - A Python library for calibrating Machine Learning models' confidence scores

Oracle 95 Dec 28, 2022
Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational model)

Sum-Square_Error-Business-Analytical-Tool- Built on python (Mathematical straight fit line coordinates error predictor machine learning foundational m

om Podey 1 Dec 3, 2021
To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

To design and implement the Identification of Iris Flower species using machine learning using Python and the tool Scikit-Learn.

Astitva Veer Garg 1 Jan 11, 2022
A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

A mindmap summarising Machine Learning concepts, from Data Analysis to Deep Learning.

Daniel Formoso 5.7k Dec 30, 2022