Patient Selection for Diabetes Drug Testing

Project Overview

EHR data is becoming a key source of real-world evidence (RWE) that the pharmaceutical industry and regulators use to make decisions on clinical trials. You are a data scientist for an exciting unicorn healthcare startup that has created a groundbreaking diabetes drug that is ready for clinical trial testing. It is a unique and sensitive drug that must be administered over at least 5-7 days in the hospital (the exact number of days is a cutoff chosen from the distribution of hospitalization time seen in the data), with frequent monitoring/testing and patient medication adherence training with a mobile application. You have been provided a patient dataset from a client partner and are tasked with building a predictive model that can identify which types of patients the company should focus its testing efforts on. Target patients are people who are likely to be in the hospital for this duration and for whom administering and monitoring the drug will not incur significant additional costs.

To achieve your goal, you must first build a regression model that predicts the estimated hospitalization time for a patient and also provides an uncertainty range for that prediction, so that you can rank the predictions by their uncertainty range.
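As a rough illustration of ranking predictions by their uncertainty range, the sketch below assumes each prediction comes with a mean and a standard deviation (for example, taken from a model's output distribution) and that a normal approximation is acceptable. The `Prediction` class, the patient IDs, and the numbers are hypothetical and are not part of the project starter code.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    patient_id: str    # hypothetical identifier
    mean_days: float   # predicted hospitalization time in days
    std_days: float    # predicted standard deviation (uncertainty)


def uncertainty_range(pred, z=1.96):
    """Approximate 95% range, assuming a normal predictive distribution."""
    lower = max(0.0, pred.mean_days - z * pred.std_days)  # a stay cannot be negative
    upper = pred.mean_days + z * pred.std_days
    return lower, upper


def rank_by_uncertainty(preds):
    """Order predictions from most certain (smallest std) to least certain."""
    return sorted(preds, key=lambda p: p.std_days)


# Made-up example values, for illustration only.
preds = [
    Prediction("p1", 6.0, 2.5),
    Prediction("p2", 7.0, 0.5),
    Prediction("p3", 3.0, 1.0),
]
ranked = rank_by_uncertainty(preds)
for p in ranked:
    low, high = uncertainty_range(p)
    print(f"{p.patient_id}: {p.mean_days:.1f} days, range {low:.2f}-{high:.2f}")
```

Predictions with a narrow range can be acted on first, while wide-range predictions can be set aside for review.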

Expected Hospitalization Time Regression and Uncertainty Estimation Model: Utilizing a synthetic dataset (upsampled, denormalized, with line-level augmentation) built from the UCI Diabetes readmission dataset, students will build a regression model that predicts the expected days of hospitalization time along with an uncertainty range estimate.

This project will demonstrate the importance of building the right data representation at the encounter level, with appropriate filtering and preprocessing/feature engineering of key medical code sets. It will also require students to analyze and interpret their model for biases across key demographic groups. Lastly, students will use the TensorFlow Probability library to provide uncertainty range estimates in the regression output, so that predictions can be prioritized and triaged by their uncertainty level.
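To make the idea of an encounter-level representation concrete, here is a minimal sketch in plain Python that collapses line-level rows into one record per encounter. The field names (`encounter_id`, `patient_nbr`, `time_in_hospital`, `ndc_code`) and the sample rows are assumptions for illustration; check the data schema references for the actual columns.

```python
def aggregate_to_encounter(rows, key="encounter_id", list_field="ndc_code"):
    """Collapse line-level rows into one record per encounter.

    Fields other than `list_field` are copied from the first line seen for
    each encounter; the `list_field` values are gathered into a list.
    """
    encounters = {}  # dicts keep insertion order, so encounter order is preserved
    list_name = list_field + "_list"
    for row in rows:
        enc = encounters.get(row[key])
        if enc is None:
            enc = {k: v for k, v in row.items() if k != list_field}
            enc[list_name] = []
            encounters[row[key]] = enc
        value = row.get(list_field)
        # Skip missing codes and avoid repeating a code within an encounter.
        if value is not None and value not in enc[list_name]:
            enc[list_name].append(value)
    return list(encounters.values())


# Hypothetical line-level rows: encounter 1 has two medication lines.
line_rows = [
    {"encounter_id": 1, "patient_nbr": 10, "time_in_hospital": 5, "ndc_code": "A"},
    {"encounter_id": 1, "patient_nbr": 10, "time_in_hospital": 5, "ndc_code": "B"},
    {"encounter_id": 2, "patient_nbr": 11, "time_in_hospital": 2, "ndc_code": None},
]
encounter_rows = aggregate_to_encounter(line_rows)
print(encounter_rows)
```

Without this step, an encounter with many lines would be counted several times and would carry more weight in training than an encounter with a single line.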

In the end, you will create a demographic bias analysis to detect whether your model has any bias, which we know can be a huge issue when working with healthcare data!

Project Instructions

Project Instructions & Prerequisites
Learning Objectives
Steps to Completion

1. Project Instructions

Context: EHR data is becoming a key source of real-world evidence (RWE) that the pharmaceutical industry and regulators use to make decisions on clinical trials. You are a data scientist for an exciting unicorn healthcare startup that has created a groundbreaking diabetes drug that is ready for clinical trial testing. It is a unique and sensitive drug that must be administered over at least 5-7 days in the hospital, with frequent monitoring/testing and patient medication adherence training with a mobile application. You have been provided a patient dataset from a client partner and are tasked with building a predictive model that can identify which types of patients the company should focus its testing efforts on. Target patients are people who are likely to be in the hospital for this duration and for whom administering and monitoring the drug will not incur significant additional costs.

To achieve your goal, you must build a regression model that predicts the estimated hospitalization time for a patient, and use this prediction to select/filter patients for your study.

Expected Hospitalization Time Regression Model: Utilizing a synthetic dataset (denormalized, with line-level augmentation) built from the UCI Diabetes readmission dataset, students will build a regression model that predicts the expected days of hospitalization time and then convert this to a binary prediction of whether to include or exclude that patient from the clinical trial.
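The conversion from a regression output to an include/exclude decision can be sketched as a simple threshold. The 5-day cutoff below is an assumption taken from the lower end of the "at least 5-7 days" requirement; the function name and the sample predictions are hypothetical.

```python
def select_for_trial(predicted_days, min_days=5.0):
    """Return 1 (include) when the predicted stay meets the cutoff, else 0 (exclude)."""
    return [1 if days >= min_days else 0 for days in predicted_days]


# Made-up predicted hospitalization times, in days.
predicted = [2.3, 5.0, 7.8, 4.9]
labels = select_for_trial(predicted)
print(labels)  # [0, 1, 1, 0]
```

The cutoff is a study design choice, so it is worth checking how the number of selected patients changes when it moves between 5 and 7 days.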

Dataset

Due to healthcare PHI regulations (HIPAA, HITECH), there are a limited number of publicly available datasets, and some datasets require training and approval. For the purpose of this exercise, we are using a dataset from UC Irvine that has been modified for this course. Please note that it is limited in its representation of some key features, such as diagnosis codes, which are usually an unordered list in 835s/837s (the ASC X12 EDI transaction formats used for remits and claims, respectively).

https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008

Data Schema

The dataset reference information can be found at https://github.com/udacity/nd320-c1-emr-data-starter/tree/master/project/data_schema_references. There are two CSVs that provide more details on the fields and some of the mapped values.

Project Submission

When submitting this project, make sure to run all the cells before saving the notebook. Save the notebook file as "student_project_submission.ipynb" and save another copy as an HTML file by clicking "File" -> "Download as.." -> "html". Include the "utils.py" and "student_utils.py" files in your submission. Put most of the code you write in student_utils.py, and write the summary and text explanations inline in the notebook. Once you have downloaded these files, compress them into one zip file for submission in the Udacity Classroom.

Prerequisites

Intermediate level knowledge of Python
Basic knowledge of probability and statistics
Basic knowledge of machine learning concepts
Installation of TensorFlow 2.0 and other dependencies (conda environment.yml or virtualenv requirements.txt file provided)

Environment Setup

For step-by-step instructions on creating your environment, please go to https://github.com/udacity/nd320-c1-emr-data-starter/blob/master/README.md

2. Learning Objectives

By the end of the project, you will be able to:

Use the TensorFlow Dataset API to scalably extract, transform, and load datasets, and build datasets aggregated at the line, encounter, and patient (longitudinal) data levels
Analyze EHR datasets to check for common issues (data leakage, statistical properties, missing values, high cardinality) by performing exploratory data analysis
Create categorical features from key industry code sets (ICD, CPT, NDC) and reduce dimensionality for high-cardinality features by using embeddings
Create derived features (bucketing, cross-features, embeddings) using TensorFlow feature columns on both continuous and categorical input features
Use the TensorFlow Probability library to train a model that provides uncertainty range predictions, allowing for risk adjustment/prioritization and triaging of predictions
Analyze and determine biases of a model for key demographic groups by evaluating performance metrics across groups with the Aequitas framework
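One common source of the data leakage mentioned in the objectives above is letting encounters from the same patient land in both the training and test sets. The sketch below shows one way to split by patient instead of by row; it is an illustration under assumptions, not the project's required method. The `patient_nbr` key, the split fraction, and the sample rows are hypothetical.

```python
import random


def split_by_patient(rows, patient_key="patient_nbr", test_fraction=0.2, seed=42):
    """Split rows so that every patient appears in exactly one partition."""
    patients = sorted({row[patient_key] for row in rows})
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [r for r in rows if r[patient_key] not in test_patients]
    test = [r for r in rows if r[patient_key] in test_patients]
    return train, test


# Hypothetical data: 10 patients with 2 encounters each.
rows = [
    {"patient_nbr": p, "encounter_id": p * 10 + e}
    for p in range(10)
    for e in range(2)
]
train_rows, test_rows = split_by_patient(rows)
print(len(train_rows), len(test_rows))  # 16 4
```

Splitting on the patient identifier keeps the evaluation honest when a patient has several encounters with similar features.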

3. Steps to Completion

Please follow all of the directions in the Jupyter Notebook file in the classroom workspace, or from the GitHub repo if you decide to use your own environment to complete the project.

You will complete the following steps there:

Data Analysis
Create Categorical Features with TF Feature Columns
Create Continuous/Numerical Features with TF Feature Columns
Build Deep Learning Regression Model with Sequential API and TF Probability Layers
Evaluating Potential Model Biases with Aequitas Toolkit
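For the bias evaluation step, the underlying idea is to compare performance metrics across demographic groups. The sketch below computes the true positive rate and false positive rate per group in plain Python. It does not use the Aequitas API; the field names (`race`, `label`, `pred`) and the records are made up for illustration.

```python
from collections import defaultdict


def rates_by_group(records, group_key="race"):
    """Compute TPR and FPR for each group from binary labels and predictions."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for r in records:
        c = counts[r[group_key]]
        if r["label"] == 1:
            c["tp" if r["pred"] == 1 else "fn"] += 1
        else:
            c["fp" if r["pred"] == 1 else "tn"] += 1
    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        result[group] = {
            # None marks a rate that is undefined for this group.
            "tpr": c["tp"] / positives if positives else None,
            "fpr": c["fp"] / negatives if negatives else None,
        }
    return result


# Hypothetical binary labels and predictions for two groups.
records = [
    {"race": "A", "label": 1, "pred": 1},
    {"race": "A", "label": 1, "pred": 0},
    {"race": "A", "label": 0, "pred": 0},
    {"race": "A", "label": 0, "pred": 1},
    {"race": "B", "label": 1, "pred": 1},
    {"race": "B", "label": 1, "pred": 1},
    {"race": "B", "label": 0, "pred": 0},
]
group_rates = rates_by_group(records)
print(group_rates)
```

Large gaps in these rates between groups are a signal to investigate further, for example by comparing each group against a chosen reference group.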

Project Submission

Once you have completed your project, please:

Make sure the project meets all of the specifications on the Project Rubric
If you are working directly in our workspaces, you can submit your project directly there
If you are working in your own environment, or if you have issues submitting directly in the workspace, please zip up your files and submit them that way

Best of luck on the project. Remember that you can also use the resources provided in the student hub or talk with your mentor if you have questions.

The final project of "Applying AI to EHR Data" of "AI for Healthcare" nanodegree - Udacity.

Owner

Omar Laham
