Library of Stan Models for Survival Analysis

Overview

Build Status Coverage Status PyPI version

survivalstan: Survival Models in Stan

author: Jacki Novik

Overview

Library of Stan Models for Survival Analysis

Features:

  • Variety of standard survival models
    • Weibull, Exponential, and Gamma parameterizations
    • PEM models with variety of baseline hazards
    • PEM model with varying-coefficients (by group)
    • PEM model with time-varying-effects
  • Extensible framework - bring your own Stan code, or edit the models above
  • Uses pandas data frames & patsy formulas
  • Graphical posterior predictive checking (currently PEM models only)
  • Plot posterior estimates of key parameters using seaborn
  • Annotate posterior draws of parameter estimates, format as pandas dataframes
  • Works with extensions to pystan, such as stancache or pystan-cache

Support

Documentation is available online.

For help, please reach out to us on gitter.

Installation / Usage

Install using pip, as:

$ pip install survivalstan

Or, you can clone the repo:

$ git clone https://github.com/hammerlab/survivalstan.git
$ pip install .

Contributing

Please contribute to survivalstan development by letting us know if you encounter any bugs or have specific feature requests.

In addition, we welcome contributions of:

  • Stan code for survival models
  • Worked examples, as jupyter notebooks or markdown documents

Usage examples

There are several examples included in the example-notebooks, roughly one corresponding to each model.

If you are not sure where to start, Test pem_survival_model with simulated data.ipynb contains the most explanatory text. Many of the other notebooks are sparse on explanation, but do illustrate variations on the different models.

For basic usage:

import survivalstan
import stanity
import seaborn as sb
import matplotlib.pyplot as plt
import statsmodels

## load flchain test data from R's `survival` package
dataset = statsmodels.datasets.get_rdataset(package = 'survival', dataname = 'flchain' )
d  = dataset.data.query('futime > 7')
d.reset_index(level = 0, inplace = True)

## e.g. fit Weibull survival model
testfit_wei = survivalstan.fit_stan_survival_model(
	model_cohort = 'Weibull model',
	model_code = survivalstan.models.weibull_survival_model,
	df = d,
	time_col = 'futime',
	event_col = 'death',
	formula = 'age + sex',
	iter = 3000,
	chains = 4,
	make_inits = survivalstan.make_weibull_survival_model_inits
	)

## coefplot for Weibull coefficient estimates
sb.boxplot(x = 'value', y = 'variable', data = testfit_wei['coefs'])

## or, use plot_coefs
survivalstan.utils.plot_coefs([testfit_wei])

## print summary of MCMC draws from posterior for each parameter
print(testfit_wei['fit'])


## e.g. fit Piecewise-exponential survival model 
dlong = survivalstan.prep_data_long_surv(d, time_col = 'futime', event_col = 'death')
testfit_pem = survivalstan.fit_stan_survival_model(
	model_cohort = 'PEM model',
	model_code = survivalstan.models.pem_survival_model,
	df = dlong,
	sample_col = 'index',
	timepoint_end_col = 'end_time',
	event_col = 'end_failure',
	formula = 'age + sex',
	iter = 3000,
	chains = 4,
	)

## print summary of MCMC draws from posterior for each parameter
print(testfit_pem['fit'])

## coefplot for PEM model results
sb.boxplot(x = 'value', y = 'variable', data = testfit_pem['coefs'])

## plot baseline hazard (only PEM models)
survivalstan.utils.plot_coefs([testfit_pem], element='baseline')

## posterior-predictive checking (only PEM models)
survivalstan.utils.plot_pp_survival([testfit_pem])

## e.g. compare models using PSIS-LOO
stanity.loo_compare(testfit_wei['loo'], testfit_pem['loo'])

## compare coefplots 
sb.boxplot(x = 'value', y = 'variable', hue = 'model_cohort',
    data = testfit_pem['coefs'].append(testfit_wei['coefs']))
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

## (or, use survivalstan.utils.plot_coefs)
survivalstan.utils.plot_coefs([testfit_wei, testfit_pem])

Comments
  • Feature/add patsy formulas

    Feature/add patsy formulas

    Some code refactoring in order to support using patsy formulas for LHS as well as RHS spec. Closes #52

    Models can now use the following syntax:

    survivalstan.fit_stan_survival_model(
        formula = 'surv(event_status=end_failure, time=t) ~ age + sex',
        data = d,
        model = survivalstan.models.exp_survival_model,
         ...
    )
    

    Optional parameters to surv include group= and subject=.

    The previous syntax is still also supported:

    survivalstan.fit_stan_survival_model(
        event_col = 'end_failure', 
        time_col = 't',
        formula = ' ~ age + sex',
        ...
    ) 
    

    Moved a fair amount of data-processing into the survivalstan.formulas module, which extends patsy syntax to support the above. For example, the patsy.ModelDesc module includes the stan-data dict which gets passed to pystan.stan for sampling.

    This ipynb tests early versions of this code and may help demonstrate how the elements in formulas.py fit together.

    opened by jburos 15
  • stanity requirement

    stanity requirement

    git cloneing this repo and then pip install -e . does not result in stanity being installed, even though it seems like it should based on the setup.py code. Not sure why?

    bug 
    opened by tavinathanson 6
  • v0.1.2 release

    v0.1.2 release

    Enhancements:

    • Improve plot_coefs & supporting functions
      • support a wider variety of coefficient types (element="baseline", element="grp_coefs", etc)
      • add trans option, so that one can plot exp(beta) or beta
      • support different types of grp_coefs, such as vector of vectors vs matrix
    • some bug-fixes
      • make fit_stan_survival_model more robust to failures in post-processing
    • updated TVC model
      • based on random-walk baseline hazard model
      • use random-walk (cumulative delta) to model TVCs
      • estimate hazard on log-scale for stability

    New features:

    • Added pem-gamma model
    • Graphical posterior-predictive checking, via:
      • survivalstan.utils.plot_pp_survival(models=[my_model])
      • (optionally overlay observed values using plot_observed_survival(..))
    • Get posterior-predictive data via:
      • survivalstan.utils.prep_pp_data
    • updated tvc, pem-randomwalk & pem-gamma to support ppcheck functions
    opened by jburos 5
  • 0.1.1 release

    0.1.1 release

    Several fixes & cleanup items, in preparation for first real release.

    bug fixes

    1. Fix bug estimating null model (#8)

    enhancements

    1. add utils for plot_coefs
      • modify plot_coefs to accept arbitrary parameter names to extract (e.g. baseline_raw)
    2. add utils for extract_params_long
    3. refactor some parts of fit_stan_survival_model
    4. precompute t_dur & t_obs & pass these in as data instead of using transformed data block

    new models

    1. modified default pem_survival_model - is the unstructured model.
    2. added exp_survival_model.stan
    3. added pem_survival_model_unstructured.stan
    4. added pem_survival_model_randomwalk.stan
    5. remove varying_coefs2 and varying_coefs3 models for now.
    opened by jburos 5
  • more options

    more options

    Few small changes:

    1. Don't fail if psis-loo doesn't run after model estimation
    2. Add drop_intercept param to fit_stan_survival_model, since not all models want intercept terms
    3. New example model notebook containing gamma survival model, in bring-your-own model form
    opened by jburos 4
  • add patsy formulas for Surv() syntax

    add patsy formulas for Surv() syntax

    This is a revised PR for the add-patsy-formulas feature (PR #60).

    Since then:

    1. Improved code formatting
    2. fix log_t_dur spec (#71)
    3. fix dataframe merge error on pp_surv (#76)

    Comments from previous PR:

    Some code refactoring in order to support using patsy formulas for LHS as well as RHS spec. Closes #52

    Models can now use the following syntax:

    survivalstan.fit_stan_survival_model(
        formula = 'surv(event_status=end_failure, time=t) ~ age + sex',
       data = d,
       model = survivalstan.models.exp_survival_model,
        ...
    )
    

    Optional parameters to surv include group= and subject=.

    The previous syntax is still also supported:

    survivalstan.fit_stan_survival_model(
       event_col = 'end_failure', 
       time_col = 't',
       formula = ' ~ age + sex',
       ...
    ) 
    

    Moved a fair amount of data-processing into the survivalstan.formulas module, which extends patsy syntax to support the above. For example, the patsy.ModelDesc module includes the stan-data dict which gets passed to pystan.stan for sampling.

    This ipynb tests early versions of this code and may help demonstrate how the elements in formulas.py fit together.

    opened by jburos 3
  • Time and event column types

    Time and event column types

    Following the PEM example with my own data, I got unreasonable results using survivalstan.utils.plot_observed_survival.

    After casting my time column to float (was int before) and the event column to boolean (was 0/1 before), everything worked.

    Is this intentional?

    bug 
    opened by adam-haber 3
  • refactor code to anticipate joint modeling work

    refactor code to anticipate joint modeling work

    Changes in this PR:

    1. pull data-prep into a separate class: SurvivalStanData. Preparation of these data is done via that class's __init__ method.
      • is a class because meta-data about the processing step (e.g. coefficient names, etc) are needed in the fit-stan object to support variety of post-processing.
    2. new parameter input_data, allowing user to prepare the SurvivalStanData object separately & pass in as an argument.
    3. minor bugfix for case where data are provided in "long" format & contain only a single failure time.
    opened by jburos 3
  • Dataframe Merge error in utils.plot_pp_survival (Test pem_survival_model with simulated data notebook)

    Dataframe Merge error in utils.plot_pp_survival (Test pem_survival_model with simulated data notebook)

    In the notebook, when I ran the code : survivalstan.utils.plot_pp_survival([testfit], by='sex') I got the following error:

    ValueError Traceback (most recent call last) in () ----> 1 survivalstan.utils.plot_pp_survival([testfit], by='sex')

    ~/.conda/envs/test-env/lib/python3.5/site-packages/survivalstan/utils.py in plot_pp_survival(models, time_element, event_element, num_ticks, step_size, ticks_at, time_col, event_col, fill, by, alpha, pal, subplot, **kwargs) 326 pp_surv = prep_pp_survival_data(models, time_element=time_element, 327 event_element=event_element, time_col=time_col, --> 328 event_col=event_col, by=by) 329 print(pp_surv.head()) 330 if by:

    ~/.conda/envs/test-env/lib/python3.5/site-packages/survivalstan/utils.py in prep_pp_survival_data(models, time_element, event_element, time_col, event_col, by, **kwargs) 260 by=None, 261 **kwargs): --> 262 pp_data = prep_pp_data(models, time_element=time_element, event_element=event_element, time_col=time_col, event_col=event_col, **kwargs) 263 groups = ['iter', 'model_cohort'] 264 if by:

    ~/.conda/envs/test-env/lib/python3.5/site-packages/survivalstan/utils.py in prep_pp_data(models, time_element, event_element, event_col, time_col, **kwargs) 250 def prep_pp_data(models, time_element='y_hat_time', event_element='y_hat_event', event_col='event_status', time_col='event_time', **kwargs): 251 data = [_prep_pp_data_single_model(model=model, event_element=event_element, time_element=time_element, event_col=event_col, time_col=time_col, **kwargs) --> 252 for model in models] 253 data = pd.concat(data) 254 data.sort_values([time_col, 'iter'], inplace=True)

    ~/.conda/envs/test-env/lib/python3.5/site-packages/survivalstan/utils.py in (.0) 250 def prep_pp_data(models, time_element='y_hat_time', event_element='y_hat_event', event_col='event_status', time_col='event_time', **kwargs): 251 data = [_prep_pp_data_single_model(model=model, event_element=event_element, time_element=time_element, event_col=event_col, time_col=time_col, **kwargs) --> 252 for model in models] 253 data = pd.concat(data) 254 data.sort_values([time_col, 'iter'], inplace=True)

    ~/.conda/envs/test-env/lib/python3.5/site-packages/survivalstan/utils.py in _prep_pp_data_single_model(model, time_element, event_element, sample_col, time_col, event_col, join_with, sample_id_col) 244 #pp_data['index'] = pp_data['index'].astype(int) 245 if join_with: --> 246 pp_data = pd.merge(pp_data, model[join_with], on=sample_col, suffixes=['','_original']) 247 return pp_data 248

    ~/.conda/envs/test-env/lib/python3.5/site-packages/pandas/core/reshape/merge.py in merge(left, right, how, on, left_on, right_on, left_index, right_index, sort, suffixes, copy, indicator, validate) 59 right_index=right_index, sort=sort, suffixes=suffixes, 60 copy=copy, indicator=indicator, ---> 61 validate=validate) 62 return op.get_result() 63

    ~/.conda/envs/test-env/lib/python3.5/site-packages/pandas/core/reshape/merge.py in init(self, left, right, how, on, left_on, right_on, axis, left_index, right_index, sort, suffixes, copy, indicator, validate) 553 # validate the merge keys dtypes. We may need to coerce 554 # to avoid incompat dtypes --> 555 self._maybe_coerce_merge_keys() 556 557 # If argument passed to validate,

    ~/.conda/envs/test-env/lib/python3.5/site-packages/pandas/core/reshape/merge.py in _maybe_coerce_merge_keys(self) 984 elif (not is_numeric_dtype(lk) 985 and (is_numeric_dtype(rk) and not is_bool_dtype(rk))): --> 986 raise ValueError(msg) 987 elif is_datetimelike(lk) and not is_datetimelike(rk): 988 raise ValueError(msg)

    ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

    Resolution: I debugged the code and in the utils.py, the below function was producing a dataframe pp_data in which the index column had a type str: _prep_pp_data_single_model(model, time_element='y_hat_time', event_element='y_hat_event', sample_col=None, time_col='event_time', event_col='event_status', join_with='df_all', sample_id_col=None):

    To solve this I just changed the type to int by using below line and the code ran fine without any errors: pp_data = pd.merge(pp_event_time, pp_event_status, on=['iter', sample_col, 'model_cohort']) pp_data['index'] = pp_data['index'].astype(int) if join_with: pp_data = pd.merge(pp_data, model[join_with], on=sample_col, suffixes=['','_original']) return pp_data

    bug documentation 
    opened by prabhjot12 2
  • NameError: FIT_FUN

    NameError: FIT_FUN

    Hey all -

    First, thank you for all the work that goes into this. This is making my job so much easier.

    I'm imitating this: https://github.com/hammerlab/survivalstan/blob/master/example-notebooks/Test%20pem_survival_model_timevarying%20with%20simulated%20data.ipynb

    I'm getting error "FIT_FUN". I'm imitating everything fine on new data until this part. I'm using the the time varying weibull survival instead of the PEM in the example. i.e.:

    fit = survivalstan.fit_stan_survival_model(
        model_cohort = 'test model',
        model_code = survivalstan.models.weibull_survival_model_varying_coefs,
        df = df,
        sample_col = 'MRN',
        timepoint_end_col = 'Time',
        event_col = 'Death',
        formula = '~ CFR_Global + Etiology_CMP + Tacro + Cyclo + Syro + Evero + DM + HTN + eGFR + BMI + Age + AMR_Y/N',
        iter = 10000,
        chains = 4,
        seed = 8128,
        FIT_FUN = stancache.cached_stan_file,
        )
    

    I'll keep staring at it, but is there something stupid I'm missing?

    Thanks all... Andre

    opened by drezap 2
  • Unknown compilation issue

    Unknown compilation issue

    I ran into a compilation issue when running the following on my Mac, from the README (seems to work on Linux):

    ## e.g. fit Weibull survival model
    testfit_wei = survivalstan.fit_stan_survival_model(
        model_cohort = 'Weibull model',
        model_code = survivalstan.models.weibull_survival_model,
        df = d,
        time_col = 'futime',
        event_col = 'death',
        formula = 'age + sex',
        iter = 3000,
        chains = 4,
        make_inits = survivalstan.make_weibull_survival_model_inits
        )
    

    I'll try to post the stack trace and associated errors when I reproduce.

    bug 
    opened by tavinathanson 2
  • model fit works when running on shell, but runs into error when executing as script

    model fit works when running on shell, but runs into error when executing as script

    i'm trying to build a bayes model object using the survivalstan package, the model object compiles properly when the script below is executed directly on a python shell (copy + paste method into the interpreter). however, when I execute the entire script at once using the command line format, I get a variable error (see below)

    File "C:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code exec(code, run_globals) File "C:\Users\thomasc\Documents\WeibullMethod\weibull_bayes_demo.py", line 117, in model = main() File "C:\Users\thomasc\Documents\WeibullMethod\weibull_bayes_demo.py", line 112, in main make_inits = survivalstan.make_weibull_survival_model_inits) File "C:\ProgramData\Anaconda3\lib\site-packages\survivalstan-0.1.2.8-py3.7.egg\survivalstan\survivalstan.py", line 167, in fit_stan_survival_model File "C:\ProgramData\Anaconda3\lib\site-packages\stanity-0.1.4-py3.7.egg\stanity\fit.py", line 37, in fit File "C:\ProgramData\Anaconda3\lib\site-packages\pystan\api.py", line 447, in stan n_jobs=n_jobs, **kwargs) File "C:\ProgramData\Anaconda3\lib\site-packages\pystan\model.py", line 813, in sampling ret_and_samples = _map_parallel(call_sampler_star, call_sampler_args, n_jobs) File "C:\ProgramData\Anaconda3\lib\site-packages\pystan\model.py", line 87, in _map_parallel pool.close()

    how can I prevent this error? I do not wish to have to interact with a python shell directly to train a model. code below to train model, directly same as the example code for this package this compiles if copy and pasted directly into a shell, but fails if saved as a .py script and executed from command line as python