Bayesian Additive Regression Trees For Python

Overview

BartPy

Introduction

BartPy is a pure Python implementation of the Bayesian additive regression trees (BART) model of Chipman et al. [1].

Reasons to use BART

  • Much less parameter optimization required than gradient boosted trees (GBT)
  • Provides confidence intervals in addition to point estimates
  • Extremely flexible through use of priors and embedding in bigger models
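As a rough illustration of the second point: a Bayesian fit yields many posterior samples of the prediction for each observation, and an interval can be read off their quantiles. The sketch below uses made-up sample values and only the standard library. It does not call BartPy, and `posterior_interval` is a hypothetical helper name.

```python
from statistics import mean, quantiles

def posterior_interval(samples, level=0.9):
    """Return (lower, point, upper) from posterior samples of one prediction."""
    # quantiles(n=100) returns the 1st..99th percentiles
    cuts = quantiles(samples, n=100, method="inclusive")
    tail = round((1 - level) / 2 * 100)  # e.g. 5 for a 90% interval
    return cuts[tail - 1], mean(samples), cuts[100 - tail - 1]

# Hypothetical posterior samples for a single observation
samples = [2.1, 1.9, 2.4, 2.0, 2.2, 1.8, 2.3, 2.05, 1.95, 2.15]
lower, point, upper = posterior_interval(samples)
print(f"{point:.2f} [{lower:.2f}, {upper:.2f}]")
```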

Reasons to use the library:

  • Can be plugged into existing sklearn workflows
  • Everything is done in pure Python, allowing for easy inspection of model runs
  • Designed to be extremely easy to modify and extend

Trade offs:

  • Speed - BartPy is significantly slower than other BART libraries
  • Memory - BartPy uses a lot of caching compared to other approaches
  • Instability - the library is still under construction

How to use:

There are two main APIs for BartPy:

  1. High level sklearn API
  2. Low level access for implementing custom conditions

If possible, it is recommended to use the sklearn API until you reach something that can't be implemented that way. The API is easier, shared with other models in the ecosystem, and allows simpler porting to other models.

Sklearn API

The high level API works as you would expect

from bartpy.sklearnmodel import SklearnModel
model = SklearnModel() # Use default parameters
model.fit(X, y) # Fit the model
predictions = model.predict() # Make predictions on the train set
out_of_sample_predictions = model.predict(X_test) # Make predictions on new data

The model object can be used in all of the standard sklearn tools, e.g. cross validation and grid search

from sklearn.model_selection import cross_validate
from bartpy.sklearnmodel import SklearnModel
model = SklearnModel() # Use default parameters
cross_validate(model, X, y) # X, y as in the example above

Extensions

BartPy offers a number of convenience extensions to base BART. The most prominent of these is using BART to predict the residuals of a base model. It is most natural to use a linear model as the base, but any sklearn-compatible model can be used.

from sklearn.linear_model import LinearRegression
from bartpy.extensions.baseestimator import ResidualBART
model = ResidualBART(base_estimator=LinearRegression())
model.fit(X, y)

A nice feature of this is that we can combine the interpretability of a linear model with the power of a tree model.
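The residual idea can be sketched without BartPy: fit a base model to y, fit a second model to what the base leaves unexplained, and add the two predictions. In the sketch below every class is a hypothetical stand-in written with the standard library only. In particular, `Stump` is a single-split regression tree used in place of BART, not BART itself.

```python
from statistics import mean

class LinearBase:
    """One-feature least-squares line (stand-in for an sklearn linear model)."""
    def fit(self, x, y):
        mx, my = mean(x), mean(y)
        sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sxx = sum((a - mx) ** 2 for a in x)
        self.slope = sxy / sxx
        self.intercept = my - self.slope * mx
        return self

    def predict(self, x):
        return [self.intercept + self.slope * a for a in x]

class Stump:
    """Single-split regression tree (stand-in for BART, not BART itself)."""
    def fit(self, x, y):
        best = None
        for t in sorted(set(x))[1:]:  # skip the smallest value so both sides are non-empty
            left = [b for a, b in zip(x, y) if a < t]
            right = [b for a, b in zip(x, y) if a >= t]
            ml, mr = mean(left), mean(right)
            sse = sum((b - ml) ** 2 for b in left) + sum((b - mr) ** 2 for b in right)
            if best is None or sse < best[0]:
                best = (sse, t, ml, mr)
        _, self.threshold, self.left, self.right = best
        return self

    def predict(self, x):
        return [self.left if a < self.threshold else self.right for a in x]

class ResidualModel:
    """Fit `base` to y, then fit `tree` to the residuals of the base model."""
    def __init__(self, base, tree):
        self.base, self.tree = base, tree

    def fit(self, x, y):
        self.base.fit(x, y)
        residuals = [b - p for b, p in zip(y, self.base.predict(x))]
        self.tree.fit(x, residuals)
        return self

    def predict(self, x):
        return [p + r for p, r in zip(self.base.predict(x), self.tree.predict(x))]

# Linear trend plus a step at x = 5 that a straight line cannot capture
x = [float(i) for i in range(10)]
y = [2.0 * a + (3.0 if a >= 5 else 0.0) for a in x]
model = ResidualModel(LinearBase(), Stump()).fit(x, y)
print(model.base.slope, model.tree.threshold)
```

The base model's slope and intercept stay directly readable, while the second model only has to explain the leftover structure.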

Lower level API

BartPy is designed to expose all of its internals, so that it can be extended and modified. In particular, using the lower level API it is possible to:

  • Customize the set of possible tree operations (prune and grow by default)
  • Control the order of sampling steps within a single Gibbs update
  • Extend the model to include additional sampling steps

Some care is recommended when making these types of changes. Over time the process will become easier, but today it is somewhat complex.

If all you want to customize are things like priors and number of trees, it is much easier to use the sklearn API
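To make the idea of controlling the order of sampling steps concrete, here is a minimal sketch in which a schedule is a generator of named steps and one Gibbs update simply runs them in order. None of these names are BartPy's API: `update_tree`, `update_sigma`, both schedules and `gibbs_update` are hypothetical, and the updates are placeholders rather than real BART conditionals.

```python
import random
from typing import Callable, Dict, Iterator, List, Tuple

Step = Tuple[str, Callable[[], None]]

def update_tree(state: Dict, index: int, rng: random.Random) -> None:
    # Placeholder for a tree mutation (grow or prune) plus leaf updates
    state["trees"][index] = rng.gauss(0.0, 1.0)

def update_sigma(state: Dict, rng: random.Random) -> None:
    # Placeholder for drawing the noise scale from its conditional
    state["sigma"] = abs(rng.gauss(1.0, 0.1))

def default_schedule(state: Dict, rng: random.Random) -> Iterator[Step]:
    """Every tree in turn, then sigma."""
    for i in range(len(state["trees"])):
        yield "Tree", lambda i=i: update_tree(state, i, rng)
    yield "Sigma", lambda: update_sigma(state, rng)

def sigma_first_schedule(state: Dict, rng: random.Random) -> Iterator[Step]:
    """Custom ordering: sigma before the trees."""
    yield "Sigma", lambda: update_sigma(state, rng)
    for i in range(len(state["trees"])):
        yield "Tree", lambda i=i: update_tree(state, i, rng)

def gibbs_update(schedule, state: Dict, rng: random.Random) -> List[str]:
    """Run one full update and return the names of the steps in order."""
    order = []
    for name, step in schedule(state, rng):
        step()
        order.append(name)
    return order

state = {"trees": [0.0, 0.0, 0.0], "sigma": 1.0}
rng = random.Random(0)
print(gibbs_update(default_schedule, state, rng))      # ['Tree', 'Tree', 'Tree', 'Sigma']
print(gibbs_update(sigma_first_schedule, state, rng))  # ['Sigma', 'Tree', 'Tree', 'Tree']
```

Keeping the schedule separate from the steps means that reordering or adding a step only touches the generator, not the code that runs the update.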

Alternative libraries

References

[1] https://arxiv.org/abs/0806.3286
[2] http://www.gatsby.ucl.ac.uk/~balaji/pgbart_aistats15.pdf
[3] https://arxiv.org/ftp/arxiv/papers/1309/1309.1906.pdf
[4] https://cran.r-project.org/web/packages/BART/vignettes/computing.pdf

Comments
  • Cannot import bartpy.samplers

    from bartpy.sklearnmodel import SklearnModel
    from bartpy.featureselection import SelectNullDistributionThreshold, SelectSplitProportionThreshold
    from bartpy.diagnostics.features import *
    

    ModuleNotFoundError                       Traceback (most recent call last)
    in
    ----> 1 from bartpy.sklearnmodel import SklearnModel
          2 from bartpy.featureselection import SelectNullDistributionThreshold, SelectSplitProportionThreshold
          3 from bartpy.diagnostics.features import *

    ~/miniconda3/envs/viz/lib/python3.7/site-packages/bartpy/sklearnmodel.py in
          9 from bartpy.data import Data
         10 from bartpy.model import Model
    ---> 11 from bartpy.samplers.leafnode import LeafNodeSampler
         12 from bartpy.samplers.modelsampler import ModelSampler, Chain
         13 from bartpy.samplers.schedule import SampleSchedule

    ModuleNotFoundError: No module named 'bartpy.samplers'

    from bartpy.samplers import *

    ModuleNotFoundError                       Traceback (most recent call last)
    in
    ----> 1 from bartpy.samplers import *

    ModuleNotFoundError: No module named 'bartpy.samplers'

    opened by dsvolk 6
  • No Module named bartpy.samplers

    The error occurs on "import SklearnModel":

    import bartpy
    from bartpy.sklearnmodel import SklearnModel
    
    ModuleNotFoundError                       Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_1232/2717476998.py in <module>
    ----> 1 from bartpy.sklearnmodel import SklearnModel
    
    ~\Anaconda3\envs\env\lib\site-packages\bartpy\sklearnmodel.py in <module>
          9 from bartpy.data import Data
         10 from bartpy.model import Model
    ---> 11 from bartpy.samplers.leafnode import LeafNodeSampler
         12 from bartpy.samplers.modelsampler import ModelSampler, Chain
         13 from bartpy.samplers.schedule import SampleSchedule
    
    ModuleNotFoundError: No module named 'bartpy.samplers'
    
    opened by jckkvs 1
  • UnboundLocalError: local variable 'mutation' referenced before assignment

    Hi,

    I just installed the bartpy locally in my machine and I am trying to run the example code ols.py. I am getting the following error, please advise.

    0%| | 0/50 [00:00<?, ?it/s]2020-01-27 13:38:26.026932 Starting burn

    Traceback (most recent call last):
      File "", line 1, in
        runfile('/Users/anita/Bayesian/bartpy-master/examples/ols.py', wdir='/Users/anita/Bayesian/bartpy-master/examples')
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile
        execfile(filename, namespace)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile
        exec(compile(f.read(), filename, 'exec'), namespace)
      File "/Users/anita/Bayesian/bartpy-master/examples/ols.py", line 26, in
        model, x, y = run(0.95, 2., 20, 5)
      File "/Users/anita/Bayesian/bartpy-master/examples/ols.py", line 15, in run
        model.fit(X, y)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/sklearnmodel.py", line 134, in fit
        self.extract = Parallel(n_jobs=self.n_jobs)(self.f_delayed_chains(X, y))
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 921, in __call__
        if self.dispatch_one_batch(iterator):
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch
        self._dispatch(tasks)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch
        job = self._backend.apply_async(batch, callback=cb)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async
        result = ImmediateResult(func)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in __init__
        self.results = batch()
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in __call__
        for func, args, kwargs in self.items]
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in
        for func, args, kwargs in self.items]
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/sklearnmodel.py", line 31, in run_chain
        model.store_acceptance_trace)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/modelsampler.py", line 43, in samples
        self.step(model, trace_logger)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/modelsampler.py", line 26, in step
        result = step()
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/schedule.py", line 48, in
        yield "Tree", lambda: self.tree_sampler.step(model, tree)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/treemutation.py", line 47, in step
        mutation = self.sample(model, tree)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/treemutation.py", line 40, in sample
        ratio = self.likihood_ratio.log_probability_ratio(model, tree, proposal)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/treemutation.py", line 80, in log_probability_ratio
        return self.log_transition_ratio(tree, mutation) + self.log_likihood_ratio(model, tree, mutation) + self.log_tree_ratio(model, tree, mutation)
      File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/likihoodratio.py", line 69, in log_likihood_ratio
        mutation: GrowMutation = mutation

    UnboundLocalError: local variable 'mutation' referenced before assignment
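
The last frame of the traceback points at an annotated assignment, `mutation: GrowMutation = mutation`. In Python, assigning to a name anywhere in a function makes that name local, so the right-hand side is read before the local has a value unless `mutation` is one of the function's parameters. The sketch below reproduces that mechanism with hypothetical names. It is not the BartPy source, and the actual fix in the library may differ.

```python
class GrowMutation:
    kind = "grow"

def broken(proposal):
    # `mutation` is assigned here, so Python treats it as a local variable;
    # reading it on the right-hand side fails because it has no value yet.
    mutation: GrowMutation = mutation
    return mutation.kind

def fixed(proposal):
    # Read from the parameter that was actually passed in
    mutation: GrowMutation = proposal
    return mutation.kind

try:
    broken(GrowMutation())
except UnboundLocalError as err:
    print("broken:", type(err).__name__)

print("fixed:", fixed(GrowMutation()))
```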

    opened by anipin123 1
  • Support multiple chains

    This seems like a pretty easy win given how parallelizable the different chains are.

    The current solution is to follow sklearn and pymc in using joblib to handle the parallelization. A nice feature of this is that it looks relatively easy to make this solution scale into a full dask cluster approach.
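
A minimal sketch of the idea, assuming chains share no state: each chain gets its own seed and start point and is dispatched independently. The issue names joblib as the chosen tool, but this sketch swaps in the standard library's `concurrent.futures` so that it runs without extra dependencies. `run_chain` is a hypothetical toy random walk, not a BART sampler.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def run_chain(seed, n_samples=200):
    """Toy random-walk chain; a stand-in for one independent BART chain."""
    rng = random.Random(seed)   # separate generator per chain
    value = rng.uniform(-5, 5)  # different start point per seed
    samples = []
    for _ in range(n_samples):
        value = 0.5 * value + rng.gauss(0.0, 1.0)
        samples.append(value)
    return samples

def run_chains(seeds):
    # Chains share no state, so they can be dispatched independently.
    # Threads keep the sketch simple; a CPU-bound sampler would want
    # processes, which is what joblib's default backend provides.
    with ThreadPoolExecutor(max_workers=len(seeds)) as pool:
        return list(pool.map(run_chain, seeds))

chains = run_chains([0, 1, 2, 3])
print([round(mean(c), 2) for c in chains])
```

Comparing the per-chain means (or a proper convergence diagnostic) across chains started from different points gives a rough check that the chains agree.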

    opened by JakeColtman 1
  • Multiple Parallel chains

    To assess convergence to the true distribution, it is useful to run several chains from different start points. At the moment, this requires running the model multiple times in parallel.

    There are two related parts of this issue:

    1. conceptually support multiple chains in the API
    2. use multiprocessing or the like to parallelize
    opened by JakeColtman 1
  • simple debug; catboost added to requirements.txt

    1. The example "Comparison of model with sine waves" gives an error because of the log_likihood_ratio().

    2. 'catboost' may not be appropriate for requirements.txt. It is up to you.

    opened by yongsubaek 0
  • Significant Performance Improvements

    A number of architectural changes to make things run much faster. Primary changes:

    • much more aggressive caching
    • switching from masked arrays to more direct numpy logic
    opened by JakeColtman 0
  • Switch to using numpy masks in nodes rather than full copies of the covariate matrix

    The main change in this PR is to reshuffle how Data works. Rather than every split causing a deepcopy of the X and y matrices, it now only creates new masks onto a single instance of X and y. This makes the process much faster and lighter on memory

    There's also an embarrassing splodge of other changes at the same time :(

    opened by JakeColtman 0
  • Make the samplers pure

    Rather than holding state in the samplers, it's nicer to have them be stateless and pure. This also allows much easier injection of different sampling methods

    opened by JakeColtman 0
  • Difference with the alternative bartMachine

    Hello, it is quite interesting to have an implementation of BART in Python. However, when I compared this implementation with its R alternative, bartMachine, the alternative gave more promising results. If you have had time to explore it, can you explain the difference between your implementation and bartMachine's?

    Thank you in advance :)

    opened by achamma723 0
  • Binary Outcomes and Random Intercepts

    Are there plans to add dichotomous outcomes and random intercepts to this model? A related question: how would one incorporate the random intercepts into the conditional means when predicting a dichotomous outcome? Thanks!

    https://github.com/vdorie/dbarts/blob/4ed4eafc772e95d788d5a135d9f4e4728b9516ec/R/rbart.R#L5

    opened by jlevy44 0
  • Model predicting NaN

    Hi,

    Thank you for bart-py!

    My BART model is predicting NaN for some cases. Does anyone know why this happens? or how I can prevent this?

    My data has missing data but to my knowledge, BART can handle this. My data are finite.
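
Whether BartPy accepts NaN inputs is not confirmed anywhere in this thread. One way to test that explanation is to impute the missing values before fitting and see whether the NaN predictions disappear. The sketch below does a column-median imputation with the standard library only. `impute_median` is a hypothetical helper, and with pandas the equivalent would be `df.fillna(df.median())`.

```python
import math
from statistics import median

def impute_median(rows):
    """Replace NaN entries with the median of the observed values in each column."""
    columns = list(zip(*rows))
    medians = [median(v for v in col if not math.isnan(v)) for col in columns]
    return [
        [m if math.isnan(v) else v for v, m in zip(row, medians)]
        for row in rows
    ]

nan = float("nan")
X = [[1.0, 10.0], [nan, 20.0], [3.0, nan], [5.0, 40.0]]
X_filled = impute_median(X)
print(X_filled)
```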

    Thank you!

    Code: (Sorry for the lengthy data generation)

    import pandas as pd
    import numpy as np
    import random
    import bartpy
    from bartpy.sklearnmodel import SklearnModel
    import sklearn
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import KFold
    
    # simulate df with 46 features and 9000 rows
    # create binary vars and make a df
    label=np.random.randint(2, size=9000)
    df = pd.DataFrame({'label':label})
    df['a']=np.random.randint(2, size=9000)
    
    # create integers
    df['b'] = np.random.randint(low=50, high=96, size=9000)
    df['b'] = np.random.randint(low=4, high=97, size=9000)
    df['c'] = np.random.randint(low=0, high=1759.22, size=9000)
    df['d'] = np.random.randint(low=0, high=5702.2, size=9000)
    df['e'] = np.random.randint(low=0, high=7172.31, size=9000)
    
    # create numerics
    df['f'] = np.random.uniform(0, 908.56, 9000)
    df['f'] = np.random.uniform(0,908.56, 9000)
    df['g'] = np.random.uniform(0,2508.78, 9000)
    df['h'] = np.random.uniform(0,3757.56, 9000)
    df['i'] = np.random.uniform(0,560.18, 9000)
    df['j'] = np.random.uniform(0,1362.71, 9000)
    df['k'] = np.random.uniform(0,2578.26, 9000)
    df['l'] = np.random.uniform(175.07,997, 9000)
    df['m'] = np.random.uniform(992.39,3972.81, 9000)
    df['n'] = np.random.uniform(1787.24,5823.21, 9000)
    df['o'] = np.random.uniform(-56,53, 9000)
    df['p'] = np.random.uniform(-47,46, 9000)
    df['q'] = np.random.uniform(-1089.03,1546.87, 9000)
    df['r'] = np.random.uniform(-1599.14,898.79, 9000)
    df['s'] = np.random.uniform(-2871.02,5329, 9000)
    df['t'] = np.random.uniform(-4231.44,2481.55, 9000)
    df['u'] = np.random.uniform(-3435.9,5824.22, 9000)
    df['v'] = np.random.uniform(-5086.6,4548.43, 9000)
    df['w'] = np.random.uniform(-406.57,907.91, 9000)
    df['x'] = np.random.uniform(-834.82,840.27, 9000)
    df['y'] = np.random.uniform(-549.2,2506.29, 9000)
    df['z'] = np.random.uniform(-1547.2,2434.18, 9000)
    df['aa'] = np.random.uniform(-426.6,3636.17, 9000)
    df['bb'] = np.random.uniform(-2819.8,3390, 9000)
    df['cc'] = np.random.uniform(-266.75,527.81, 9000)
    df['dd'] = np.random.uniform(-778.64,527.81, 9000)
    df['ee'] = np.random.uniform(-476.09,1358.32, 9000)
    df['ff'] = np.random.uniform(-1890.91,919.3, 9000)
    df['gg'] = np.random.uniform(-1633.23,2577.01, 9000)
    df['hh'] = np.random.uniform(-2427.93,2078.78, 9000)
    df['ii'] = np.random.uniform(-339.67,518.32, 9000)
    df['jj'] = np.random.uniform(-528.07,412, 9000)
    df['kk'] = np.random.uniform(-1460.23,1610.58, 9000)
    df['ll'] = np.random.uniform(-1984.08,1127.82, 9000)
    df['mm'] = np.random.uniform(-2153.38,2402.24, 9000)
    df['nn'] = np.random.uniform(-2311.27,1809.37, 9000)
    df['oo'] = np.random.uniform(16,92, 9000)
    df['pp'] = np.random.uniform(4,24, 9000)
    df['qq'] = np.random.uniform(4,80, 9000)
    df['rr'] = np.random.uniform(0,1, 9000)
    
    # add missings to floats
    # select only numeric columns to apply the missingness to
    cols_list = df.select_dtypes('float64').columns.tolist()
            
    # randomly remove cases from the dataframe
    for col in df[cols_list]:
        df.loc[df.sample(frac=0.02).index, col] = np.nan
    
    # 70/30 train test split
    X_train, X_test, y_train, y_test = train_test_split(df.drop(['label'],axis=1), df['label'], train_size=0.7, random_state = 99)
    
    # Modelling
    model = SklearnModel(n_jobs = 30) 
    model.fit(X_train, y_train) 
    
    # Predictions
    y_predictions = model.predict(X_test)
    np.isnan(y_predictions).sum() 
    
    opened by ajoules 0
  • Feature Request: Converting to Cython

    Since the code is written in pure python, a simple step at the end to improve performance is to compile it all using Cython. It's relatively simple to set up and the speedups are great for looped code, so it might be worth looking into.

    opened by JackKenney 1