Bayesian Additive Regression Trees For Python

Last update: Dec 16, 2022


Overview

BartPy

Introduction

BartPy is a pure Python implementation of the Bayesian additive regression trees (BART) model of Chipman et al. [1].

Reasons to use BART

Much less parameter optimization required than gradient boosted trees (GBT)
Provides confidence intervals in addition to point estimates
Extremely flexible through use of priors and embedding in bigger models
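
As a rough illustration of the interval point above: a Bayesian model such as BART produces posterior draws of the prediction for each observation, and both point estimates and intervals can be read off those draws. The sketch below uses simulated draws in place of a fitted model; the array `posterior_draws`, its shape, and the 95% level are assumptions chosen for illustration, not BartPy output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical posterior draws: 1000 samples for each of 5 observations.
# In practice these would come from a fitted model, not a random generator.
true_means = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
posterior_draws = true_means + rng.normal(scale=0.5, size=(1000, 5))

# Point estimate: posterior mean per observation.
point_estimate = posterior_draws.mean(axis=0)

# 95% interval: 2.5th and 97.5th percentiles of the draws per observation.
lower, upper = np.percentile(posterior_draws, [2.5, 97.5], axis=0)

for mean, lo, hi in zip(point_estimate, lower, upper):
    print(f"{mean:.2f} [{lo:.2f}, {hi:.2f}]")
```

The interval width here reflects only the spread of the simulated draws; with a real model it would reflect posterior uncertainty about each prediction.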

Reasons to use the library:

Can be plugged into existing sklearn workflows
Everything is done in pure python, allowing for easy inspection of model runs
Designed to be extremely easy to modify and extend

Trade offs:

Speed - BartPy is significantly slower than other BART libraries
Memory - BartPy uses a lot of caching compared to other approaches
Instability - the library is still under construction

How to use:

There are two main APIs for BartPy:

High level sklearn API
Low level access for implementing custom conditions

If possible, it is recommended to use the sklearn API until you reach something that can't be implemented that way. The API is easier, shared with other models in the ecosystem, and allows simpler porting to other models.

Sklearn API

The high level API works as you would expect

from bartpy.sklearnmodel import SklearnModel
model = SklearnModel() # Use default parameters
model.fit(X, y) # Fit the model
predictions = model.predict() # Make predictions on the train set
out_of_sample_predictions = model.predict(X_test) # Make predictions on new data

The model object can be used in all of the standard sklearn tools, e.g. cross validation and grid search

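from sklearn.model_selection import cross_validate
from bartpy.sklearnmodel import SklearnModel
model = SklearnModel() # Use default parameters
cv_results = cross_validate(model, X, y) # Cross validate on the training data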

Extensions

BartPy offers a number of convenience extensions to base BART. The most prominent of these is using BART to predict the residuals of a base model. It is most natural to use a linear model as the base, but any sklearn compatible model can be used

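from sklearn.linear_model import LinearRegression
from bartpy.extensions.baseestimator import ResidualBART
model = ResidualBART(base_estimator=LinearRegression()) # Any sklearn compatible model can be the base
model.fit(X, y)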

A nice feature of this is that we can combine the interpretability of a linear model with the power of a trees model
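
The residual idea can be sketched without BartPy at all. In the example below, ordinary least squares is the base model and a simple binned-mean regressor stands in for BART as the residual model; that stand-in, the helper `bin_index`, and the synthetic data are all assumptions made for illustration and are not how `ResidualBART` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus a non-linear wiggle and noise.
x = rng.uniform(0.0, 10.0, size=500)
y = 2.0 * x + 3.0 * np.sin(x) + rng.normal(scale=0.3, size=500)

# Step 1: fit the linear base model by ordinary least squares.
design = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
base_prediction = design @ coef

# Step 2: fit a flexible model to the residuals of the base model.
# A binned-mean regressor stands in for BART here (illustrative assumption).
residuals = y - base_prediction
bin_edges = np.linspace(0.0, 10.0, 41)  # 40 equal-width bins


def bin_index(values):
    """Map each value to its bin, clipping so the right edge falls in the last bin."""
    return np.clip(np.digitize(values, bin_edges) - 1, 0, len(bin_edges) - 2)


bin_means = np.zeros(len(bin_edges) - 1)
train_bins = bin_index(x)
for b in range(len(bin_means)):
    in_bin = train_bins == b
    if in_bin.any():
        bin_means[b] = residuals[in_bin].mean()

# Step 3: the combined prediction is the base prediction plus the predicted residual.
combined_prediction = base_prediction + bin_means[bin_index(x)]

mse_base = np.mean((y - base_prediction) ** 2)
mse_combined = np.mean((y - combined_prediction) ** 2)
print(f"linear only: {mse_base:.3f}, linear + residual model: {mse_combined:.3f}")
```

The linear coefficients in `coef` remain directly interpretable, while the residual model picks up the structure the straight line misses.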

Lower level API

BartPy is designed to expose all of its internals, so that it can be extended and modified. In particular, using the lower level API it is possible to:

Customize the set of possible tree operations (prune and grow by default)
Control the order of sampling steps within a single Gibbs update
Extend the model to include additional sampling steps
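
To make the idea of an ordered, extensible sampling schedule concrete, here is a toy sketch in plain Python. A schedule is modelled as a generator of named steps, so reordering or adding steps means writing a different generator. The function names, step names, and string results are hypothetical and do not mirror BartPy's actual classes.

```python
from typing import Callable, Iterator, List, Tuple

Step = Tuple[str, Callable[[], str]]


def default_schedule(trees: List[str]) -> Iterator[Step]:
    """Yield the steps of one Gibbs update: each tree and its leaves, then sigma."""
    for tree in trees:
        # Bind the loop variable as a default argument so each lambda keeps its own tree.
        yield "Tree", lambda t=tree: f"mutate {t}"
        yield "Node", lambda t=tree: f"sample leaves of {t}"
    yield "Sigma", lambda: "sample sigma"


def extended_schedule(trees: List[str]) -> Iterator[Step]:
    """Reuse the default order and append a custom step at the end of each update."""
    yield from default_schedule(trees)
    yield "Custom", lambda: "sample extra parameter"


def run_update(schedule: Iterator[Step]) -> List[Tuple[str, str]]:
    """Execute every step of a single update in schedule order and log the results."""
    return [(name, step()) for name, step in schedule]


log = run_update(extended_schedule(["tree_0", "tree_1"]))
for name, result in log:
    print(f"{name}: {result}")
```

Keeping the schedule separate from the code that runs it is one way to let the order of steps change without touching the samplers themselves.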

Some care is recommended when working with these types of changes. Over time the process of changing them will become easier, but today they are somewhat complex.

If all you want to customize are things like priors and number of trees, it is much easier to use the sklearn API

Alternative libraries

References

[1] https://arxiv.org/abs/0806.3286
[2] http://www.gatsby.ucl.ac.uk/~balaji/pgbart_aistats15.pdf
[3] https://arxiv.org/ftp/arxiv/papers/1309/1309.1906.pdf
[4] https://cran.r-project.org/web/packages/BART/vignettes/computing.pdf

Comments

Cannot import bartpy.samplers
from bartpy.sklearnmodel import SklearnModel
from bartpy.featureselection import SelectNullDistributionThreshold, SelectSplitProportionThreshold
from bartpy.diagnostics.features import *

ModuleNotFoundError Traceback (most recent call last) in
----> 1 from bartpy.sklearnmodel import SklearnModel
      2 from bartpy.featureselection import SelectNullDistributionThreshold, SelectSplitProportionThreshold
      3 from bartpy.diagnostics.features import *

~/miniconda3/envs/viz/lib/python3.7/site-packages/bartpy/sklearnmodel.py in
      9 from bartpy.data import Data
     10 from bartpy.model import Model
---> 11 from bartpy.samplers.leafnode import LeafNodeSampler
     12 from bartpy.samplers.modelsampler import ModelSampler, Chain
     13 from bartpy.samplers.schedule import SampleSchedule

ModuleNotFoundError: No module named 'bartpy.samplers'

from bartpy.samplers import *

ModuleNotFoundError Traceback (most recent call last) in
----> 1 from bartpy.samplers import *

ModuleNotFoundError: No module named 'bartpy.samplers'
opened by dsvolk 6

No Module named bartpy.samplers

Unable to import SklearnModel:

import bartpy
from bartpy.sklearnmodel import SklearnModel

ModuleNotFoundError                       Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_1232/2717476998.py in <module>
----> 1 from bartpy.sklearnmodel import SklearnModel

~\Anaconda3\envs\env\lib\site-packages\bartpy\sklearnmodel.py in <module>
      9 from bartpy.data import Data
     10 from bartpy.model import Model
---> 11 from bartpy.samplers.leafnode import LeafNodeSampler
     12 from bartpy.samplers.modelsampler import ModelSampler, Chain
     13 from bartpy.samplers.schedule import SampleSchedule

ModuleNotFoundError: No module named 'bartpy.samplers'

opened by jckkvs 1

UnboundLocalError: local variable 'mutation' referenced before assignment

Hi,

I just installed bartpy locally on my machine and am trying to run the example code ols.py. I get the following error; please advise.

0%| | 0/50 [00:00<?, ?it/s]2020-01-27 13:38:26.026932 Starting burn

Traceback (most recent call last):

File "", line 1, in runfile('/Users/anita/Bayesian/bartpy-master/examples/ols.py', wdir='/Users/anita/Bayesian/bartpy-master/examples')

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 827, in runfile execfile(filename, namespace)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 110, in execfile exec(compile(f.read(), filename, 'exec'), namespace)

File "/Users/anita/Bayesian/bartpy-master/examples/ols.py", line 26, in model, x, y = run(0.95, 2., 20, 5)

File "/Users/anita/Bayesian/bartpy-master/examples/ols.py", line 15, in run model.fit(X, y)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/sklearnmodel.py", line 134, in fit self.extract = Parallel(n_jobs=self.n_jobs)(self.f_delayed_chains(X, y))

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 921, in call if self.dispatch_one_batch(iterator):

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 759, in dispatch_one_batch self._dispatch(tasks)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 716, in _dispatch job = self._backend.apply_async(batch, callback=cb)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 182, in apply_async result = ImmediateResult(func)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 549, in init self.results = batch()

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in call for func, args, kwargs in self.items]

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 225, in for func, args, kwargs in self.items]

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/sklearnmodel.py", line 31, in run_chain model.store_acceptance_trace)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/modelsampler.py", line 43, in samples self.step(model, trace_logger)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/modelsampler.py", line 26, in step result = step()

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/schedule.py", line 48, in yield "Tree", lambda: self.tree_sampler.step(model, tree)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/treemutation.py", line 47, in step mutation = self.sample(model, tree)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/treemutation.py", line 40, in sample ratio = self.likihood_ratio.log_probability_ratio(model, tree, proposal)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/treemutation.py", line 80, in log_probability_ratio return self.log_transition_ratio(tree, mutation) + self.log_likihood_ratio(model, tree, mutation) + self.log_tree_ratio(model, tree, mutation)

File "/Users/anita/opt/anaconda3/lib/python3.7/site-packages/bartpy-0.0.2-py3.7.egg/bartpy/samplers/unconstrainedtree/likihoodratio.py", line 69, in log_likihood_ratio mutation: GrowMutation = mutation

UnboundLocalError: local variable 'mutation' referenced before assignment

opened by anipin123 1
Support multiple chains

This seems like a pretty easy win given how parallelizable the different chains are.

The current solution is to follow sklearn and pymc in using joblib to handle the parallelization. A nice feature of this is that it looks relatively easy to make this solution scale into a full dask cluster approach.

opened by JakeColtman 1
Multiple Parallel chains
To assess convergence to the true distribution, it is useful to run several chains from different start points. At the moment, this requires running the model multiple times in parallel.

There are two related parts of this issue:

conceptually support multiple chains in the API

use multiprocessing or the like to parallelize
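
One common way to use several chains for a convergence check is the Gelman-Rubin statistic (R-hat), which compares between-chain and within-chain variance. The sketch below is a generic NumPy illustration run on simulated chains; the function `gelman_rubin` and the toy data are assumptions for illustration and are not part of BartPy.

```python
import numpy as np


def gelman_rubin(chains: np.ndarray) -> float:
    """Potential scale reduction factor for an array of shape (n_chains, n_samples)."""
    n_chains, n_samples = chains.shape
    chain_means = chains.mean(axis=1)
    # W: mean of the within-chain variances.
    within = chains.var(axis=1, ddof=1).mean()
    # B: between-chain variance, scaled by the chain length.
    between = n_samples * chain_means.var(ddof=1)
    # Pooled estimate of the posterior variance.
    pooled = (n_samples - 1) / n_samples * within + between / n_samples
    return float(np.sqrt(pooled / within))


rng = np.random.default_rng(0)

# Four chains sampling the same distribution: R-hat should be close to 1.
mixed = rng.normal(loc=0.0, scale=1.0, size=(4, 1000))

# Four chains stuck around different values: R-hat should be well above 1.
stuck = mixed + np.array([[0.0], [2.0], [4.0], [6.0]])

rhat_mixed = gelman_rubin(mixed)
rhat_stuck = gelman_rubin(stuck)
print(f"mixed: {rhat_mixed:.3f}, stuck: {rhat_stuck:.3f}")
```

Values near 1 suggest the chains agree with each other; values well above 1 suggest that at least one chain has not reached the same region as the others.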
opened by JakeColtman 1
simple debug; catboost added to requirements.txt
The example "Comparison of model with sine waves" raises an error in log_likihood_ratio().

catboost may not be appropriate for requirements.txt. It is up to you.
opened by yongsubaek 0
Significant Performance Improvements
A number of architectural changes to make things run much faster. Primary changes:

much more aggressive caching

switching from masked arrays to more direct numpy logic
opened by JakeColtman 0
Switch to using numpy masks in nodes rather than full copies of the covariate matrix

The main change in this PR is to reshuffle how Data works. Rather than every split causing a deepcopy of the X and y matrices, it now only creates new masks onto a single instance of X and y. This makes the process much faster and lighter on memory
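
The difference between copying data per node and masking a shared array can be sketched in a few lines of NumPy. The variable names and the split rules below are illustrative assumptions and do not reproduce BartPy's `Data` class.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)

# The root node sees every row.
root_mask = np.ones(len(X), dtype=bool)

# Split the root on feature 0 at its median: each child only stores a boolean mask.
threshold = np.median(X[:, 0])
left_mask = root_mask & (X[:, 0] <= threshold)
right_mask = root_mask & (X[:, 0] > threshold)

# Splitting a child again just combines masks; no per-node copy of X or y is stored.
left_left_mask = left_mask & (X[:, 1] <= 0.0)

# Node-level summaries are computed by indexing the shared arrays on demand.
left_mean = y[left_mask].mean()
right_mean = y[right_mask].mean()
print(left_mask.sum(), right_mask.sum(), left_mean, right_mean)
```

Each mask costs one byte per row regardless of the number of features, whereas a deep copy of `X` grows with both rows and columns.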

There's also an embarrassing splodge of other changes at the same time :(

opened by JakeColtman 0
Make the samplers pure

Rather than holding state in the samplers, it's nicer to have them be stateless and pure. This also allows much easier injection of different sampling methods
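
A minimal sketch of what "stateless and pure" can look like: a sampler is a plain function from the current state and a random number generator to a new state, so a different sampling method can be injected by passing a different function. The `State` class and the two toy samplers are hypothetical and are not BartPy's sampler classes.

```python
import random
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class State:
    """Immutable model state; hypothetical stand-in for a tree or model."""
    value: float


# A sampler is any function from (state, rng) to a new state.
Sampler = Callable[[State, random.Random], State]


def random_walk(state: State, rng: random.Random) -> State:
    """Propose a small Gaussian move; all randomness comes from the rng argument."""
    return replace(state, value=state.value + rng.gauss(0.0, 1.0))


def shrink(state: State, rng: random.Random) -> State:
    """Alternative sampler that ignores the rng and halves the value."""
    return replace(state, value=state.value / 2.0)


def run(sampler: Sampler, start: State, steps: int, seed: int) -> State:
    """Apply the injected sampler repeatedly, threading the state through."""
    rng = random.Random(seed)
    state = start
    for _ in range(steps):
        state = sampler(state, rng)
    return state


start = State(value=8.0)
print(run(random_walk, start, steps=10, seed=1))
print(run(shrink, start, steps=3, seed=1))
```

Because the samplers hold no state of their own, the same seed and starting state always give the same result, which makes runs easier to test and reproduce.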

opened by JakeColtman 0
Difference with the alternative bartMachine

Hello, it is quite interesting to have an implementation of BART in Python. However, when I compared this implementation with its R alternative, bartMachine, the alternative gave more promising results. Could you explain, if you have had time to explore it, how your implementation differs from bartMachine?

Thank you in advance :)

opened by achamma723 0
Binary Outcomes and Random Intercepts

Are there plans to add dichotomous outcomes and random intercepts to this model? Another question is how one would incorporate the random intercepts into the conditional means when predicting the dichotomous outcome. Thanks!

https://github.com/vdorie/dbarts/blob/4ed4eafc772e95d788d5a135d9f4e4728b9516ec/R/rbart.R#L5

opened by jlevy44 0

Model predicting NaN

Hi,

Thank you for bart-py!

My BART model is predicting NaN for some cases. Does anyone know why this happens? or how I can prevent this?

My data has missing data but to my knowledge, BART can handle this. My data are finite.

Thank you!

Code: (Sorry for the lengthy data generation)

import pandas as pd
import numpy as np
import random
import bartpy
from bartpy.sklearnmodel import SklearnModel
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

# simulate df with 46 features and 9000 rows
# create binary vars and make a df
label=np.random.randint(2, size=9000)
df = pd.DataFrame({'label':label})
df['a']=np.random.randint(2, size=9000)

# create integers
df['b'] = np.random.randint(low=50, high=96, size=9000)
df['b'] = np.random.randint(low=4, high=97, size=9000)
df['c'] = np.random.randint(low=0, high=1759.22, size=9000)
df['d'] = np.random.randint(low=0, high=5702.2, size=9000)
df['e'] = np.random.randint(low=0, high=7172.31, size=9000)

# create numerics
df['f'] = np.random.uniform(0, 908.56, 9000)
df['f'] = np.random.uniform(0,908.56, 9000)
df['g'] = np.random.uniform(0,2508.78, 9000)
df['h'] = np.random.uniform(0,3757.56, 9000)
df['i'] = np.random.uniform(0,560.18, 9000)
df['j'] = np.random.uniform(0,1362.71, 9000)
df['k'] = np.random.uniform(0,2578.26, 9000)
df['l'] = np.random.uniform(175.07,997, 9000)
df['m'] = np.random.uniform(992.39,3972.81, 9000)
df['n'] = np.random.uniform(1787.24,5823.21, 9000)
df['o'] = np.random.uniform(-56,53, 9000)
df['p'] = np.random.uniform(-47,46, 9000)
df['q'] = np.random.uniform(-1089.03,1546.87, 9000)
df['r'] = np.random.uniform(-1599.14,898.79, 9000)
df['s'] = np.random.uniform(-2871.02,5329, 9000)
df['t'] = np.random.uniform(-4231.44,2481.55, 9000)
df['u'] = np.random.uniform(-3435.9,5824.22, 9000)
df['v'] = np.random.uniform(-5086.6,4548.43, 9000)
df['w'] = np.random.uniform(-406.57,907.91, 9000)
df['x'] = np.random.uniform(-834.82,840.27, 9000)
df['y'] = np.random.uniform(-549.2,2506.29, 9000)
df['z'] = np.random.uniform(-1547.2,2434.18, 9000)
df['aa'] = np.random.uniform(-426.6,3636.17, 9000)
df['bb'] = np.random.uniform(-2819.8,3390, 9000)
df['cc'] = np.random.uniform(-266.75,527.81, 9000)
df['dd'] = np.random.uniform(-778.64,527.81, 9000)
df['ee'] = np.random.uniform(-476.09,1358.32, 9000)
df['ff'] = np.random.uniform(-1890.91,919.3, 9000)
df['gg'] = np.random.uniform(-1633.23,2577.01, 9000)
df['hh'] = np.random.uniform(-2427.93,2078.78, 9000)
df['ii'] = np.random.uniform(-339.67,518.32, 9000)
df['jj'] = np.random.uniform(-528.07,412, 9000)
df['kk'] = np.random.uniform(-1460.23,1610.58, 9000)
df['ll'] = np.random.uniform(-1984.08,1127.82, 9000)
df['mm'] = np.random.uniform(-2153.38,2402.24, 9000)
df['nn'] = np.random.uniform(-2311.27,1809.37, 9000)
df['oo'] = np.random.uniform(16,92, 9000)
df['pp'] = np.random.uniform(4,24, 9000)
df['qq'] = np.random.uniform(4,80, 9000)
df['rr'] = np.random.uniform(0,1, 9000)

# add missings to floats
# select only numeric columns to apply the missingness to
cols_list = df.select_dtypes('float64').columns.tolist()
        
# randomly remove cases from the dataframe
for col in df[cols_list]:
    df.loc[df.sample(frac=0.02).index, col] = np.nan

# 70/30 train test split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['label'],axis=1), df['label'], train_size=0.7, random_state = 99)

# Modelling
model = SklearnModel(n_jobs = 30) 
model.fit(X_train, y_train) 

# Predictions
y_predictions = model.predict(X_test)
np.isnan(y_predictions).sum()

opened by ajoules 0

Feature Request: Converting to Cython

Since the code is written in pure python, a simple step at the end to improve performance is to compile it all using Cython. It's relatively simple to set up and the speedups are great for looped code, so it might be worth looking into.

opened by JackKenney 1

Owner

GitHub https://jakecoltman.github.io/bartpy/
