Probabilistic Gradient Boosting Machines

Olivier Sprangers

Last update: Dec 28, 2022

Overview

PGBM

Probabilistic Gradient Boosting Machines (PGBM) is a probabilistic gradient boosting framework in Python based on PyTorch/Numba, developed by Airlab in Amsterdam. It provides the following advantages over existing frameworks:

Probabilistic regression estimates instead of only point estimates. (example)
Auto-differentiation of custom loss functions. (example, example)
Native (multi-)GPU-acceleration. (example, example)
Ability to optimize probabilistic estimates after training for a set of common distributions, without retraining the model. (example)

It is aimed at users interested in solving large-scale tabular probabilistic regression problems, such as probabilistic time series forecasting. For more details, read our paper or check out the examples.

Installation

Run pip install pgbm from a terminal within a Python (virtual) environment of your choice.

Verification

Download & run an example from the examples folder to verify the installation is correct:
- Run this example to verify ability to train & predict on CPU with Torch backend.
- Run this example to verify ability to train & predict on GPU with Torch backend.
- Run this example to verify ability to train & predict on CPU with Numba backend.
Note that when training on the GPU, the custom CUDA kernel will be JIT-compiled when initializing a model. Hence, the first time you train a model on the GPU it can take a bit longer, as PGBM needs to compile the CUDA kernel.
When using the Numba-backend, several functions need to be JIT-compiled. Hence, the first time you train a model using this backend it can take a bit longer.
To run the examples some additional packages such as scikit-learn or matplotlib are required; these should be installed separately via pip or conda.

Dependencies

The core package has the following dependencies which should be installed separately (installing the core package via pip will not automatically install these dependencies).

Torch backend

CUDA Toolkit matching your PyTorch distribution (https://developer.nvidia.com/cuda-toolkit)
PyTorch >= 1.7.0, with CUDA 11.0 for GPU acceleration (https://pytorch.org/get-started/locally/). Verify that PyTorch can find a cuda device on your machine by checking whether torch.cuda.is_available() returns True after installing PyTorch.
PGBM uses a custom CUDA kernel which needs to be compiled, which may require installing a suitable compiler. Installing PyTorch and the full CUDA Toolkit should be sufficient, but open an issue if you find it still not working even after installing these dependencies.

Numba backend

Numba >= 0.53.1 (https://numba.readthedocs.io/en/stable/user/installing.html).

The Numba backend does not support differentiable loss functions and GPU training is also not supported using this backend.
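
The dependency checks above can be scripted. The following is a minimal sketch, not part of PGBM: the helper name check_pgbm_backends is hypothetical, and it only reports whether the torch and numba packages are importable and whether torch.cuda.is_available() returns True.

```python
import importlib.util


def check_pgbm_backends():
    """Report which optional backend dependencies look usable here.

    Hypothetical helper for illustration; PGBM does not ship this function.
    """
    report = {
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "numba_installed": importlib.util.find_spec("numba") is not None,
        "cuda_available": False,
    }
    if report["torch_installed"]:
        try:
            import torch

            # Same check the README recommends after installing PyTorch.
            report["cuda_available"] = bool(torch.cuda.is_available())
        except Exception:
            # A broken PyTorch install should not crash the check itself.
            report["cuda_available"] = False
    return report


if __name__ == "__main__":
    for name, ok in check_pgbm_backends().items():
        print(f"{name}: {ok}")
```

If cuda_available is False while torch_installed is True, GPU training with the Torch backend is unlikely to work until the PyTorch/CUDA installation is fixed.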

Support

See the examples folder for examples, an overview of hyperparameters and a function reference. In general, PGBM works similarly to existing gradient boosting packages such as LightGBM or xgboost (and it should be possible to use it more or less as a drop-in replacement), except that you must explicitly define a loss function and loss metric.
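
To illustrate what defining a loss function and a loss metric can look like, here is a dependency-free sketch for squared error. The signatures (yhat, y, sample_weight=None), with the objective returning a gradient and a hessian and the metric returning a scalar, mirror the reproduction script in the lognormal issue under Comments. The names mse_objective and rmse_metric are hypothetical, and plain Python lists stand in for the tensors or arrays a real backend would pass.

```python
import math


def mse_objective(yhat, y, sample_weight=None):
    """Per-sample gradient and hessian of 0.5 * (yhat - y)**2.

    sample_weight is accepted but ignored in this sketch.
    """
    gradient = [p - t for p, t in zip(yhat, y)]
    hessian = [1.0 for _ in yhat]  # second derivative is constant
    return gradient, hessian


def rmse_metric(yhat, y, sample_weight=None):
    """Root mean squared error as a single float."""
    squared_errors = [(p - t) ** 2 for p, t in zip(yhat, y)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


if __name__ == "__main__":
    grad, hess = mse_objective([3.0, 1.0], [1.0, 1.0])
    print(grad, hess, rmse_metric([3.0, 1.0], [1.0, 1.0]))
```

With the Torch backend the same logic would operate on tensors, as in the issue script; check the examples folder for the exact form each backend expects.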

In case further support is required, open an issue.

Reference

Olivier Sprangers, Sebastian Schelter, Maarten de Rijke. Probabilistic Gradient Boosting Machines for Large-Scale Probabilistic Regression. Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 21), August 14–18, 2021, Virtual Event, Singapore.

The experiments from our paper can be replicated by running the scripts in the experiments folder. Datasets are downloaded when needed in the experiments except for higgs and m5, which should be pre-downloaded and saved to the datasets folder (Higgs) and to datasets/m5 (m5).

License

This project is licensed under the terms of the Apache 2.0 license.

Acknowledgements

This project was developed by Airlab Amsterdam.

Comments

Error messages when importing PGBM
Describe the bug I have Python 3.8.10 on windows 10 machine. I installed Cuda 11.0. To install pytorch, I used this command: pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu113

Note that the pytorch installation seems to be fine since the command "torch.cuda.is_available()" returns True.

Then, I proceed with the installation of PGBM using pip.

When I run the command "from pgbm import PGBM", I get the following error messages:

Detected CUDA files, patching ldflags
Emitting ninja build file C:\Users\imarroquin\AppData\Local\torch_extensions\torch_extensions\Cache\py38_cu113\split_decision\build.ninja...
Building extension module split_decision...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\bin\nvcc --generate-dependencies-with-compile --dependency-output splitgain_kernel.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=split_decision -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\TH -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\include" -IC:\Temp\Python_3.8.10\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=compute_52 -gencode=arch=compute_52,code=sm_52 -c C:\Temp\Python_3.8.10\lib\site-packages\pgbm\splitgain_kernel.cu -o splitgain_kernel.cuda.o
FAILED: splitgain_kernel.cuda.o
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\bin\nvcc --generate-dependencies-with-compile --dependency-output splitgain_kernel.cuda.o.d -Xcudafe --diag_suppress=dll_interface_conflict_dllexport_assumed -Xcudafe --diag_suppress=dll_interface_conflict_none_assumed -Xcudafe --diag_suppress=field_without_dll_interface -Xcudafe --diag_suppress=base_class_has_different_dll_interface -Xcompiler /EHsc -Xcompiler /wd4190 -Xcompiler /wd4018 -Xcompiler /wd4275 -Xcompiler /wd4267 -Xcompiler /wd4244 -Xcompiler /wd4251 -Xcompiler /wd4819 -Xcompiler /MD -DTORCH_EXTENSION_NAME=split_decision -DTORCH_API_INCLUDE_EXTENSION_H -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\torch\csrc\api\include -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\TH -IC:\Temp\Python_3.8.10\lib\site-packages\torch\include\THC "-IC:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.0\bin\include" -IC:\Temp\Python_3.8.10\Include -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_52,code=compute_52 -gencode=arch=compute_52,code=sm_52 -c C:\Temp\Python_3.8.10\lib\site-packages\pgbm\splitgain_kernel.cu -o splitgain_kernel.cuda.o
CreateProcess failed: The system cannot find the file specified.
ninja: fatal: ReadFile: The handle is invalid.

Traceback (most recent call last):
  File "C:\Temp\Python_3.8.10\lib\site-packages\torch\utils\cpp_extension.py", line 1808, in _run_ninja_build
    subprocess.run(
  File "C:\Temp\Python_3.8.10\lib\subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Temp\Python_3.8.10\lib\site-packages\pgbm\__init__.py", line 1, in <module>
    from .pgbm import PGBM, PGBMRegressor
  File "C:\Temp\Python_3.8.10\lib\site-packages\pgbm\pgbm.py", line 41, in <module>
    load(name="split_decision",
  File "C:\Temp\Python_3.8.10\lib\site-packages\torch\utils\cpp_extension.py", line 1202, in load
    return _jit_compile(
  File "C:\Temp\Python_3.8.10\lib\site-packages\torch\utils\cpp_extension.py", line 1425, in _jit_compile
    _write_ninja_file_and_build_library(
  File "C:\Temp\Python_3.8.10\lib\site-packages\torch\utils\cpp_extension.py", line 1537, in _write_ninja_file_and_build_library
    _run_ninja_build(
  File "C:\Temp\Python_3.8.10\lib\site-packages\torch\utils\cpp_extension.py", line 1824, in _run_ninja_build
    raise RuntimeError(message) from e
RuntimeError: Error building extension 'split_decision'

To Reproduce Steps to reproduce the behavior:

1. Install pytorch as mentioned above
2. Install PGBM using the pip command
3. Open a DOS terminal, run Python followed by the command "from pgbm import PGBM"

Expected behavior No error message(s) when import PGBM package

Desktop (please complete the following information):

OS: windows 10

Browser [chrome]

Version [106]

opened by ivan-marroquin 12
How to pull the parameters (mean and standard deviation) of the distribution fitted?
Hi:

Thank you for the awesome library! I did some tests with it and have a few questions:

How to pull the parameters, such as mean and standard deviation, of the final fitted distribution for each leaf? Such information is extremely helpful when the result is presented and explained to stakeholders. Currently the model just returns some numbers sampled from the distribution but business users are likely to focus on the distribution itself.

Is there any way to dump the model's tree structure to a data frame, like get_dump() does for xgboost?

Thank you!
opened by flippercy 11
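
Regarding the first question above: this is not a PGBM API, but the sampled numbers can be summarised into an empirical mean and standard deviation per data point. The sketch below is a workaround under assumptions: summarize_samples is a hypothetical helper, and the samples are assumed to be laid out as samples[i][j], the i-th draw for data point j, matching the min(0)/max(0) reduction over the sample dimension in the lognormal issue script. It gives statistics of the predictive distribution, not per-leaf parameters.

```python
import statistics


def summarize_samples(samples):
    """Empirical mean and standard deviation per data point.

    samples[i][j] is assumed to be the i-th sampled forecast for point j.
    """
    columns = list(zip(*samples))  # one tuple of draws per data point
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.stdev(col) for col in columns]  # sample std (n - 1)
    return means, stds


if __name__ == "__main__":
    draws = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
    print(summarize_samples(draws))
```

If the samples come back as a tensor, converting them to nested lists (for example with .tolist()) or using the tensor's own mean and std along the sample dimension gives the same quantities.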
Could sample_weights and monotone_constraints be added to PGBM?

Hi:

Is it possible to add sample_weights and monotone_constraints to the fit function like what lightgbm has? It will enable the algorithm to process weighted datasets and acknowledge domain knowledge.

Thank you.

opened by flippercy 10
Error message when installing full version of PGBM

Describe the bug I have Python 3.6.5 on a windows 10 machine. I don't use anaconda environment; so I install Python packages using pip command.

I would like to install a full version of PGBM, and following the documentation I used this command:

pip install pgbm[all] --find-links https://download.pytorch.org/whl/cu102/torch_stable.html

and I get this error message:

Collecting pgbm[all]
Could not find a version that satisfies the requirement pgbm[all] (from versions: )
No matching distribution found for pgbm[all]
You are using pip version 9.0.3, however version 21.3.1 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command.

To Reproduce Run this pip installation command:

pip install pgbm[all] --find-links https://download.pytorch.org/whl/cu102/torch_stable.html

Expected behavior I was expecting to have a normal installation process, in which PGBM and all its dependencies are installed

Many thanks for your help,

Ivan

opened by ivan-marroquin 8
An error with PGBM

Hi @elephaint:

I got the following error when using the sklearn wrapper, PGBMRegressor:

~/.local/lib/python3.7/site-packages/pgbm/pgbm.py in _predict_tree(self, X, mu, variance, estimator)
    401 # Choose next node (based on breadth-first)
    402 condition = (nodes_predict >= node) * (predictions == 0)
--> 403 node = nodes_predict[condition].min()
    404 # Select current node information
    405 split_node = nodes_predict == node

RuntimeError: min(): Expected reduction dim to be specified for input.numel() == 0. Specify the reduction dim with the 'dim' argument.

Any insight? It is not due to the data because I used the same data to build a PGBM model successfully before. Is it due to some hyperparameters? I got this error when trying to do HPO for PGBM using FLAML (https://github.com/microsoft/FLAML) and the search space I used is:

'max_bin': {'domain': tune.loguniform(lower=32, upper=32767), 'init_value': 256, 'low_cost_init_value': 256},
'max_leaves': {'domain': tune.uniform(lower=16, upper=128), 'init_value': 64},
'n_estimators': {'domain': tune.uniform(lower = 50, upper = 500), 'init_value': 200, 'low_cost_init_value': 200},
'min_data_in_leaf': {'domain': tune.uniform(lower = 1, upper = 1000), 'init_value': 100, 'low_cost_init_value': 100},
'bagging_fraction': {'domain': tune.uniform(lower = 0.6, upper = 1), 'init_value': 0.7, 'low_cost_init_value': 0.7},
'feature_fraction': {'domain': tune.uniform(lower = 0.5, upper = 1), 'init_value': 0.9, 'low_cost_init_value': 0.9},
'learning_rate': {'domain': tune.loguniform(lower = 0.001, upper = 1), 'init_value': 0.1, 'low_cost_init_value': 0.1},
'min_split_gain': {'domain': tune.loguniform(lower = 0.000000000001, upper = 0.001), 'init_value': 0.00001, 'low_cost_init_value': 0.00001},

Thank you.

opened by flippercy 8
Get error metrics for each trained tree

Is your feature request related to a problem? Please describe. I think it would benefit the PGBM package to report the error metric after each trained tree. One could then plot the performance of the ensemble of trees and diagnose whether it is overfitting, underfitting or doing well on the train/validation data sets.

Describe the solution you'd like As an example, I use the xgboost regressor model (https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn). It makes it possible to follow the performance of the trained trees with the following statements:

evaluation_set= [(gral_train_inputs, gral_train_targets), (test_inputs, test_targets)]

best_trained_model.fit(X= gral_train_inputs, y= gral_train_targets, eval_metric= ['rmse', 'mae'], eval_set= evaluation_set, verbose= False)

performance= best_trained_model.evals_result()

Then, "performance" is a dictionary that contains the measured error metrics on both datasets for all trained trees.

Kind regards,

Ivan

opened by ivan-marroquin 6
Does PGBM interface easily with Feature Importance methods?

Hello!

Thank you for this wonderful model!

I was curious though, just how well does PGBM work with things like sklearn's partial dependence plots (https://scikit-learn.org/stable/modules/partial_dependence.html), feature importance graphs, etc.?

opened by wrkhard 6

Prediction Intervals for Scaled Lognormal Distribution

Describe the bug I am getting prediction intervals that do not contain the point predictions. The data I am working with is lognormally distributed and scaled down, as shown below, by a factor of 10. Am I misunderstanding something fundamental about this type of data distribution? Or should the prediction intervals contain the point predictions?

To Reproduce

import torch
from pgbm import PGBM  # the script instantiates PGBM() directly
import numpy as np
from scipy.stats import halfnorm
import matplotlib.pyplot as plt

scale = True

def objective(yhat, y, sample_weight=None):
    gradient = (yhat - y)
    hessian = torch.ones_like(yhat)
    return gradient, hessian

def rmseloss_metric(yhat, y, sample_weight=None):
    loss = (yhat - y).pow(2).mean().sqrt()
    return loss


params = {
      'min_split_gain':0,
      'min_data_in_leaf':2,
      'max_leaves':6,
      'max_bin':64,
      'learning_rate':0.1,
      'n_estimators':60,
      'verbose':2,
      'feature_fraction':1,
      'bagging_fraction':1,
      'seed':2008,
      'reg_lambda':1,
      'derivatives':'exact', 
      'distribution':'lognormal' 
}

if scale:
    s = np.random.lognormal(0.011, 0.6, 1000) / 10
else: 
    s = np.random.lognormal(0.011, 0.6, 1000)
    
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2008)
x_tmp = ((s*2 + 100)/ 2) + eps
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2007)
x_tmp_2 = (s + 5) + eps

train_data = (
    np.vstack([x_tmp, x_tmp_2]).T, 
    s 
)

model = PGBM()
model.train(
    train_data, 
    objective = objective, 
    metric = rmseloss_metric, 
    params = params
)

# test data
np.random.seed(20)
if scale:
    s_test = np.random.lognormal(0.011, 0.6, 1000) / 10
else: 
    s_test = np.random.lognormal(0.011, 0.6, 1000)
    
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2009)
x_tmp_test = ((s*2 + 100)/ 2) + eps
eps = halfnorm.rvs(loc=0.0013, scale=0.001, size=1000, random_state=2006)
x_tmp_test_2 = (s + 5.6) + eps

x_tmp_t = np.vstack([x_tmp_test, x_tmp_test_2]).T

yhat_dist_pgbm = model.predict_dist(x_tmp_t)
yhat_point_pgbm = model.predict(x_tmp_t)

plt.figure(figsize=(20,10))
plt.rcParams.update({'font.size': 10})
plt.plot(s_test, 'o', label='Actual')
plt.plot(yhat_point_pgbm.numpy(), 'ko', label='Point prediction PGBM')
plt.fill_between(np.arange(len(s_test)), 
                 yhat_dist_pgbm.min(0)[0].numpy(), 
                 yhat_dist_pgbm.max(0)[0].numpy(), 
                 color="#b9cfe7", label='Uncertainty')

plt.title('Probabilistic Predictions - PGBM')
plt.xlabel('Sample')
plt.ylabel('Prediction')
plt.legend(ncol=3);

[Two plots were attached here: the output with scale = True and the output with scale = False.]

opened by Vinnie-Palazeti 4
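
As background for the scaling question above, the sketch below shows the standard moment-matching relation between a lognormal distribution and a given mean and variance, and what dividing the data by 10 does to the fitted parameters. Whether PGBM parameterises its 'lognormal' distribution in exactly this way is an assumption; lognormal_params is a hypothetical helper and the numbers are illustrative only.

```python
import math


def lognormal_params(mean, variance):
    """Match a lognormal's (mu, sigma) to a given mean and variance.

    Uses mean = exp(mu + sigma^2 / 2) and
    variance = (exp(sigma^2) - 1) * exp(2 * mu + sigma^2).
    """
    sigma2 = math.log(1.0 + variance / mean ** 2)
    mu = math.log(mean) - 0.5 * sigma2
    return mu, math.sqrt(sigma2)


if __name__ == "__main__":
    mu, sigma = lognormal_params(1.2, 0.6)
    # Dividing the data by 10 divides the mean by 10 and the variance by 100.
    mu_scaled, sigma_scaled = lognormal_params(1.2 / 10, 0.6 / 100)
    print(mu, sigma)
    print(mu_scaled, sigma_scaled)
```

Under this relation, scaling by a constant leaves sigma unchanged and shifts mu by the log of the scale factor, so the mean of the matched distribution always lies inside its support and scaling alone should not push the point prediction outside the interval in exact arithmetic.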

Is PGBM compatible with SHAP?

Is your feature request related to a problem? Please describe. Thanks for such great package!

I noticed in the examples that there is a way to do feature importance analysis based on split gain. I was wondering if you are planning to include examples of feature importance using SHAP (https://github.com/slundberg/shap)

Describe the solution you'd like I believe that SHAP offers an interesting way to evaluate feature importance independently of split gain. The technique is based on game theory and tries to find a fair attribution across the input attributes to assess their importance.

Kind regards,

Ivan

opened by ivan-marroquin 4
Question: on the computation of mean for point/probabilistic estimates

Hi,

Thanks for making available such great package!

This is not a bug but rather a simple question. I noticed that an "initial estimate" is added to the mean. Could you explain the reason? Also, what is the advantage or difference compared to subtracting or not including the "initial estimate"?

Best regards,

Ivan

opened by ivan-marroquin 3
Why monotone_constraints is set as a parameter of fit() rather than PGBMRegressor itself?

Hi @elephaint:

I just realized that for PGBM, monotone_constraints was set as a parameter of fit() while monotone_iterations a parameter of PGBMRegressor.

Is there any reason to separate them instead of also including monotone_constraints in PGBMRegressor, as the scikit-learn wrappers of xgboost/lightgbm do? The current approach makes it difficult to use on HPO/AutoML platforms such as FLAML.

Best,

Yu Cao

opened by flippercy 3
