UpliftML: A Python Package for Scalable Uplift Modeling

Booking.com

Last update: Dec 31, 2022

Overview

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).
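
As a minimal illustration of the formula above, the sketch below estimates the uplift for each value of a single discrete feature by differencing the empirical response rates of treated and control rows. The records and the helper name `uplift_by_segment` are invented for illustration and are not part of UpliftML. The sketch also assumes every segment contains both treated and control rows.

```python
from collections import defaultdict

# Toy experimental data: (segment x, treatment t, outcome y). Values are invented.
records = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 0),
    ("new", 0, 1), ("new", 0, 0), ("new", 0, 0), ("new", 0, 0),
    ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 1), ("loyal", 1, 0),
    ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 1), ("loyal", 0, 0),
]


def uplift_by_segment(rows):
    """Estimate Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x) for each segment x."""
    # counts[x][t] = [number of rows, number of positive outcomes]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for x, t, y in rows:
        counts[x][t][0] += 1
        counts[x][t][1] += y
    uplift = {}
    for x, by_arm in counts.items():
        rate_treated = by_arm[1][1] / by_arm[1][0]
        rate_control = by_arm[0][1] / by_arm[0][0]
        uplift[x] = rate_treated - rate_control
    return uplift


segment_uplift = uplift_by_segment(records)
print(segment_uplift)
```

With these invented counts, the "new" segment has an estimated uplift of 0.25 (0.50 treated versus 0.25 control), while the "loyal" segment has an uplift of 0.0 even though its response rate is higher. This is why targeting on uplift can differ from targeting on predicted outcome.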

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.
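
One simple way to picture the budget-constrained setting is a greedy rule that ranks users by predicted uplift per unit of cost and treats them in that order until the budget runs out. The sketch below is only a heuristic under stated assumptions: each user has a known, fixed treatment cost, and the function name `select_targets` and the data are hypothetical. It is not the method implemented in UpliftML.

```python
def select_targets(users, budget):
    """Greedily pick users by predicted uplift per unit cost within a budget."""
    # Only users with positive predicted uplift are worth treating.
    candidates = [u for u in users if u["uplift"] > 0 and u["cost"] > 0]
    candidates.sort(key=lambda u: u["uplift"] / u["cost"], reverse=True)

    chosen, spent = [], 0.0
    for user in candidates:
        # Skip any user whose cost no longer fits in the remaining budget.
        if spent + user["cost"] <= budget:
            chosen.append(user["id"])
            spent += user["cost"]
    return chosen, spent


# Invented predictions and costs for four users.
users = [
    {"id": "a", "uplift": 0.10, "cost": 5.0},
    {"id": "b", "uplift": 0.08, "cost": 1.0},
    {"id": "c", "uplift": 0.02, "cost": 2.0},
    {"id": "d", "uplift": -0.01, "cost": 1.0},
]
chosen, spent = select_targets(users, budget=6.0)
print(chosen, spent)
```

User "a" has the largest uplift, but user "b" is ranked first because it delivers more uplift per unit of cost. User "d" is never treated because its predicted uplift is negative.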

Uplift modeling is an emerging tool for various personalization applications. Example use cases include personalizing and optimizing marketing campaigns, personalized pricing in e-commerce, and personalized clinical treatment.

The UpliftML library includes PySpark/H2O implementations for the following:

6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.
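
To give a feel for what a Qini curve measures, the sketch below computes one in plain Python using a common textbook definition: rank rows by predicted uplift, then at each cutoff subtract the control responders (rescaled to the treated group size seen so far) from the treated responders. The function name `qini_curve` and the data are hypothetical, and the exact definition used by UpliftML's evaluation functions may differ.

```python
def qini_curve(rows):
    """Cumulative Qini values for rows of (predicted_uplift, treatment, outcome)."""
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    n_t = n_c = y_t = y_c = 0
    curve = [0.0]
    for _, treatment, outcome in ranked:
        if treatment == 1:
            n_t += 1
            y_t += outcome
        else:
            n_c += 1
            y_c += outcome
        # Rescale control responders to the size of the treated group so far.
        scaled_control = y_c * n_t / n_c if n_c > 0 else 0.0
        curve.append(y_t - scaled_control)
    return curve


# Invented scored test rows: (predicted uplift, treatment flag, binary outcome).
rows = [
    (0.9, 1, 1), (0.8, 0, 0), (0.7, 1, 1),
    (0.6, 0, 1), (0.2, 1, 0), (0.1, 0, 1),
]
curve = qini_curve(rows)
print(curve)
```

A model that ranks well pushes the curve up early, because treated responders are concentrated among the highest scores while control responders appear later.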

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
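from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession


# Reuse the active Spark session, or start one if none exists
spark = SparkSession.builder.getOrCreate()

# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)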

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.

Contributing

If interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under the Apache 2.0 License.

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa and Javier Albert and Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Documentation:

UpliftML documentation

Tutorials and blog posts:

Related packages:

CausalML: a Python package for uplift modeling and causal inference with machine learning
EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

[1] Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
[2] Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
[3] Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
[4] Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
[5] Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
[6] Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.

Comments

Added h2o uplift random forest estimator
I added H2O's Uplift Random Forest estimator as a new method. This tree-based algorithm was proposed by Rzepakowski & Jaroszewicz (2012) [1] and comes with three different uplift-specific splitting criteria: Kullback-Leibler, Euclidean Distance, and Chi-squared divergence.
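
For readers unfamiliar with the three criteria, the sketch below computes each divergence between the outcome distribution of the treated group and that of the control group, following the general definitions in Rzepakowski & Jaroszewicz (2012). It is a plain-Python illustration with invented probabilities and hypothetical function names. It does not show how H2O computes split gains internally.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of distribution p from distribution q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def euclidean_distance(p, q):
    """Squared Euclidean distance between two discrete distributions."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))


def chi_squared_divergence(p, q):
    """Chi-squared divergence of distribution p from distribution q."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


# Invented outcome distributions [Pr(Y=1), Pr(Y=0)] within one tree node.
treated = [0.3, 0.7]
control = [0.2, 0.8]

kl = kl_divergence(treated, control)
euclid = euclidean_distance(treated, control)
chi2 = chi_squared_divergence(treated, control)
print(kl, euclid, chi2)
```

All three measures are zero when treated and control outcomes are distributed identically and grow as the two distributions diverge, which is what makes them usable as uplift-specific splitting criteria.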

I used H2O's H2OUpliftRandomForestEstimator and

Implemented an UpliftRandomForestEstimator class

Added tests for the UpliftRandomForestEstimator

Added documentation for the UpliftRandomForestEstimator

[1] Rzepakowski, P., & Jaroszewicz, S. (2012). Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2), 303-327.
opened by jroessler 1
Added stationary feature selection methods including tests
I added stationary feature selection methods, that is:

Divergence Filter (Filter method) [1]

Net Information Value (Filter method) [2]

Uplift Curve (Filter method) [3]

Permutation with Uplift Random Forest (Wrapper method)

[1] Zhao, Z., Zhang, Y., Harinen, T., & Yung, M. (2022). Feature Selection Methods for Uplift Modeling and Heterogeneous Treatment Effect. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 217-230). Springer, Cham.

[2] Larsen, K. (2015). Data Exploration with Weight of Evidence and Information Value in R. Retrieved Date from https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence.

[3] Hu, J. (2022). Customer feature selection from high-dimensional bank direct marketing data for uplift modeling. Journal of Marketing Analytics, 1-12.
opened by jroessler 0
Add `.gitignore` entries for common editors
I noticed Vim temporary .swp files are being tracked by git. I would add:

.swp (Vim working buffer)

*~ (common backup for Linux based editor)

some other common editor artifact I'm missing
opened by ilirmaci 0
Omit the creation of "cvt_label" column from "predict" method of the CVT Model

Removing the line that creates the "cvt_label" column in the "predict" method of the CVT Model as it is not used in this method (I guess it was copied by mistake from the "fit" method) and will eventually prevent the method from being used for the inference stage of an actual machine learning model.
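
The point can be illustrated with the textbook form of the class variable transformation [3]. The sketch below is not UpliftML's code: the function names are hypothetical and it assumes a 50/50 treatment split. The transformed label depends on the observed treatment and outcome, so it is needed only to train the classifier. Scoring needs just the classifier's predicted probability.

```python
def cvt_label(treatment, outcome):
    """Training label Z: 1 if a treated unit responds or a control unit does not."""
    return outcome * treatment + (1 - outcome) * (1 - treatment)


def cvt_uplift(prob_z):
    """Uplift implied by Pr(Z=1 | x), assuming a 50/50 treatment split."""
    return 2.0 * prob_z - 1.0


# Fit time: labels are built from the observed (treatment, outcome) pairs.
training_pairs = [(1, 1), (1, 0), (0, 1), (0, 0)]
labels = [cvt_label(t, y) for t, y in training_pairs]

# Inference time: only the predicted probability for a new row is required.
uplift = cvt_uplift(0.6)
print(labels, uplift)
```

Because `cvt_uplift` takes no treatment or outcome input, a prediction step built this way can run on rows where neither is known yet.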

opened by shakednave-vi 0
Development plan for multi-treatment

First of all, thank you for developing this awesome package! PySpark/H2O-based algorithms are like gospels for real-world large marketing datasets! I liked Uber's CausalML a few years ago, but its implementation is in Pandas/sklearn, so we were not able to use it on massive datasets. This package solved that issue perfectly!

Recently I was reading the tutorial on uplift modeling posted by your team for the Web Conference 2021 (WWW’21) (https://booking.ai/uplift-modeling-f9759e3fb51e). At the end of the deck I read about Uber's recent paper on multiple treatments with cost optimization (https://arxiv.org/abs/1908.05372), which extends the existing R-/X-learners to a multi-treatment setting, with or without control.

If I'm not mistaken, the upliftml package can currently handle only the one-treatment, one-control scenario for meta-learners. Does your team have any development plan to extend the meta-learners to multi-treatment settings, with or without a control group? Thanks!

opened by zhiruiwang 0
Simplify evaluation function signatures with keyword arguments

The signatures of our evaluation functions are largely overlapping, in particular when it comes to bootstrapping-related parameters. When changing one of these parameters (e.g. adding another bootstrapping-related param), the current implementation requires one to change the signatures of all of the evaluation functions manually. To reduce this duplication work, we could lump the bootstrapping-related parameters together as bootstrap_kwargs, e.g. eval_function(arg1, arg2, **bootstrap_kwargs).
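
A rough sketch of the proposed pattern follows. All names here (`bootstrap_ci`, `estimate_mean`, `estimate_median`, and their parameters) are hypothetical stand-ins rather than UpliftML functions. The point is only that the bootstrapping parameters are declared once, and each evaluation function forwards them unchanged.

```python
import random
import statistics


def bootstrap_ci(values, statistic, n_bootstraps=200, alpha=0.05, seed=0):
    """Percentile confidence interval; the only place bootstrap params are declared."""
    rng = random.Random(seed)
    estimates = sorted(
        statistic(rng.choices(values, k=len(values))) for _ in range(n_bootstraps)
    )
    lower = estimates[int((alpha / 2) * n_bootstraps)]
    upper = estimates[min(n_bootstraps - 1, int((1 - alpha / 2) * n_bootstraps))]
    return lower, upper


def estimate_mean(values, **bootstrap_kwargs):
    """Point estimate plus CI; bootstrap options are passed through untouched."""
    return statistics.mean(values), bootstrap_ci(values, statistics.mean, **bootstrap_kwargs)


def estimate_median(values, **bootstrap_kwargs):
    """Same signature shape as estimate_mean, with no duplicated parameters."""
    return statistics.median(values), bootstrap_ci(values, statistics.median, **bootstrap_kwargs)


data = [0.12, 0.05, 0.20, 0.08, 0.15, 0.03, 0.18, 0.10]
mean_value, mean_ci = estimate_mean(data, n_bootstraps=100, seed=1)
median_value, median_ci = estimate_median(data, alpha=0.10)
print(mean_value, mean_ci, median_value, median_ci)
```

Adding a new bootstrapping option would then require a change only in `bootstrap_ci`. The trade-off is that the accepted keywords no longer appear in each evaluation function's signature, so they need to be documented in one shared place.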
refactoring

opened by irhete 0
Add option to compute conf. interval from std err

Add use_std_error argument to all procedures that compute a confidence interval through bootstrapping. Default to True because this method is more efficient and more familiar.

Add explanation and retouch literature reference in docstring

Solution to #1

Missing: tests do not cover both computation methods as they are written now. Writing tests for every metric would take a lot of duplication.

opened by ilirmaci 0
CI calculation based on std errors from bootstraps

The current implementation estimates the confidence interval directly from bootstraps. Another option would be to use bootstrapped values to get the standard error and then derive the CI from normal (or student's T) critical values. The former has worse statistical properties than the latter (you need more iterations for it to converge), but the latter expects the distribution of the estimator to be symmetric.

We could implement the std error based solution, set it as default, but keep an option for the user to switch back to the direct CI estimation.
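
The two options can be compared side by side in plain Python. This is a sketch with hypothetical function names and invented data, not the package's implementation: `percentile_ci` reads the interval directly from the sorted bootstrap estimates, while `std_error_ci` uses their standard deviation as the standard error and applies a normal critical value.

```python
import random
import statistics


def bootstrap_estimates(values, statistic, n_bootstraps=500, seed=0):
    """Recompute the statistic on resamples drawn with replacement."""
    rng = random.Random(seed)
    return [statistic(rng.choices(values, k=len(values))) for _ in range(n_bootstraps)]


def percentile_ci(estimates, alpha=0.05):
    """Direct CI: empirical quantiles of the bootstrap estimates."""
    ordered = sorted(estimates)
    lo_idx = int((alpha / 2) * len(ordered))
    hi_idx = min(len(ordered) - 1, int((1 - alpha / 2) * len(ordered)))
    return ordered[lo_idx], ordered[hi_idx]


def std_error_ci(point_estimate, estimates, alpha=0.05):
    """CI from the bootstrap standard error and a normal critical value."""
    std_error = statistics.stdev(estimates)
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    return point_estimate - z * std_error, point_estimate + z * std_error


data = [0.12, 0.05, 0.20, 0.08, 0.15, 0.03, 0.18, 0.10]
point = statistics.mean(data)
estimates = bootstrap_estimates(data, statistics.mean)

pct_lo, pct_hi = percentile_ci(estimates)
se_lo, se_hi = std_error_ci(point, estimates)
print((pct_lo, pct_hi), (se_lo, se_hi))
```

The standard-error interval is symmetric around the point estimate by construction, which is the symmetry assumption mentioned above. The percentile interval can be asymmetric but relies on the tails of the bootstrap distribution, which take more iterations to stabilize.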

Thanks @ilirmaci for pointing it out!
enhancement

opened by irhete 0

Owner

Booking.com

Open source projects and forks of projects we use internally (for better upstream collaboration)
