UpliftML: A Python Package for Scalable Uplift Modeling

Overview

UpliftML is a Python package for scalable unconstrained and constrained uplift modeling from experimental data. To accommodate working with big data, the package uses PySpark and H2O models as base learners for the uplift models. Evaluation functions expect a PySpark dataframe as input.

Uplift modeling is a family of techniques for estimating the Conditional Average Treatment Effect (CATE) from experimental or observational data using machine learning. In particular, we are interested in estimating the causal effect of a treatment T on the outcome Y of an individual characterized by features X. In experimental data with binary treatments and binary outcomes, this is equivalent to estimating Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x).
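As a rough illustration of the definition above, the sketch below estimates uplift per segment from a toy randomized experiment by subtracting the control conversion rate from the treated conversion rate. The record layout and the `uplift_by_segment` helper are assumptions made for this example and are not part of UpliftML.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Estimate Pr(Y=1 | T=1, X=x) - Pr(Y=1 | T=0, X=x) for each segment x.

    `records` is an iterable of (segment, treatment, outcome) tuples with
    binary treatment and outcome. Hypothetical helper, for illustration only.
    """
    # counts[segment][treatment] = [number of users, number of positive outcomes]
    counts = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for segment, treatment, outcome in records:
        counts[segment][treatment][0] += 1
        counts[segment][treatment][1] += outcome

    uplift = {}
    for segment, by_arm in counts.items():
        (n_c, y_c), (n_t, y_t) = by_arm[0], by_arm[1]
        if n_c == 0 or n_t == 0:
            continue  # uplift is undefined without both arms
        uplift[segment] = y_t / n_t - y_c / n_c
    return uplift


# Toy data: (segment, treatment, outcome)
records = [
    ("new", 1, 1), ("new", 1, 1), ("new", 1, 0), ("new", 1, 0),
    ("new", 0, 1), ("new", 0, 0), ("new", 0, 0), ("new", 0, 0),
    ("returning", 1, 1), ("returning", 1, 0),
    ("returning", 0, 1), ("returning", 0, 0),
]
segment_uplift = uplift_by_segment(records)
```

With real data the segments would be replaced by a model over the full feature vector X, which is what the metalearners listed below do.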

In many practical use cases the goal is to select which users to target in order to maximize the overall uplift without exceeding a specified budget or ROI constraint. In those cases, estimating uplift alone is not sufficient to make optimal decisions and we need to take into account the costs and monetary benefit incurred by the treatment.
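One simple way to picture the constrained problem is a greedy heuristic: rank users by estimated uplift per unit of cost and target them until the budget runs out. The sketch below is only an illustration under that assumption. It is not the Retrospective Estimation technique or any other method from UpliftML, and the `select_within_budget` helper and its input format are hypothetical.

```python
def select_within_budget(users, budget):
    """Greedily pick users with the highest estimated uplift per unit cost.

    `users` is a list of (user_id, estimated_uplift, cost) tuples with cost > 0.
    Returns the chosen user ids and the total cost spent.
    """
    # Only users with positive estimated uplift are worth treating.
    candidates = [u for u in users if u[1] > 0]
    candidates.sort(key=lambda u: u[1] / u[2], reverse=True)

    chosen, spent = [], 0.0
    for user_id, _uplift, cost in candidates:
        if spent + cost <= budget:
            chosen.append(user_id)
            spent += cost
    return chosen, spent


users = [
    ("a", 0.10, 1.0),   # uplift per cost: 0.10
    ("b", 0.30, 2.0),   # uplift per cost: 0.15
    ("c", 0.05, 1.0),   # uplift per cost: 0.05
    ("d", -0.02, 1.0),  # negative uplift, never targeted
]
chosen, spent = select_within_budget(users, budget=3.0)
```

A greedy ranking like this is a heuristic and is not guaranteed to be optimal in general, but it shows why uplift estimates alone are not enough: the ordering changes once costs enter the picture.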

Uplift modeling is an emerging tool for various personalization applications. Example use cases include personalizing and optimizing marketing campaigns, personalized pricing in e-commerce, and personalized clinical treatment.

The UpliftML library includes PySpark/H2O implementations for the following:

  • 6 metalearner approaches for uplift modeling: T-learner[1], S-learner[1], X-learner[1], R-learner[2], class variable transformation[3], transformed outcome approach[4].
  • The Retrospective Estimation[5] technique for uplift modeling under ROI constraints.
  • Uplift and iROI-based evaluation and plotting functions with bootstrapped confidence intervals. Currently implemented: ATE, ROI, iROI, CATE per category/quantile, CATE lift, Qini/AUUC curves[6], Qini/AUUC score[6], cumulative iROI curves.
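To give a feel for what a Qini curve measures, the sketch below computes cumulative Qini values in plain Python: users are ranked by predicted uplift, and at each cutoff the treated positives are compared with the control positives rescaled to the treated group size. The `qini_curve` helper and the tuple layout are assumptions for this example; the definitions and normalization used by UpliftML's own evaluation functions may differ.

```python
def qini_curve(rows):
    """Return cumulative Qini values after each row, ranked by predicted uplift.

    `rows` is a list of (predicted_uplift, treatment, outcome) tuples with
    binary treatment and outcome. Hypothetical helper, for illustration only.
    """
    ranked = sorted(rows, key=lambda r: r[0], reverse=True)
    n_t = n_c = y_t = y_c = 0
    curve = []
    for _score, treatment, outcome in ranked:
        if treatment == 1:
            n_t += 1
            y_t += outcome
        else:
            n_c += 1
            y_c += outcome
        # Rescale control positives to the size of the treated group so far.
        scaled_control = y_c * n_t / n_c if n_c > 0 else 0.0
        curve.append(y_t - scaled_control)
    return curve


rows = [
    (0.9, 1, 1), (0.8, 0, 0), (0.6, 1, 1),
    (0.4, 0, 1), (0.2, 1, 0), (0.1, 0, 1),
]
curve = qini_curve(rows)
```

A curve that rises early and falls back toward the overall effect at the end suggests that the model ranks the most responsive users first.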

For detailed information about the package, read the UpliftML documentation.

Installation

Install the latest release from PyPI:

$ pip install upliftml

Quick Start

from upliftml.models.pyspark import TLearnerEstimator
from upliftml.evaluation import estimate_and_plot_qini
from upliftml.datasets import simulate_randomized_trial
from pyspark.ml.classification import LogisticRegression
from pyspark.sql import SparkSession


# Reuse the active Spark session or start a new one
spark = SparkSession.builder.getOrCreate()

# Read/generate the dataset and convert it to Spark if needed
df_pd = simulate_randomized_trial(n=2000, p=6, sigma=1.0, binary_outcome=True)
df_spark = spark.createDataFrame(df_pd)

# Split the data into train, validation, and test sets
df_train, df_val, df_test = df_spark.randomSplit([0.5, 0.25, 0.25])

# Preprocess the datasets (for implementation of get_features_vector, see the full example notebook)
num_features = [col for col in df_spark.columns if col.startswith('feature')]
cat_features = []
df_train_assembled = get_features_vector(df_train, num_features, cat_features)
df_val_assembled = get_features_vector(df_val, num_features, cat_features)
df_test_assembled = get_features_vector(df_test, num_features, cat_features)

# Build a two-model estimator
model = TLearnerEstimator(base_model_class=LogisticRegression,
                          base_model_params={'maxIter': 15},
                          predictors_colname='features',
                          target_colname='outcome',
                          treatment_colname='treatment',
                          treatment_value=1,
                          control_value=0)
model.fit(df_train_assembled, df_val_assembled)

# Apply the model to test data
df_test_eval = model.predict(df_test_assembled)

# Evaluate performance on the test set
qini_values, ax = estimate_and_plot_qini(df_test_eval)

For complete examples with more estimators and evaluation functions, see the demo notebooks in the examples folder.

Contributing

If interested in contributing to the package, get started by reading our contributor guidelines.

License

The project is licensed under the Apache 2.0 License.

Citation

If you use UpliftML, please cite it as follows:

Irene Teinemaa, Javier Albert, Nam Pham. UpliftML: A Python Package for Scalable Uplift Modeling. https://github.com/bookingcom/upliftml, 2021. Version 0.0.1.

@misc{upliftml,
  author={Irene Teinemaa and Javier Albert and Nam Pham},
  title={{UpliftML}: {A Python Package for Scalable Uplift Modeling}},
  howpublished={https://github.com/bookingcom/upliftml},
  note={Version 0.0.1},
  year={2021}
}

Resources

Documentation:

Tutorials and blog posts:

Related packages:

  • CausalML: a Python package for uplift modeling and causal inference with machine learning
  • EconML: a Python package for estimating heterogeneous treatment effects from observational data via machine learning

References

  1. Sören R. Künzel, Jasjeet S. Sekhon, Peter J. Bickel, and Bin Yu. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 2019.
  2. Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2017.
  3. Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.
  4. Susan Athey and Guido W. Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050(5), 2015.
  5. Dmitri Goldenberg, Javier Albert, Lucas Bernardi, Pablo Estevez Castillo. Free Lunch! Retrospective Uplift Modeling for Dynamic Promotions Recommendation within ROI Constraints. In Fourteenth ACM Conference on Recommender Systems (pp. 486-491), 2020.
  6. Nicholas J Radcliffe and Patrick D Surry. Real-world uplift modelling with significance based uplift trees. White Paper tr-2011-1, Stochastic Solutions, 2011.
Comments
  • Added h2o uplift random forest estimator

    I added H2O's Uplift Random Forest estimator as a new method. This tree-based algorithm was proposed by Rzepakowski & Jaroszewicz (2012) [1] and comes with three different uplift-specific splitting criteria: Kullback-Leibler, Euclidean Distance, and Chi-squared divergence.

    I used H2O's H2OUpliftRandomForestEstimator and

    1. Implemented an UpliftRandomForestEstimator class
    2. Added tests for the UpliftRandomForestEstimator
    3. Added documentation for the UpliftRandomForestEstimator

    [1] Rzepakowski, P., & Jaroszewicz, S. (2012). Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems, 32(2), 303-327.

    opened by jroessler 1
  • Added stationary feature selection methods including tests

    I added stationary feature selection methods, that is:

    • Divergence Filter (Filter method) [1]
    • Net Information Value (Filter method) [2]
    • Uplift Curve (Filter method) [3]
    • Permutation with Uplift Random Forest (Wrapper method)

    [1] Zhao, Z., Zhang, Y., Harinen, T., & Yung, M. (2022). Feature Selection Methods for Uplift Modeling and Heterogeneous Treatment Effect. In IFIP International Conference on Artificial Intelligence Applications and Innovations (pp. 217-230). Springer, Cham.
    [2] Larsen, K. (2015). Data Exploration with Weight of Evidence and Information Value in R. Retrieved from https://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence.
    [3] Hu, J. (2022). Customer feature selection from high-dimensional bank direct marketing data for uplift modeling. Journal of Marketing Analytics, 1-12.

    opened by jroessler 0
  • Add `.gitignore` entries for common editors

    I noticed Vim temporary .swp files are being tracked by git. I would add:

    • .swp (Vim working buffer)
    • *~ (common backup file for Linux-based editors)
    • some other common editor artifact I'm missing
    opened by ilirmaci 0
  • Omit the creation of "cvt_label" column from "predict" method of the CVT Model

    This removes the line that creates the "cvt_label" column in the "predict" method of the CVT Model. The column is not used in this method (I guess it was copied by mistake from the "fit" method), and creating it will eventually prevent the method from being used for the inference stage of an actual machine learning model.

    opened by shakednave-vi 0
  • Development plan for multi-treatment

    First of all, thank you for developing this awesome package! PySpark/H2O-based algorithms are like gospel for large real-world marketing datasets. I liked Uber's CausalML a few years ago, but its implementation is in Pandas/sklearn, so we were not able to use it on massive datasets. This package solved that issue perfectly!

    Recently I was reading the tutorial on uplift modeling posted by your team for the Web Conference 2021 (WWW'21) (https://booking.ai/uplift-modeling-f9759e3fb51e). At the end of the deck I read about Uber's recent paper on multiple treatments with cost optimization (https://arxiv.org/abs/1908.05372), which extends the existing R-/X-learners to a multi-treatment setting, with or without control.

    If I'm not mistaken, the upliftml package can currently handle only the one-treatment, one-control scenario for meta-learners. Does your team have any plans to extend the meta-learners to multi-treatment settings, with or without a control group? Thanks!

    opened by zhiruiwang 0
  • Simplify evaluation function signatures with keyword arguments

    The signatures of our evaluation functions are largely overlapping, in particular when it comes to bootstrapping-related parameters. When changing one of these parameters (e.g. adding another bootstrapping-related param), the current implementation requires one to change the signatures of all of the evaluation functions manually. To reduce this duplication work, we could lump the bootstrapping-related parameters together as bootstrap_kwargs, e.g. eval_function(arg1, arg2, **bootstrap_kwargs).

    refactoring 
    opened by irhete 0
  • Add option to compute conf. interval from std err

    Add use_std_error argument to all procedures that compute a confidence interval through bootstrapping. Default to True because this method is more efficient and more familiar.

    Add explanation and retouch literature reference in docstring

    Solution to #1

    Missing: tests do not cover both computation methods as they are written now. Writing tests for every metric would take a lot of duplication.

    opened by ilirmaci 0
  • CI calculation based on std errors from bootstraps

    The current implementation estimates the confidence interval directly from bootstraps. Another option would be to use the bootstrapped values to get the standard error and then derive the CI from normal (or Student's t) critical values. The former has worse statistical properties than the latter (you need more iterations for it to converge), but the latter expects the distribution of the estimator to be symmetric.

    We could implement the std error based solution, set it as default, but keep an option for the user to switch back to the direct CI estimation.

    Thanks @ilirmaci for pointing it out!

    enhancement 
    opened by irhete 0
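The two confidence interval methods discussed in the last two items, and the idea of forwarding bootstrapping options as keyword arguments, can be sketched with the standard library alone. The `use_std_error` and `bootstrap_kwargs` names come from the discussions above; the `bootstrap_ci` and `estimate_mean_outcome` functions, their other parameters, and their defaults are hypothetical and do not reflect UpliftML's actual signatures.

```python
import random
import statistics


def bootstrap_ci(values, statistic=statistics.mean, n_bootstraps=1000,
                 confidence=0.95, use_std_error=True, seed=0):
    """Return a (lower, upper) bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        statistic(rng.choices(values, k=n)) for _ in range(n_bootstraps)
    )
    alpha = 1.0 - confidence
    if use_std_error:
        # Normal approximation: point estimate +/- z * bootstrap standard error.
        point = statistic(values)
        std_error = statistics.stdev(estimates)
        z = statistics.NormalDist().inv_cdf(1.0 - alpha / 2.0)
        return point - z * std_error, point + z * std_error
    # Percentile method: read the interval off the bootstrap distribution.
    lower_idx = int((alpha / 2.0) * (n_bootstraps - 1))
    upper_idx = int((1.0 - alpha / 2.0) * (n_bootstraps - 1))
    return estimates[lower_idx], estimates[upper_idx]


def estimate_mean_outcome(outcomes, **bootstrap_kwargs):
    """Example metric that forwards all bootstrapping options unchanged."""
    return statistics.mean(outcomes), bootstrap_ci(outcomes, **bootstrap_kwargs)


outcomes = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
point, ci_std_err = estimate_mean_outcome(outcomes, use_std_error=True)
_, ci_percentile = estimate_mean_outcome(outcomes, use_std_error=False)
```

Because the metric only passes `**bootstrap_kwargs` through, adding another bootstrapping option would require changing `bootstrap_ci` alone rather than every evaluation function.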