Highly interpretable classifiers for scikit-learn, producing easily understood decision rules instead of black-box models

Overview

Highly interpretable, sklearn-compatible classifier based on decision rules

This is a scikit-learn compatible wrapper for the Bayesian Rule List classifier developed by Letham et al., 2015 (see Letham's original code), extended by a minimum description length-based discretizer (Fayyad & Irani, 1993) for continuous data, and by an approach to subsample large datasets for better performance.

It produces rule lists, which make trained classifiers easily interpretable to human experts, and is competitive with state-of-the-art classifiers such as random forests or SVMs.

For example, an easily understood Rule List model of the well-known Titanic dataset:

IF male AND adult THEN survival probability: 21% (19% - 23%)
ELSE IF 3rd class THEN survival probability: 44% (38% - 51%)
ELSE IF 1st class THEN survival probability: 96% (92% - 99%)
ELSE survival probability: 88% (82% - 94%)
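
A rule list like this is read top to bottom: each passenger receives the probability of the first rule whose conditions all hold, and the final ELSE catches everyone else. The following is a minimal sketch of that evaluation order. The data structure and the `predict_survival` helper are illustrative assumptions and not part of RuleListClassifier; the probabilities are copied from the example above.

```python
# Illustrative only: a hand-written encoding of the Titanic rule list above.
# Each entry is (set of conditions that must all hold, survival probability).
TITANIC_RULES = [
    ({"male", "adult"}, 0.21),
    ({"3rd class"}, 0.44),
    ({"1st class"}, 0.96),
]
DEFAULT_PROBABILITY = 0.88  # the final ELSE branch


def predict_survival(attributes):
    """Return the probability attached to the first rule that matches."""
    for conditions, probability in TITANIC_RULES:
        if conditions <= attributes:  # every condition of the rule is present
            return probability
    return DEFAULT_PROBABILITY


print(predict_survival({"male", "adult", "1st class"}))    # 0.21
print(predict_survival({"female", "adult", "3rd class"}))  # 0.44
print(predict_survival({"female", "child", "2nd class"}))  # 0.88
```

Note that an adult male in 1st class gets 21%, not 96%, because the first rule already matched. Rule order is part of the model.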

Letham et al.'s approach only works on discrete data. However, this approach can still be used on continuous data after discretization. The RuleListClassifier class also includes a discretizer that can deal with continuous data (using Fayyad & Irani's minimum description length principle criterion, based on an implementation by navicto).
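
To give an idea of what a minimum description length criterion does, here is a small sketch of a single Fayyad & Irani style split: it picks the cut point with the highest information gain and accepts it only if the gain exceeds the MDLP threshold. This is not the code used inside RuleListClassifier; the function names are made up for illustration, and the full method applies the same test recursively to each resulting interval.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_mdlp_cut(values, labels):
    """Return the highest-gain cut point if MDLP accepts it, else None."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    all_labels = [label for _, label in pairs]
    base = entropy(all_labels)
    best = None  # (gain, cut, left_labels, right_labels)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left, right = all_labels[:i], all_labels[i:]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if best is None or gain > best[0]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (gain, cut, left, right)
    if best is None:
        return None
    gain, cut, left, right = best
    # MDLP acceptance test: gain > (log2(n - 1) + delta) / n
    k, k1, k2 = len(set(all_labels)), len(set(left)), len(set(right))
    delta = log2(3 ** k - 2) - (k * base - k1 * entropy(left) - k2 * entropy(right))
    return cut if gain > (log2(n - 1) + delta) / n else None


# A clean separation is accepted; alternating labels yield no cut.
print(best_mdlp_cut([1, 2, 3, 4, 10, 11, 12, 13], [0, 0, 0, 0, 1, 1, 1, 1]))  # 7.0
print(best_mdlp_cut([1, 2, 3, 4, 5, 6, 7, 8], [0, 1, 0, 1, 0, 1, 0, 1]))      # None
```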

The inference procedure is slow on large datasets. If you have more than a few thousand data points and only numeric data, try the included BigDataRuleListClassifier(training_subset=0.1). It first determines a small subset of the training data that is most critical in defining a decision boundary (the data points that are hardest to classify) and learns a rule list only on this subset. To choose the estimator that judges which points are hardest to classify, pass any sklearn-compatible estimator in the subset_estimator parameter (see examples/diabetes_bigdata_demo.py).
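
One way to picture the subsampling step is sketched below. It assumes that "hardest to classify" means "predicted probability closest to 0.5", which is an assumption of this sketch and not necessarily how BigDataRuleListClassifier ranks points. The `hardest_subset` helper is hypothetical; the probabilities would come from whatever estimator is used to judge difficulty.

```python
def hardest_subset(probabilities, training_subset=0.1):
    """Return indices of the points whose predicted probability is closest to 0.5.

    `probabilities` holds one class-1 probability per training point.
    Treating distance from 0.5 as "hardness" is an assumption of this sketch.
    """
    n_keep = max(1, int(round(training_subset * len(probabilities))))
    ranked = sorted(range(len(probabilities)),
                    key=lambda i: abs(probabilities[i] - 0.5))
    return sorted(ranked[:n_keep])


probs = [0.02, 0.48, 0.97, 0.55, 0.10, 0.91, 0.40, 0.99, 0.65, 0.05]
print(hardest_subset(probs, training_subset=0.3))  # [1, 3, 6]
```

The rule list would then be learned only on the rows at the returned indices.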

Usage

The project requires pyFIM, scikit-learn, and pandas to run.

The included RuleListClassifier works as a scikit-learn estimator, with a model.fit(X,y) method which takes training data X (numpy array or pandas DataFrame; continuous, categorical or mixed data) and labels y.

The learned rules of a trained model can be displayed simply by casting the object as a string, e.g. print model, or by using the model.tostring(decimals=1) method and optionally specifying the rounding precision.
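
The effect of the rounding precision can be illustrated with a small stand-alone formatter. This is only a sketch that imitates the layout of the rule lists shown in this README; `format_rule_list` is a hypothetical helper, not the implementation behind model.tostring.

```python
def format_rule_list(rules, default, label, decimals=1):
    """Render (condition, mean, low, high) rules plus a default as text."""
    def pct(p):
        return "{:.{d}f}%".format(100 * p, d=decimals)

    lines = []
    for i, (condition, mean, low, high) in enumerate(rules):
        prefix = "IF" if i == 0 else "ELSE IF"
        lines.append("{} {} THEN probability of {}: {} ({}-{})".format(
            prefix, condition, label, pct(mean), pct(low), pct(high)))
    mean, low, high = default
    lines.append("ELSE probability of {}: {} ({}-{})".format(
        label, pct(mean), pct(low), pct(high)))
    return "\n".join(lines)


text = format_rule_list(
    rules=[("ALBUMIN : -inf_to_2.65", 0.875, 0.590, 0.996)],
    default=(0.182, 0.054, 0.363),
    label="survival",
    decimals=1,
)
print(text)
```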

Numerical data in X is automatically discretized. To prevent discretization (e.g. to protect columns containing categorical data represented as integers), pass the list of protected column names in the fit method, e.g. model.fit(X,y,undiscretized_features=['CAT_COLUMN_NAME']) (entries in undiscretized columns will be converted to strings and used as categorical values - see examples/hepatitis_mixeddata_demo.py).
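
The conversion described above can be pictured as follows. This is a plain-Python sketch of the idea, not the library's code; `protect_columns` is a hypothetical helper and the column names AGE and SEX are made up for the example.

```python
def protect_columns(rows, column_names, undiscretized_features):
    """Convert protected columns to strings; leave other columns untouched."""
    protected = {column_names.index(name) for name in undiscretized_features}
    return [
        [str(value) if j in protected else value for j, value in enumerate(row)]
        for row in rows
    ]


rows = [[63, 1, 2.4], [41, 2, 3.9]]
columns = ["AGE", "SEX", "ALBUMIN"]
print(protect_columns(rows, columns, undiscretized_features=["SEX"]))
# [[63, '1', 2.4], [41, '2', 3.9]]
```

After this step the integer codes in SEX are strings, so they are treated as categories, while AGE and ALBUMIN remain numeric and are still candidates for discretization.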

Usage example:

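# Note: this example targets Python 2 and an older scikit-learn release.
# sklearn.cross_validation was removed in scikit-learn 0.20 (use sklearn.model_selection),
# and fetch_mldata was removed in 0.22 (use sklearn.datasets.fetch_openml).
from RuleListClassifier import *
from sklearn.datasets.mldata import fetch_mldata
from sklearn.cross_validation import train_test_split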
from sklearn.ensemble import RandomForestClassifier

feature_labels = ["#Pregnant","Glucose concentration test","Blood pressure(mmHg)","Triceps skin fold thickness(mm)","2-Hour serum insulin (mu U/ml)","Body mass index","Diabetes pedigree function","Age (years)"]
    
data = fetch_mldata("diabetes") # get dataset
y = (data.target+1)/2 # target labels (0 or 1)
Xtrain, Xtest, ytrain, ytest = train_test_split(data.data, y) # split

# train classifier (allow more iterations for better accuracy; use BigDataRuleListClassifier for large datasets)
model = RuleListClassifier(max_iter=10000, class1label="diabetes", verbose=False)
model.fit(Xtrain, ytrain, feature_labels=feature_labels)

print "RuleListClassifier Accuracy:", model.score(Xtest, ytest), "Learned interpretable model:\n", model
print "RandomForestClassifier Accuracy:", RandomForestClassifier().fit(Xtrain, ytrain).score(Xtest, ytest)
"""
**Output:**
RuleListClassifier Accuracy: 0.776041666667 Learned interpretable model:
Trained RuleListClassifier for detecting diabetes
==================================================
IF Glucose concentration test : 157.5_to_inf THEN probability of diabetes: 81.1% (72.5%-72.5%)
ELSE IF Body mass index : -inf_to_26.3499995 THEN probability of diabetes: 5.2% (1.9%-1.9%)
ELSE IF Glucose concentration test : -inf_to_103.5 THEN probability of diabetes: 14.4% (8.8%-8.8%)
ELSE IF Age (years) : 27.5_to_inf THEN probability of diabetes: 59.6% (51.8%-51.8%)
ELSE IF Glucose concentration test : 103.5_to_127.5 THEN probability of diabetes: 15.9% (8.0%-8.0%)
ELSE probability of diabetes: 44.7% (29.5%-29.5%)
=================================================

RandomForestClassifier Accuracy: 0.729166666667
"""
Comments
  • Fault in example

    You seem to print the lower limits twice in the example, for example

    IF Glucose concentration test : 157.5_to_inf THEN probability of diabetes: 81.1% (72.5%-72.5%)

    They all have this.

    opened by hjonasson 2
  • All predictions are the same

    When I run the hepatitis demo, I get a nice list of rules:

    Trained RuleListClassifier for detecting survival
    ==================================================
    IF ALBUMIN : -inf_to_2.65 THEN probability of survival: 87.5% (59.0%-99.6%)
    ELSE IF BILIRUBIN : 1.41375838926_to_inf THEN probability of survival: 36.8% (22.5%-52.5%)
    
    ELSE IF ALBUMIN : 3.85863309352_to_inf THEN probability of survival: 3.6% (0.4%-9.7%)
    ELSE probability of survival: 18.2% (5.4%-36.3%)
    =================================================
    

    Reported accuracy is 76.9% and this is correct, but all predictions are the same:

    array([[ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818],
           [ 0.81818182,  0.18181818]])
    

    Similarly, when I run on my dataset, there's a nice list of rules, but all predictions are the same (0).

    opened by zygmuntz 2
  • Code seems to have dependency on Entropy package?

    I got this

    from Entropy import entropy, cut_point_information_gain
    

    while trying to run the demo.

    Could you please point me to the right package? Thanks!

    opened by savourylie 1
  • class1label is actually the 0 class

    I think class1label actually labels the 0 class:

    from RuleListClassifier import *
    import numpy as np
    status = np.array([['dead','dead','dead','dead',
                        'alive','alive','alive','alive']]).T
    isAlive = np.array( [0, 0, 0, 0, 1, 1, 1, 1])
    model = RuleListClassifier(class1label="is_alive")
    model.fit(status, isAlive, feature_labels=["status"])
    print model
    

    Output:

    2 rules mined
    Starting mcmc chains
    Elapsed CPU time 8.980171
    Rhat for convergence: 0.999981686696
    Posterior average length: 2.48247263224
    Posterior average width: 1.0
    Trained RuleListClassifier for detecting is_alive
    ==================================================
    IF dead THEN probability of is_alive: 83.3% (47.8%-99.5%)
    ELSE probability of is_alive: 16.7% (0.5%-52.2%)
    =================================================
    

    You can also see this when looking at the diabetes example data set, and exploring the relationship between the features and labels.

    opened by kenben 1
  • reversed the meaning of class 1

    The meaning of the class 1 is reversed.

    verified with this small program:

    from RuleListClassifier import *

    labels = ['temp','size']
    X = [[100,12], [12,91], [17,92]]
    Y = [1, 0, 0]
    clf = RuleListClassifier(max_iter=50000, n_chains=3, class1label='burned', verbose=False)
    clf.fit(X, Y, feature_labels=labels)
    print clf

    prints:

    IF temp : 58.5_to_inf AND size : -inf_to_51.5 THEN probability of burned: 33.3% (1.3%-84.2%)
    ELSE IF temp : -inf_to_58.5 AND size : 51.5_to_inf THEN probability of burned: 75.0% (29.2%-99.2%)
    ELSE probability of burned: 50.0% (2.5%-97.5%)

    i.e. the meaning of the class 1 is reversed

    opened by ghosthugger 1
  • continuous and categorical features together

    When I use mixed data (a Pandas frame), I get ValueError: could not convert string to float. So I one-hot encode categoricals and I get Warning: non-categorical data. Trying to discretize. (Please convert categorical values to strings to avoid this.) Is it possible to have continuous and categorical features together in X?

    opened by zygmuntz 1
  • Issue with names of features - Can't have feature with name 'y'

    I came across an issue where if you have a feature in your training set with the name 'y', this implementation crashes with something like:

    ValueError: Wrong number of items passed 4, placement implies 1

    opened by dylan-slack 0
  • Install sklearn-expertsys package for Python2 on Windows

    I installed all the requirements, i.e. pyFIM, scikit-learn, and pandas. However, I can't install the sklearn-expertsys package for Python on Windows. There is no documentation describing how to install it. Can you please help?

    opened by giladwa1 0
  • Does this rule list package support multi-class classification?

    As the title suggests, does this package support multi-class? I note that the paper claimed that this method supports multi-class. However, in the code, I notice it seems to only support binary classification?

    opened by myaooo 0
  • Can this be made to work when size of each row is variable?

    Hey, I came across your implementation of decision lists and was thinking of giving it a try. I have an email dataset on which I want to run your code, but there is a problem: some of the emails have many To addresses. If each row of the data represents an email, the rows would not all have the same number of columns, because the number of To addresses differs between emails. Is there a way for this code to deal with that sort of input data?

    Thanks.

    opened by AvinashBukkittu 0
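
The two class1label reports above can be checked by hand. Assuming that each printed probability is the posterior mean of a Beta-Bernoulli model with a uniform Beta(1, 1) prior (an assumption of this sketch, inferred only from the fact that the numbers match), the printed values correspond to class 0 rather than class 1. The helper names below are hypothetical.

```python
from fractions import Fraction


def posterior_mean(n_class, n_other, prior=(1, 1)):
    """Posterior mean of a Beta-Bernoulli model; the uniform prior is assumed."""
    a, b = prior
    return Fraction(n_class + a, n_class + n_other + a + b)


def as_percent(fraction):
    return "{:.1f}%".format(100 * float(fraction))


# First report: the rule "dead" captures 4 points, all labelled 0.
print(as_percent(posterior_mean(n_class=0, n_other=4)))  # class 1: 16.7%
print(as_percent(posterior_mean(n_class=4, n_other=0)))  # class 0: 83.3% (printed value)

# Second report: the first rule captures 1 point labelled 1,
# the second rule captures 2 points labelled 0, the default captures none.
print(as_percent(posterior_mean(n_class=0, n_other=1)))  # class 0: 33.3% (printed value)
print(as_percent(posterior_mean(n_class=2, n_other=0)))  # class 0: 75.0% (printed value)
print(as_percent(posterior_mean(n_class=0, n_other=0)))  # no data: 50.0% (printed value)
```

Under that assumption, every value printed in both reports is the class 0 probability, which is consistent with what the reporters describe.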
Owner
Tamas Madl