Python implementation of the rulefit algorithm

Overview

RuleFit

Implementation of a rule-based prediction algorithm based on the RuleFit algorithm by Friedman and Popescu.

The algorithm predicts an output vector y from an input matrix X. In the first step, a tree ensemble is generated with gradient boosting. The trees are then used to form rules: the path to each node in each tree forms one rule. A rule is a binary indicator of whether an observation falls into a given node, which depends on the input features used in the splits along that path. The ensemble of rules, together with the original input features, is then fed into an L1-regularized linear model (Lasso), which estimates the effect of each rule on the output target while shrinking many of those effects to exactly zero.
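To make the rule construction concrete, here is a minimal sketch in plain Python. It is not the package's own code: the `Condition` type, the `apply_rule` helper, and the data are hypothetical. It treats a rule as a conjunction of threshold conditions along a tree path and evaluates it as a 0/1 feature per observation.

```python
from typing import NamedTuple, Sequence


class Condition(NamedTuple):
    """One split on the path to a node (hypothetical helper type)."""
    feature_index: int
    operator: str      # "<=" or ">"
    threshold: float


def apply_rule(conditions: Sequence[Condition], row: Sequence[float]) -> int:
    """Return 1 if the row satisfies every condition of the rule, else 0."""
    for cond in conditions:
        value = row[cond.feature_index]
        if cond.operator == "<=":
            satisfied = value <= cond.threshold
        else:
            satisfied = value > cond.threshold
        if not satisfied:
            return 0
    return 1


# Made-up rule: feature 0 <= 6.5 AND feature 1 > 10.0
rule = [Condition(0, "<=", 6.5), Condition(1, ">", 10.0)]

X = [
    [5.0, 12.0],   # both conditions hold
    [7.0, 12.0],   # first condition fails
    [5.0, 9.0],    # second condition fails
]

rule_feature = [apply_rule(rule, row) for row in X]
print(rule_feature)  # [1, 0, 0]
```

Each rule contributes one such binary column. These columns, together with the original features, form the design matrix of the L1-regularized linear model.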

You can use rulefit to predict a numeric response; classification was introduced in v0.2 (see the Changelog). The input has to be a numpy array with only numeric values.

Installation

The latest version can be installed from the master branch using pip:

pip install git+https://github.com/christophM/rulefit.git

Another option is to clone the repository and install using python setup.py install or python setup.py develop.

Usage

Train your model:

import numpy as np
import pandas as pd

from rulefit import RuleFit

boston_data = pd.read_csv("boston.csv", index_col=0)

y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values  # DataFrame.as_matrix() was removed in pandas 1.0

rf = RuleFit()
rf.fit(X, y, feature_names=features)

To control the tree generator, pass it as an argument:

from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(n_estimators=500, max_depth=10, learning_rate=0.01)
rf = RuleFit(tree_generator=gb)

rf.fit(X, y, feature_names=features)

Predict:

rf.predict(X)

Inspect rules:

rules = rf.get_rules()

rules = rules[rules.coef != 0].sort_values("support", ascending=False)

print(rules)

Notes

  • In contrast to the original paper, the generated trees are always fitted with the same maximum depth. In the original implementation, the maximum depth of each tree is drawn from a distribution.
  • This implementation is in progress. If you find a bug, don't hesitate to contact me.
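The randomisation mentioned in the first note could look like the sketch below. It assumes the scheme described by Friedman and Popescu, in which tree size is expressed as a number of terminal nodes (rather than a depth) and is drawn as 2 plus the integer part of an exponential variable with mean (average tree size − 2). The names `draw_tree_size` and `mean_tree_size` are hypothetical and not part of this package.

```python
import math
import random


def draw_tree_size(mean_tree_size: float, rng: random.Random) -> int:
    """Draw a number of terminal nodes: 2 + floor(Exp with mean mean_tree_size - 2)."""
    if mean_tree_size <= 2:
        return 2
    # random.expovariate takes the rate (1 / mean) as its argument.
    gamma = rng.expovariate(1.0 / (mean_tree_size - 2))
    return 2 + math.floor(gamma)


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
sizes = [draw_tree_size(4, rng) for _ in range(1000)]
mean_size = sum(sizes) / len(sizes)
print(min(sizes), mean_size)
```

Most draws are small trees, with an occasional larger one, so the resulting rules vary in complexity.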

Changelog

All notable changes to this project will be documented here.

[v0.3] - IN PROGRESS

  • Set default of exclude_zero_coef to False in get_rules()
  • syntax fix (Issue 21)

[v0.2] - 2017-11-24

  • Introduces classification for RuleFit
  • Adds scaling of variables (Friedscale)
  • Allows random size trees for creating rules

[v0.1] - 2016-06-18

  • Start changelog and versions
Comments
  • Added `tol`, `max_iter` parameters to allow fixing convergence issues.

    This adds max_iter and tol parameters for LassoCV and LogisticRegressionCV. Sometimes the solver doesn't converge, and these parameters need to be increased and decreased, respectively.

    It's actually a very small change, sorry for the multiple lines change - it's just VSCode that is doing some code sanitisation (removing trailing spaces). Also, test_fried_scale is broken and I had to comment that out to pass the tests.

    Thanks for making this - it's useful in my use case :)

    opened by alzmcr 5
  • rulefit.py - SyntaxError: invalid syntax, line 105

    Against the latest commit: 646d8ee

    Mar 26 14:45:43 ubuntu-xenial celery[27022]:   File "/app/venv/local/lib/python2.7/site-packages/rulefit/__init__.py", line 1, in <module>
    Mar 26 14:45:43 ubuntu-xenial celery[27022]:     from .rulefit import RuleCondition, Rule, RuleEnsemble, RuleFit, FriedScale
    Mar 26 14:45:43 ubuntu-xenial celery[27022]:   File "/app/venv/local/lib/python2.7/site-packages/rulefit/rulefit.py", line 105
    Mar 26 14:45:43 ubuntu-xenial celery[27022]:     self.scale_multipliers=scale_multipliers
    Mar 26 14:45:43 ubuntu-xenial celery[27022]:        ^
    Mar 26 14:45:43 ubuntu-xenial celery[27022]: SyntaxError: invalid syntax
    
    opened by ray-grointel 5
  • Upgrade with all FP2004 features and binary classification functionality

    Christoph,

    This has quite a lot of changes and additional features to make it more like the original paper, and the interface more like Friedman's R implementation (http://statweb.stanford.edu/~jhf/r-rulefit/RuleFit_help.html). I don't mind if you don't want to accept this pull request; this branch serves my purposes, so I won't be offended :) Thanks again for the great code; it really sped up my development.

    Changes:

    • Added: use of Friedman standardisation on linear variables (Winsorised and scaled by 0.4/stdev)
    • Added: use of Friedman randomisation of number of terminal nodes using exponential distribution
    • Fixed: use of a set (i.e. unordered) for rules sometimes caused coefficients to be associated with the wrong rules! Rules are now stored as a list (i.e. ordered)
    • Improved: sped up prediction by not evaluating rules with zero coefficients
    • Added: Max rules parameter like Friedman
    • Added: Invisible use of BoostingRegressor/Classifier (created according to constructor parameters, like Friedman's R implementation)
    • Fixed: now only creates rules at terminal (leaf) nodes, not branch nodes.
    • Added: Now has 'regress' and 'classify' modes.
    • Added: 'Classify' mode uses LogisticRegressionCV with L1 regularisation penalty
    • Added: Added model_type parameter to allow rules/linear terms or both like R version.
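The Friedman standardisation named in the first item of this list (Winsorise, then scale by 0.4/stdev) can be sketched in plain Python as follows. This is an illustration under assumptions, not the code from the pull request: the helper names, the nearest-rank quantile rule, the use of the population standard deviation, and the trim quantile of 0.1 are all chosen for the example.

```python
import math
import statistics
from typing import List, Sequence


def winsorise(values: Sequence[float], trim_quantile: float) -> List[float]:
    """Clip values to the empirical [q, 1 - q] quantile range (nearest-rank)."""
    ordered = sorted(values)
    last = len(ordered) - 1
    lower = ordered[math.floor(trim_quantile * last)]
    upper = ordered[math.ceil((1 - trim_quantile) * last)]
    return [min(upper, max(lower, v)) for v in values]


def friedman_scale(values: Sequence[float], trim_quantile: float = 0.1) -> List[float]:
    """Winsorise a linear term, then multiply it by 0.4 / stdev."""
    clipped = winsorise(values, trim_quantile)
    multiplier = 0.4 / statistics.pstdev(clipped)  # assumes a non-constant column
    return [multiplier * v for v in clipped]


column = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100]  # made-up data with one outlier
clipped = winsorise(column, 0.1)
scaled = friedman_scale(column, 0.1)

print(clipped)                    # 1 is raised to 2, the outlier 100 is pulled in to 10
print(statistics.pstdev(scaled))  # close to 0.4 by construction
```

Clipping first keeps a single extreme value from dominating the standard deviation, and the common scale puts linear terms on a footing comparable to the binary rule columns.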

    Issues:

    • Classification is only checked for binary (two class) at the moment.
    • LogisticRegressionCV is not generating very sparse rulesets even though I've specified L1 penalty... not sure why. Possibly hyperparameters and CV parameters need some tuning.

    Cheers,

    Chris

    opened by chriswbartley 4
  • predict() does not work in rules-only mode

    Reproduction:

    import numpy as np
    import pandas as pd
    
    from rulefit import RuleFit
    
    boston_data = pd.read_csv("boston.csv", index_col=0)
    
    y = boston_data.medv.values
    X = boston_data.drop("medv", axis=1)
    features = X.columns
    X = X.as_matrix()
    
    rf = RuleFit(model_type='r')
    rf.fit(X, y, feature_names=features)
    
    rf.predict(X)
    

    Produces:

    IndexError: index 1717 is out of bounds for axis 0 with size 1717

    I investigated the issue. The array indexing works in linear and linear+rules modes, but not in rules-only mode ('r'). I will send a PR shortly.

    opened by benoitparis 2
  • Bugfixes and change to default behavior of get_rules

    I fixed two bugs I ran into:

    1. If one of the feature columns is constant, its standard deviation is zero, and the Friedman scaling fails with a division-by-zero error. I added a small constant to prevent this.

    2. If Cs is passed to RuleFit when instantiating the object, it won't be passed properly to the LogisticRegression subroutine -- it should be self.Cs, instead of Cs.
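The first fix can be illustrated with a short sketch. The helper name `safe_scale_multiplier` and the value of `eps` are assumptions made for this example; the actual constant used in the pull request is not stated here.

```python
import math
import statistics
from typing import Sequence


def safe_scale_multiplier(values: Sequence[float], eps: float = 1e-12) -> float:
    """Return 0.4 / (stdev + eps); eps keeps the result finite for constant columns."""
    return 0.4 / (statistics.pstdev(values) + eps)


constant_column = [3.0, 3.0, 3.0, 3.0]   # standard deviation is exactly zero
varying_column = [1.0, 2.0, 3.0, 4.0]

constant_multiplier = safe_scale_multiplier(constant_column)
varying_multiplier = safe_scale_multiplier(varying_column)

print(math.isfinite(constant_multiplier))  # True
print(varying_multiplier)
```

For a non-constant column the tiny `eps` changes the multiplier only negligibly, so the scaling of ordinary features is effectively unaffected.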

    I also ran into quirky behavior with get_rules versus transform. When transforming, I would get out a matrix with 116 columns, corresponding to 116 transformed features. When inspecting the rules with the output of get_rules, the total number of rules would only be 115. This was pretty frustrating, but I tracked down the source of the issue to be that when exclude_zero_coef was set to True, one of the rules was being eliminated. I think that the behavior between get_rules and transform should be identical -- either the variables with zero coefficient are eliminated from both, or neither. So this change at least makes the two consistent.

    opened by dchristle 2
  • Requested changes made

    Christoph, I finally got to do all those changes. I have done everything as requested, with the exception of adding the 'Cs' parameter to fit(). I put that in the constructor for RuleFit to match the sklearn standard. Also, FYI, LassoCV uses alphas and n_alphas, so I had to convert Cs to alphas=1/Cs and n_alphas as needed.
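One way to read the conversion described above is sketched below. The helper `cs_to_lasso_args` is hypothetical, and the rule "an integer Cs becomes n_alphas, a list of Cs becomes alphas=1/Cs" is an assumption about what "as needed" means; only the keyword names `alphas` and `n_alphas` come from the comment itself.

```python
from typing import Dict, Optional, Sequence, Union


def cs_to_lasso_args(Cs: Optional[Union[int, Sequence[float]]]) -> Dict[str, object]:
    """Translate a LogisticRegressionCV-style Cs value into LassoCV-style keywords."""
    if Cs is None:
        return {}                      # keep the estimator's defaults
    if isinstance(Cs, int):
        return {"n_alphas": Cs}        # a count of values to try
    # C is an inverse regularisation strength, so alpha = 1 / C.
    return {"alphas": [1.0 / c for c in Cs]}


print(cs_to_lasso_args(None))              # {}
print(cs_to_lasso_args(10))                # {'n_alphas': 10}
print(cs_to_lasso_args([0.1, 1.0, 10.0]))  # {'alphas': [10.0, 1.0, 0.1]}
```

The inversion matters because a large C means weak regularisation, whereas a large alpha means strong regularisation.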

    Hopefully I haven't missed anything.

    Cheers, Chris

    opened by chriswbartley 2
  • Classification option needs Logistic Regression (not LassoCV?)

    Hi Christoph, Thanks for this code! I'm curious that although it allows Classification based tree generators, when the coefficients are calculated everything goes through LassoCV - but doesn't this use straight SSE (sum of squared error) loss? For binary classification purposes (or OVA multiclass) you'd want L1 regularised Logistic Regression wouldn't you (ie with log loss)? (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)? Happy to be corrected if I've not got it right. Friedman's 2005 paper seems a bit vague on this... Cheers, Chris
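The difference between the two losses raised in this question can be seen on a single observation. The sketch below is illustrative only and uses made-up numbers: it compares squared error on a raw linear score with log loss on the same score passed through a sigmoid.

```python
import math


def squared_error(y: int, score: float) -> float:
    """Squared-error loss on the raw linear score."""
    return (y - score) ** 2


def log_loss(y: int, score: float) -> float:
    """Logistic (log) loss, with the score mapped to a probability by a sigmoid."""
    p = 1.0 / (1.0 + math.exp(-score))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


y = 1          # true class
score = 5.0    # confidently correct linear score

print(squared_error(y, score))  # 16.0
print(log_loss(y, score))       # small, because the prediction is confidently right
```

Squared error penalises the score for being far from 1 even though the class is predicted correctly, while log loss only grows when the prediction is confidently wrong.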

    opened by chriswbartley 2
  • Update installation instructions

    With the new github.com security guidelines the current installation method does not work.

    Fix : Update the README.md with new installation command pip install git+https://github.com/christophM/rulefit.git

    opened by anilkumarpanda 1
  • Add get_feature_importance method to RuleFit class

    Hello! I've taken much interest in this topic and would like to contribute to this repo.

    This method uses the rule set from the get_rules method in the RuleFit class and the submitted features to find the importance of the features either globally or over a subregion.

    opened by caseywhorton 1
  • Rule importance

    • Added global importance (eq 28 and 29) to get_rules() (#6)
    • Added local importance (eq 30 and 31) to get_rules() (#7)
    • Added subregion local importance (eq 32) to get_rules() (#7)

    Note: a single local importance is treated as a subregion with only one point.
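For orientation, the rule importance measures referred to above can be sketched as follows. The formulas are written from the Friedman and Popescu paper as best understood here (global: |coef| · sqrt(support · (1 − support)); local: |coef| · |rule value − support|; subregion: the mean of local importances over the region), and the function names are hypothetical rather than taken from the pull request.

```python
import math
from typing import Sequence


def global_rule_importance(coef: float, support: float) -> float:
    """|coef| times the standard deviation of a 0/1 rule with the given support."""
    return abs(coef) * math.sqrt(support * (1.0 - support))


def local_rule_importance(coef: float, rule_value: int, support: float) -> float:
    """Importance of a rule at one point, where rule_value is 0 or 1."""
    return abs(coef) * abs(rule_value - support)


def subregion_rule_importance(coef: float, rule_values: Sequence[int], support: float) -> float:
    """Average local importance over the points of a subregion."""
    local = [local_rule_importance(coef, r, support) for r in rule_values]
    return sum(local) / len(local)


print(global_rule_importance(2.0, 0.5))             # 1.0
print(local_rule_importance(2.0, 1, 0.25))          # 1.5
print(subregion_rule_importance(2.0, [1, 0], 0.25)) # 1.0
```

With a single point in `rule_values`, the subregion measure reduces to the local one, which matches the note about single local importance.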

    opened by mcasx 1
  • import issue for rulefit

    Hi Christoph,

    While importing rulefit module, I am getting below error. Can you please suggest?


    ImportError Traceback (most recent call last) in ()
    ----> 1 import rulefit

    /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rulefit/__init__.py in ()
    ----> 1 from rulefit import RuleCondition, Rule, RuleEnsemble, RuleFit
          2
          3 __all__ = ["rulefit"]

    ImportError: cannot import name 'RuleCondition'

    opened by anupmandvariya 1
  • changed set to Orderset to fix reproducibility issue, added Order-set…

    Replaced 'set' with 'OrderedSet' to fix a reproducibility issue. A set is unordered and therefore causes reproducibility issues when the script is run multiple times. Replacing it with OrderedSet (pip install ordered-set) solves this issue.

    Added 'ordered-set>=4.1.0' in setup.py
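The underlying requirement is that rules keep a stable order, so that coefficients matched by position always refer to the same rule. The sketch below shows that idea with the standard library only; it uses `dict.fromkeys` as a stand-in for illustration and is not the `OrderedSet` approach of the pull request. The rule strings and coefficients are made up.

```python
# Rules as they might be collected from several trees, with one duplicate.
collected_rules = ["x0 <= 1.5", "x1 > 3.0", "x0 <= 1.5", "x2 <= 0.2"]

# dict keys preserve insertion order (Python 3.7+), so this removes
# duplicates while keeping the first-seen order on every run.
unique_rules = list(dict.fromkeys(collected_rules))

# Coefficients are matched to rules purely by position.
coefficients = [0.7, 0.0, -1.2]
paired = list(zip(unique_rules, coefficients))

print(unique_rules)  # ['x0 <= 1.5', 'x1 > 3.0', 'x2 <= 0.2']
print(paired)
```

A plain `set` gives no such ordering guarantee for strings across interpreter runs, which is why positional pairing with coefficients can silently go wrong.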

    opened by AdityaN1198 0
  • InvalidIndexError: (slice(None, None, None), 0)

    Python 3.10 rulefit==0.3.1

    Problem: unable to follow the official documentation.

    Steps to reproduce:

    1. pip install rulefit
    2. Follow the "Train your model" section, but omit .as_matrix because DataFrame has no such method

    Expected result: Be able to train model

    As is:

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:3600, in Index.get_loc(self, key, method, tolerance)
       3599 try:
    -> 3600     return self._engine.get_loc(casted_key)
       3601 except KeyError as err:
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/_libs/index.pyx:142, in pandas._libs.index.IndexEngine.get_loc()
    
    TypeError: '(slice(None, None, None), 0)' is an invalid key
    
    During handling of the above exception, another exception occurred:
    
    InvalidIndexError                         Traceback (most recent call last)
    Input In [71], in <module>
    ----> 1 rf.fit(X, y, feature_names=features)
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:410, in RuleFit.fit(self, X, y, feature_names)
        406     self.rule_ensemble = RuleEnsemble(tree_list = tree_list,
        407                                       feature_names=self.feature_names)
        409     ## concatenate original features and rules
    --> 410     X_rules = self.rule_ensemble.transform(X)
        412 ## standardise linear variables if requested (for regression model only)
        413 if 'l' in self.model_type: 
        414 
        415     ## standard deviation and mean of winsorized features
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:277, in RuleEnsemble.transform(self, X, coefs)
        275 rule_list=list(self.rules) 
        276 if   coefs is None :
    --> 277     return np.array([rule.transform(X) for rule in rule_list]).T
        278 else: # else use the coefs to filter the rules we bother to interpret
        279     res= np.array([rule_list[i_rule].transform(X) for i_rule in np.arange(len(rule_list)) if coefs[i_rule]!=0]).T
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:277, in <listcomp>(.0)
        275 rule_list=list(self.rules) 
        276 if   coefs is None :
    --> 277     return np.array([rule.transform(X) for rule in rule_list]).T
        278 else: # else use the coefs to filter the rules we bother to interpret
        279     res= np.array([rule_list[i_rule].transform(X) for i_rule in np.arange(len(rule_list)) if coefs[i_rule]!=0]).T
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:155, in Rule.transform(self, X)
        144 def transform(self, X):
        145     """Transform dataset.
        146 
        147     Parameters
       (...)
        153     X_transformed: array-like matrix, shape=(n_samples, 1)
        154     """
    --> 155     rule_applies = [condition.transform(X) for condition in self.conditions]
        156     return reduce(lambda x,y: x * y, rule_applies)
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:155, in <listcomp>(.0)
        144 def transform(self, X):
        145     """Transform dataset.
        146 
        147     Parameters
       (...)
        153     X_transformed: array-like matrix, shape=(n_samples, 1)
        154     """
    --> 155     rule_applies = [condition.transform(X) for condition in self.conditions]
        156     return reduce(lambda x,y: x * y, rule_applies)
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:66, in RuleCondition.transform(self, X)
         55 """Transform dataset.
         56 
         57 Parameters
       (...)
         63 X_transformed: array-like matrix, shape=(n_samples, 1)
         64 """
         65 if self.operator == "<=":
    ---> 66     res =  1 * (X[:,self.feature_index] <= self.threshold)
         67 elif self.operator == ">":
         68     res = 1 * (X[:,self.feature_index] > self.threshold)
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/frame.py:3504, in DataFrame.__getitem__(self, key)
       3502 if self.columns.nlevels > 1:
       3503     return self._getitem_multilevel(key)
    -> 3504 indexer = self.columns.get_loc(key)
       3505 if is_integer(indexer):
       3506     indexer = [indexer]
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:3607, in Index.get_loc(self, key, method, tolerance)
       3602         raise KeyError(key) from err
       3603     except TypeError:
       3604         # If we have a listlike key, _check_indexing_error will raise
       3605         #  InvalidIndexError. Otherwise we fall through and re-raise
       3606         #  the TypeError.
    -> 3607         self._check_indexing_error(key)
       3608         raise
       3610 # GH#42269
    
    File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:5609, in Index._check_indexing_error(self, key)
       5605 def _check_indexing_error(self, key):
       5606     if not is_scalar(key):
       5607         # if key is not a scalar, directly raise an error (the code below
       5608         # would convert to numpy arrays and raise later any way) - GH29926
    -> 5609         raise InvalidIndexError(key)
    
    InvalidIndexError: (slice(None, None, None), 0)
    
    opened by elcolie 1
  • Getting error : unsupported operand type(s) for /: 'int' and 'RandomForestClassifier'

    While running rf.fit(X, y, feature_names=features) from your GitHub code, I am getting the error below:

    : unsupported operand type(s) for /: 'int' and 'RandomForestClassifier'

    opened by sunnytholar 3
  • Argument description in rulefit.py needs corrected

    Hello! The description of exclude_zero_coef on line 557 in rulefit.py states that True is the default, but False looks to be the default for the argument in the function:

    exclude_zero_coef: If True (default), returns only the rules with an estimated coefficient not equalt to zero.

    https://github.com/christophM/rulefit/blob/b1657af4b41df59e2ae64bb1767dbaf5ff1ed7fe/rulefit/rulefit.py#L557

    opened by caseywhorton 0
  • Why is max depth fixed?

    "In contrast to the original paper, the generated trees are always fitted with the same maximum depth. In the original implementation the maximum depth of the tree are drawn from a distribution each time"

    Is this just an artefact of the sklearn implementation of random forest or is there a different motivation behind it? Thanks.

    opened by YovaKem 0
Owner
Christoph Molnar
Interpretable Machine Learning researcher. Author of Interpretable Machine Learning Book: https://christophm.github.io/interpretable-ml-book/