Python implementation of the rulefit algorithm

Christoph Molnar

Last update: Jan 2, 2023

Related tags

Machine Learning rulefit

Overview

RuleFit

Implementation of a rule-based prediction algorithm based on the RuleFit algorithm from Friedman and Popescu (PDF).

The algorithm predicts an output vector y from an input matrix X. In the first step, a tree ensemble is generated with gradient boosting. The trees are then used to form rules: the path to each node in each tree forms one rule. A rule is a binary indicator of whether an observation falls into a given node, which depends on the input features used in the splits along that path. The ensemble of rules, together with the original input features, is then fed into an L1-regularized linear model (the Lasso), which estimates the effect of each rule on the target while shrinking many of those effects to exactly zero.
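To make the rule-forming step concrete, here is a minimal sketch in plain Python. The toy tree, the dict layout, and the helper names `extract_rules` and `rule_applies` are illustrative assumptions, not the data structures used by this package. Each rule is the conjunction of split conditions on the path to one node.

```python
# Hypothetical toy tree: an internal node has "feature", "threshold",
# "left" and "right"; a leaf is an empty dict.
toy_tree = {
    "feature": 0, "threshold": 5.0,
    "left": {"feature": 1, "threshold": 2.0, "left": {}, "right": {}},
    "right": {},
}


def extract_rules(node, path=()):
    """Return one rule per non-root node: the conditions on the path to it."""
    rules = [list(path)] if path else []
    if "feature" in node:  # internal node: descend into both children
        feature, threshold = node["feature"], node["threshold"]
        rules += extract_rules(node["left"], path + ((feature, "<=", threshold),))
        rules += extract_rules(node["right"], path + ((feature, ">", threshold),))
    return rules


def rule_applies(rule, row):
    """Return 1 if the observation satisfies every condition, else 0."""
    return int(all(
        row[feature] <= threshold if op == "<=" else row[feature] > threshold
        for feature, op, threshold in rule
    ))


rules = extract_rules(toy_tree)
# Binary rule features for one observation with x0 = 3.0 and x1 = 4.0.
rule_features = [rule_applies(rule, [3.0, 4.0]) for rule in rules]
```

In the full algorithm, these binary columns would be appended to the original features before fitting the Lasso.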

You can use rulefit to predict a numeric response (categorical not yet implemented). The input has to be a NumPy matrix with only numeric values.

Installation

The latest version can be installed from the master branch using pip:

pip install git+https://github.com/christophM/rulefit.git

Another option is to clone the repository and install using python setup.py install or python setup.py develop.

Usage

Train your model:

import numpy as np
import pandas as pd

from rulefit import RuleFit

boston_data = pd.read_csv("boston.csv", index_col=0)

y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.values  # DataFrame.as_matrix() was removed in pandas 1.0

rf = RuleFit()
rf.fit(X, y, feature_names=features)

If you want to have influence on the tree generator you can pass the generator as argument:

from sklearn.ensemble import GradientBoostingRegressor
gb = GradientBoostingRegressor(n_estimators=500, max_depth=10, learning_rate=0.01)
rf = RuleFit(gb)

rf.fit(X, y, feature_names=features)

Predict

rf.predict(X)

Inspect rules:

rules = rf.get_rules()

rules = rules[rules.coef != 0].sort_values("support", ascending=False)

print(rules)

Notes

In contrast to the original paper, the generated trees are always fitted with the same maximum depth. In the original implementation, the maximum depth of each tree is drawn from a distribution each time.
This implementation is in progress. If you find a bug, don't hesitate to contact me.

Changelog

All notable changes to this project will be documented here.

[v0.3] - IN PROGRESS

Set default of exclude_zero_coef to False in get_rules()
Syntax fix (Issue 21)

[v0.2] - 2017-11-24

Introduces classification for RuleFit
Adds scaling of variables (Friedscale)
Allows random size trees for creating rules

[v0.1] - 2016-06-18

Start changelog and versions

Comments

Added `tol`, `max_iter` parameters to allow fixing convergence issues.

Adding max_iter and tol parameters for LassoCV and LogisticRegressionCV. Sometimes the solver doesn't converge, and these parameters need to be increased and decreased, respectively.

It's actually a very small change, sorry for the multiple lines change - it's just VSCode that is doing some code sanitisation (removing trailing spaces). Also, test_fried_scale is broken and I had to comment that out to pass the tests.

Thanks for making this - it's useful in my use case :)

opened by alzmcr 5

rulefit.py - SyntaxError: invalid syntax, line 105

Against the latest commit: 646d8ee

Mar 26 14:45:43 ubuntu-xenial celery[27022]:   File "/app/venv/local/lib/python2.7/site-packages/rulefit/__init__.py", line 1, in <module>
Mar 26 14:45:43 ubuntu-xenial celery[27022]:     from .rulefit import RuleCondition, Rule, RuleEnsemble, RuleFit, FriedScale
Mar 26 14:45:43 ubuntu-xenial celery[27022]:   File "/app/venv/local/lib/python2.7/site-packages/rulefit/rulefit.py", line 105
Mar 26 14:45:43 ubuntu-xenial celery[27022]:     self.scale_multipliers=scale_multipliers
Mar 26 14:45:43 ubuntu-xenial celery[27022]:        ^
Mar 26 14:45:43 ubuntu-xenial celery[27022]: SyntaxError: invalid syntax

opened by ray-grointel 5

Upgrade with all FP2004 features and binary classification functionality
Christoph,

This has quite a lot of changes and additional features to make it more like the original paper, and the interface more like Friedman's R implementation (http://statweb.stanford.edu/~jhf/r-rulefit/RuleFit_help.html). I don't mind if you don't want to accept this pull request as this branch serves my purposes, I won't be offended :) Thanks again for the great code it really sped up my development.

Changes:

Added: use of Friedman standardisation on linear variables (Winsorised and scaled by 0.4/stdev)

Added: use of Friedman randomisation of number of terminal nodes using exponential distribution

Fixed: use of a set (i.e. unordered) for rules sometimes caused coefficients to be associated with the wrong rules! Rules are now stored as a list (i.e. ordered)

Improved: sped up prediction by not evaluating rules with zero coefficients

Added: Max rules parameter like Friedman

Added: Invisible use of BoostingRegressor/Classifier (created according to constructor parameters, like Friedman's R implementation)

Fixed: now only creates rules at terminal (leaf nodes) not branch nodes.

Added: Now has 'regress' and 'classify' modes.

Added: 'Classify' mode uses LogisticRegressionCV with L1 regularisation penalty

Added: Added model_type parameter to allow rules/linear terms or both like R version.
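The randomisation of tree sizes mentioned in the list above can be sketched as follows. This assumes the form t = 2 + floor(γ), with γ drawn from an exponential distribution whose mean is the target average size minus 2, as I understand the Friedman and Popescu paper. The function name and parameters are hypothetical and not taken from this pull request.

```python
import math
import random


def sample_terminal_nodes(mean_size, rng):
    """Draw a number of terminal nodes t = 2 + floor(gamma).

    gamma is exponential with mean (mean_size - 2), so every tree has at
    least two leaves. Assumed form, not copied from the rulefit source.
    """
    if mean_size <= 2:
        return 2
    # random.expovariate takes the rate, i.e. 1 / mean.
    gamma = rng.expovariate(1.0 / (mean_size - 2))
    return 2 + math.floor(gamma)


rng = random.Random(0)  # seeded for reproducibility
sizes = [sample_terminal_nodes(4, rng) for _ in range(1000)]
```

Most sampled trees are small, with occasional larger ones, which is the point of drawing sizes from a skewed distribution instead of fixing them.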

Issues:

Classification is only checked for binary (two class) at the moment.

LogisticRegressionCV is not generating very sparse rulesets even though I've specified L1 penalty... not sure why. Possibly hyperparameters and CV parameters need some tuning.

Cheers,

Chris
opened by chriswbartley 4

predict() does not work in rules-only mode

Reproduction:

import numpy as np
import pandas as pd

from rulefit import RuleFit

boston_data = pd.read_csv("boston.csv", index_col=0)

y = boston_data.medv.values
X = boston_data.drop("medv", axis=1)
features = X.columns
X = X.as_matrix()

rf = RuleFit(model_type='r')
rf.fit(X, y, feature_names=features)

rf.predict(X)

Produces:

IndexError: index 1717 is out of bounds for axis 0 with size 1717

I investigated the issue. It is caused by array indexing that works in linear and linear+rules mode, but not in rules-only ('r') mode. Will send a PR shortly.

opened by benoitparis 2

Bugfixes and change to default behavior of get_rules
I fixed two bugs I ran into:

If one of the feature columns is constant, its standard deviation is zero, and the Friedman scaling fails with a division-by-zero error. I added a small constant to prevent this division by zero.

If Cs is passed to RuleFit when instantiating the object, it won't be passed properly to the LogisticRegression subroutine -- it should be self.Cs, instead of Cs.
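The first fix can be illustrated with a small sketch in plain Python. The function name, the trimming quantile of 0.025, the value of `eps`, and the crude quantile rule are all assumptions made for illustration; they are not taken from rulefit.py.

```python
import statistics


def friedman_scale(column, trim_quantile=0.025, eps=1e-12):
    """Winsorise a column, then scale it by 0.4 / (std + eps).

    trim_quantile and eps are illustrative defaults, not values taken
    from the rulefit source.
    """
    ordered = sorted(column)
    last = len(ordered) - 1
    # Crude empirical quantiles: pick the nearest lower order statistic.
    lower = ordered[int(trim_quantile * last)]
    upper = ordered[int((1.0 - trim_quantile) * last)]
    clipped = [min(max(value, lower), upper) for value in column]
    # Without eps, a constant column has std == 0 and the division fails.
    scale = 0.4 / (statistics.pstdev(clipped) + eps)
    return [value * scale for value in clipped]


constant = friedman_scale([3.0] * 10)  # no ZeroDivisionError
varying = friedman_scale([float(i) for i in range(100)])
```

With the small constant, a constant column no longer raises an error, although its scaled values carry no information. A non-constant column ends up with a standard deviation of roughly 0.4.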

I also ran into quirky behavior with get_rules versus transform. When transforming, I would get out a matrix with 116 columns, corresponding to 116 transformed features. When inspecting the rules with the output of get_rules, the total number of rules would only be 115. This was pretty frustrating, but I tracked down the source of the issue to be that when exclude_zero_coef was set to True, one of the rules was being eliminated. I think that the behavior between get_rules and transform should be identical -- either the variables with zero coefficient are eliminated from both, or neither. So this change at least makes the two consistent.
opened by dchristle 2
Requested changes made

Christoph, I finally got to do all those changes. I have done everything as requested with the exception of adding the 'Cs' parameter into fit(). I put that in the constructor for RuleFit to match all the sklearn standard. Also, FYI LassoCV uses alphas and n_alphas so I had to convert Cs to alphas=1/Cs and n_alphas as needed.
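The conversion from Cs to the LassoCV arguments might look roughly like this. The helper name `cs_to_lasso_kwargs` is hypothetical, and the handling of an integer Cs is my reading of "n_alphas as needed" rather than the code in the pull request.

```python
def cs_to_lasso_kwargs(Cs):
    """Translate a LogisticRegressionCV-style Cs into LassoCV keyword args.

    Hypothetical helper: an integer Cs is read as a grid size, and a
    sequence is read as explicit inverse regularisation strengths.
    """
    if Cs is None:
        return {}  # let LassoCV use its own defaults
    if isinstance(Cs, int):
        return {"n_alphas": Cs}
    # C is an inverse regularisation strength, so alpha = 1 / C.
    return {"alphas": [1.0 / c for c in Cs]}


grid_kwargs = cs_to_lasso_kwargs(10)
explicit_kwargs = cs_to_lasso_kwargs([0.5, 2.0])
```

The resulting dict could then be unpacked into the LassoCV constructor, for example `LassoCV(**explicit_kwargs)`.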

Hopefully I haven't missed anything.

Cheers, Chris

opened by chriswbartley 2
Classification option needs Logistic Regression (not LassoCV?)

Hi Christoph, Thanks for this code! I'm curious that although it allows Classification based tree generators, when the coefficients are calculated everything goes through LassoCV - but doesn't this use straight SSE (sum of squared error) loss? For binary classification purposes (or OVA multiclass) you'd want L1 regularised Logistic Regression wouldn't you (ie with log loss)? (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)? Happy to be corrected if I've not got it right. Friedman's 2005 paper seems a bit vague on this... Cheers, Chris

opened by chriswbartley 2
Update installation instructions

With the new github.com security guidelines the current installation method does not work.

Fix : Update the README.md with new installation command pip install git+https://github.com/christophM/rulefit.git

opened by anilkumarpanda 1
Add get_feature_importance method to RuleFit class

Hello! I've taken much interest in this topic and would like to contribute to this repo.

This method uses the rule set from the get_rules method in the RuleFit class and the submitted features to find the importance of the features either globally or over a subregion.

opened by caseywhorton 1
Rule importance

Added global importance (eq 28 and 29) to get_rules() ; #6 Added local importance (eq 30 and 31) to get_rules(); #7 Added subregion local importance (eq 32) to get_rules(). #7 (Note: Single local importance is treated as a subregion with only one point.)
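For orientation, here is a sketch of the rule-importance measures as I understand them from the Friedman and Popescu paper: global importance is |coefficient| · sqrt(support · (1 − support)), and local importance at a point is |coefficient| · |rule value − support|. The function names are hypothetical and are not the ones added to get_rules().

```python
import math


def global_rule_importance(coef, support):
    """|coef| * sqrt(support * (1 - support)); support is the fraction of
    training observations for which the rule fires."""
    return abs(coef) * math.sqrt(support * (1.0 - support))


def local_rule_importance(coef, support, rule_value):
    """|coef| * |rule_value - support|, where rule_value is 0 or 1 for the
    observation being explained."""
    return abs(coef) * abs(rule_value - support)


example_global = global_rule_importance(2.0, 0.5)
example_local = local_rule_importance(2.0, 0.25, 1)
```

Importance over a subregion can then be taken as the average of the local importances of the points in that region, which matches the note that a single point is treated as a subregion of size one.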

opened by mcasx 1
import issue for rulefit

Hi Christoph,

While importing rulefit module, I am getting below error. Can you please suggest?

ImportError Traceback (most recent call last) in () ----> 1 import rulefit

/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rulefit/init.py in () ----> 1 from rulefit import RuleCondition, Rule, RuleEnsemble, RuleFit 2 3 all = ["rulefit"]

ImportError: cannot import name 'RuleCondition'

opened by anupmandvariya 1
changed set to Orderset to fix reproducibility issue, added Order-set…

Replaced 'set' with 'OrderedSet' to fix a reproducibility issue. A set is unordered and therefore causes reproducibility issues when the script is run multiple times. Replacing it with OrderedSet (pip install ordered-set) solves this issue.

Added 'ordered-set>=4.1.0' in setup.py
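The ordering problem can be shown without the extra dependency. The sketch below uses `dict.fromkeys` from the standard library as a stand-in for OrderedSet; it is not what this pull request does, and the rule strings are made up.

```python
def unique_in_order(items):
    """Drop duplicates while keeping first-seen order.

    dict preserves insertion order (Python 3.7+), unlike set, whose
    iteration order for strings can change between interpreter runs.
    """
    return list(dict.fromkeys(items))


candidate_rules = ["x0 <= 5", "x1 > 2", "x0 <= 5", "x3 <= 1"]
ordered_rules = unique_in_order(candidate_rules)
```

Because the order is fixed, the i-th coefficient of the linear model always refers to the same rule across runs.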

opened by AdityaN1198 0

InvalidIndexError: (slice(None, None, None), 0)

Python 3.10 rulefit==0.3.1

Problem Unable to follow official document

Step to reproduce:

pip install rulefit
Follow the "Train your model" section, but omit .as_matrix() because DataFrame has no such method

Expected result: Be able to train model

As is:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:3600, in Index.get_loc(self, key, method, tolerance)
   3599 try:
-> 3600     return self._engine.get_loc(casted_key)
   3601 except KeyError as err:

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/_libs/index.pyx:136, in pandas._libs.index.IndexEngine.get_loc()

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/_libs/index.pyx:142, in pandas._libs.index.IndexEngine.get_loc()

TypeError: '(slice(None, None, None), 0)' is an invalid key

During handling of the above exception, another exception occurred:

InvalidIndexError                         Traceback (most recent call last)
Input In [71], in <module>
----> 1 rf.fit(X, y, feature_names=features)

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:410, in RuleFit.fit(self, X, y, feature_names)
    406     self.rule_ensemble = RuleEnsemble(tree_list = tree_list,
    407                                       feature_names=self.feature_names)
    409     ## concatenate original features and rules
--> 410     X_rules = self.rule_ensemble.transform(X)
    412 ## standardise linear variables if requested (for regression model only)
    413 if 'l' in self.model_type: 
    414 
    415     ## standard deviation and mean of winsorized features

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:277, in RuleEnsemble.transform(self, X, coefs)
    275 rule_list=list(self.rules) 
    276 if   coefs is None :
--> 277     return np.array([rule.transform(X) for rule in rule_list]).T
    278 else: # else use the coefs to filter the rules we bother to interpret
    279     res= np.array([rule_list[i_rule].transform(X) for i_rule in np.arange(len(rule_list)) if coefs[i_rule]!=0]).T

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:277, in <listcomp>(.0)
    275 rule_list=list(self.rules) 
    276 if   coefs is None :
--> 277     return np.array([rule.transform(X) for rule in rule_list]).T
    278 else: # else use the coefs to filter the rules we bother to interpret
    279     res= np.array([rule_list[i_rule].transform(X) for i_rule in np.arange(len(rule_list)) if coefs[i_rule]!=0]).T

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:155, in Rule.transform(self, X)
    144 def transform(self, X):
    145     """Transform dataset.
    146 
    147     Parameters
   (...)
    153     X_transformed: array-like matrix, shape=(n_samples, 1)
    154     """
--> 155     rule_applies = [condition.transform(X) for condition in self.conditions]
    156     return reduce(lambda x,y: x * y, rule_applies)

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:155, in <listcomp>(.0)
    144 def transform(self, X):
    145     """Transform dataset.
    146 
    147     Parameters
   (...)
    153     X_transformed: array-like matrix, shape=(n_samples, 1)
    154     """
--> 155     rule_applies = [condition.transform(X) for condition in self.conditions]
    156     return reduce(lambda x,y: x * y, rule_applies)

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/rulefit/rulefit.py:66, in RuleCondition.transform(self, X)
     55 """Transform dataset.
     56 
     57 Parameters
   (...)
     63 X_transformed: array-like matrix, shape=(n_samples, 1)
     64 """
     65 if self.operator == "<=":
---> 66     res =  1 * (X[:,self.feature_index] <= self.threshold)
     67 elif self.operator == ">":
     68     res = 1 * (X[:,self.feature_index] > self.threshold)

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/frame.py:3504, in DataFrame.__getitem__(self, key)
   3502 if self.columns.nlevels > 1:
   3503     return self._getitem_multilevel(key)
-> 3504 indexer = self.columns.get_loc(key)
   3505 if is_integer(indexer):
   3506     indexer = [indexer]

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:3607, in Index.get_loc(self, key, method, tolerance)
   3602         raise KeyError(key) from err
   3603     except TypeError:
   3604         # If we have a listlike key, _check_indexing_error will raise
   3605         #  InvalidIndexError. Otherwise we fall through and re-raise
   3606         #  the TypeError.
-> 3607         self._check_indexing_error(key)
   3608         raise
   3610 # GH#42269

File /t/pyenv/versions/py-default/lib/python3.10/site-packages/pandas/core/indexes/base.py:5609, in Index._check_indexing_error(self, key)
   5605 def _check_indexing_error(self, key):
   5606     if not is_scalar(key):
   5607         # if key is not a scalar, directly raise an error (the code below
   5608         # would convert to numpy arrays and raise later any way) - GH29926
-> 5609         raise InvalidIndexError(key)

InvalidIndexError: (slice(None, None, None), 0)

opened by elcolie 1

Getting error : unsupported operand type(s) for /: 'int' and 'RandomForestClassifier'

While running rf.fit(X, y, feature_names=features) in your github code I am getting below error,

: unsupported operand type(s) for /: 'int' and 'RandomForestClassifier'

opened by sunnytholar 3
Argument description in rulefit.py needs corrected

Hello! The description of exclude_zero_coef on line 557 in rulefit.py states that True is the default, but False looks to be the default for the argument in the function:

exclude_zero_coef: If True (default), returns only the rules with an estimated coefficient not equalt to zero.

https://github.com/christophM/rulefit/blob/b1657af4b41df59e2ae64bb1767dbaf5ff1ed7fe/rulefit/rulefit.py#L557

opened by caseywhorton 0
Why is max depth fixed?

"In contrast to the original paper, the generated trees are always fitted with the same maximum depth. In the original implementation the maximum depth of the tree are drawn from a distribution each time"

Is this just an artefact of the sklearn implementation of random forest or is there a different motivation behind it? Thanks.

opened by YovaKem 0

Owner

Christoph Molnar

Interpretable Machine Learning researcher. Author of Interpretable Machine Learning Book: https://christophm.github.io/interpretable-ml-book/

GitHub

