Library for machine learning stacking generalization.

Last update: Jul 19, 2022

Overview

stacked_generalization

Implements the machine learning *stacking technique[1]* as a handy Python library. Feature-weighted linear stacking is also available. (See https://github.com/fukatani/stacked_generalization/tree/master/stacked_generalization/example)

It also includes a simple model cache system: Joblibed Classifier and Joblibed Regressor.

Features

1) Any scikit-learn model can be used as a Stage 0 or Stage 1 model.

The stacked model itself has the same interface as scikit-learn estimators.

You can easily replace a model such as RandomForestClassifier with a stacked model in your scripts. Multi-stage stacking is also easy.

ex.

from stacked_generalization.lib.stacking import StackedClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn import datasets, metrics
iris = datasets.load_iris()

# Stage 1 model
bclf = LogisticRegression(random_state=1)

# Stage 0 models
clfs = [RandomForestClassifier(n_estimators=40, criterion = 'gini', random_state=1),
        GradientBoostingClassifier(n_estimators=25, random_state=1),
        RidgeClassifier(random_state=1)]

# same interface as scikit-learn
sl = StackedClassifier(bclf, clfs)
sl.fit(iris.data, iris.target)
score = metrics.accuracy_score(iris.target, sl.predict(iris.data))
print("Accuracy: %f" % score)

More detailed examples are here: https://github.com/fukatani/stacked_generalization/blob/master/stacked_generalization/example/cross_validation_for_iris.py

https://github.com/fukatani/stacked_generalization/blob/master/stacked_generalization/example/simple_regression.py
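To make the relationship between Stage 0 and Stage 1 concrete, here is a minimal sketch of the general stacking idea written with scikit-learn only. It is not this library's implementation: the use of cross_val_predict, 3 folds, class probabilities as blend features, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn import datasets, metrics
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Stage 0 models (illustrative choices)
stage0 = [RandomForestClassifier(n_estimators=40, random_state=1),
          GradientBoostingClassifier(n_estimators=25, random_state=1)]

# Out-of-fold class probabilities from each Stage 0 model are stacked
# side by side and become the "blend" features for Stage 1.
blend = np.hstack([cross_val_predict(clf, X, y, cv=3, method="predict_proba")
                   for clf in stage0])

# The Stage 1 model is trained on the blend features only.
stage1 = LogisticRegression(max_iter=1000, random_state=1)
stage1.fit(blend, y)

# To predict, refit each Stage 0 model on all training data, build the
# blend features for the new samples, and pass them to Stage 1.
for clf in stage0:
    clf.fit(X, y)
new_blend = np.hstack([clf.predict_proba(X) for clf in stage0])
pred = stage1.predict(new_blend)
print("Accuracy: %f" % metrics.accuracy_score(y, pred))
```

Training Stage 1 on out-of-fold predictions rather than in-sample predictions keeps it from simply learning to trust whichever Stage 0 model overfits the most.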

2) Model evaluation by out-of-bag score.

Stacking itself uses CV for stage 0. So if you cross-validate the entire stacked model, *each stage 0 model is fitted n_folds squared times.* This computational cost can be significant, so we implemented CV only for stage 1[2].

For example, when we get 3 blends (stage 0 predictions), 2 blends are used for stage 1 fitting and the remaining blend is used for testing the model. By repeating this cycle for all 3 blends and averaging the scores, we get the oob (out-of-bag) score *with only n_folds stage 0 fits per model.*

ex.

sl = StackedClassifier(bclf, clfs, oob_score_flag=True)
sl.fit(iris.data, iris.target)
print("Accuracy: %f" % sl.oob_score_)
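The blend-based evaluation described above can be sketched with scikit-learn alone. This is an illustration under stated assumptions, not the code behind oob_score_flag: the fold setup, the model choices, and names such as blend and stage0_fits are hypothetical.

```python
import numpy as np
from sklearn import datasets
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)
n_folds = 3
folds = list(StratifiedKFold(n_splits=n_folds, shuffle=True,
                             random_state=1).split(X, y))

stage0 = [RandomForestClassifier(n_estimators=40, random_state=1),
          LogisticRegression(max_iter=1000)]
n_classes = len(np.unique(y))
blend = np.zeros((len(y), n_classes * len(stage0)))

# Each Stage 0 model is fitted once per fold; its predictions on the
# held-out part of that fold form one "blend".
stage0_fits = 0
for j, clf in enumerate(stage0):
    cols = slice(j * n_classes, (j + 1) * n_classes)
    for train_idx, test_idx in folds:
        model = clone(clf).fit(X[train_idx], y[train_idx])
        blend[test_idx, cols] = model.predict_proba(X[test_idx])
        stage0_fits += 1

# Stage 1 is cross-validated on the blends with the same folds:
# fit on two blends, score on the remaining one, then average.
scores = []
for train_idx, test_idx in folds:
    stage1 = LogisticRegression(max_iter=1000)
    stage1.fit(blend[train_idx], y[train_idx])
    scores.append(stage1.score(blend[test_idx], y[test_idx]))

oob_score = float(np.mean(scores))
print("Stage 0 fits: %d, oob score: %f" % (stage0_fits, oob_score))
```

Here every Stage 0 model is fitted n_folds times in total, whereas wrapping the whole stacked model in an outer cross-validation would refit each of them n_folds times inside every outer fold.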

3) Caching of stage 1 blend_data and trained models. (optional)

If a cache exists, recalculation for stage 0 is skipped. This is useful for stage 1 tuning.

sl = StackedClassifier(bclf, clfs, save_stage0=True, save_dir='stack_temp')

Features of Joblibed Classifier / Regressor

Joblibed Classifier / Regressor is a simple cache system for scikit-learn machine learning models. You can use it with minimal code modification.

On the first fit and prediction, the model is computed normally. At the same time, the fitted model and the prediction results are saved as .pkl and .csv files respectively.

On the second fit and prediction, if a cache exists, the model and prediction results are loaded from the cache and nothing is recalculated.

e.g.

from sklearn import datasets
from sklearn.cross_validation import StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from stacked_generalization.lib.joblibed import JoblibedClassifier

# Load iris
iris = datasets.load_iris()

# Declaration of Joblibed model
rf = RandomForestClassifier(n_estimators=40)
clf = JoblibedClassifier(rf, "rf")

train_idx, test_idx = list(StratifiedKFold(iris.target, 3))[0]

xs_train = iris.data[train_idx]
y_train = iris.target[train_idx]
xs_test = iris.data[test_idx]
y_test = iris.target[test_idx]

# The sample indices must be passed so that a matching cache can be identified.
clf.fit(xs_train, y_train, train_idx)
score = clf.score(xs_test, y_test, test_idx)
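The caching behaviour described above can be illustrated with a small wrapper built on joblib. This is a hedged sketch, not the JoblibedClassifier source: the class name CachedClassifier, the cache file naming, and the use of an MD5 digest of the sample indices as the cache key are all assumptions. It caches only the fitted model, not the prediction results.

```python
import hashlib
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier


class CachedClassifier:
    """Hypothetical wrapper: stores the fitted model keyed by name and sample indices."""

    def __init__(self, estimator, name, cache_dir):
        self.estimator = estimator
        self.name = name
        self.cache_dir = cache_dir
        self.loaded_from_cache = False

    def _path(self, indices):
        # The digest of the indices distinguishes caches built on different samples.
        raw = np.asarray(indices, dtype=np.int64).tobytes()
        digest = hashlib.md5(raw).hexdigest()
        return os.path.join(self.cache_dir, "%s_%s.pkl" % (self.name, digest))

    def fit(self, X, y, indices):
        path = self._path(indices)
        if os.path.exists(path):
            # Cache hit: load the fitted model instead of refitting.
            self.estimator = joblib.load(path)
            self.loaded_from_cache = True
        else:
            # Cache miss: fit normally and save the result.
            self.estimator.fit(X, y)
            joblib.dump(self.estimator, path)
            self.loaded_from_cache = False
        return self

    def predict(self, X):
        return self.estimator.predict(X)


X, y = datasets.load_iris(return_X_y=True)
train_idx = np.arange(0, 150, 2)
cache_dir = tempfile.mkdtemp()

first = CachedClassifier(RandomForestClassifier(n_estimators=10, random_state=0),
                         "rf", cache_dir)
first.fit(X[train_idx], y[train_idx], train_idx)

second = CachedClassifier(RandomForestClassifier(n_estimators=10, random_state=0),
                          "rf", cache_dir)
second.fit(X[train_idx], y[train_idx], train_idx)

print(first.loaded_from_cache, second.loaded_from_cache)
```

Keying the cache on the sample indices is what lets different cross-validation folds of the same model coexist in one cache directory.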

Software Requirements

Python (2.7 or 3.5 or later)
numpy
scikit-learn
pandas

Installation

pip install stacked_generalization

License

MIT License. (http://opensource.org/licenses/mit-license.php)

Copyright

Much of the stacking implementation is based on the following. Thanks! https://github.com/log0/vertebral/blob/master/stacked_generalization.py

Other

Any contributions (implementation, documentation, tests, or ideas) are welcome.

References

[1] L. Breiman, "Stacked Regressions", Machine Learning, 24, 49-64 (1996).
[2] J. Sill et al., "Feature Weighted Linear Stacking", https://arxiv.org/abs/0911.0460, 2009.


Comments

something goes wrong when I use command line "pip install stacked_generalization"
here is the detail.

pip install stacked_generalization
Collecting stacked_generalization
  Using cached https://files.pythonhosted.org/packages/d0/99/e42d99355f00068aa2f8eef5f88c1610558f10217e2704067b5134629397/stacked_generalization-0.0.4.zip
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-build-mm6gy5sl/stacked-generalization/setup.py", line 28, in <module>
        long_description=read_md('Readme.md'),
      File "/tmp/pip-build-mm6gy5sl/stacked-generalization/setup.py", line 13, in <lambda>
        read_md = lambda f: pypandoc.convert(f, 'rst')
      File "/home/wxk/anaconda3/lib/python3.6/site-packages/pypandoc/__init__.py", line 66, in convert
        raise RuntimeError("Format missing, but need one (identified source as text as no "
    RuntimeError: Format missing, but need one (identified source as text as no file with that name was found).

----------------------------------------

Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-mm6gy5sl/stacked-generalization/

Please tell me how to fix it if you know it.
opened by wEEang763162 2
fix import util

you can try it with pip3 install -e [email protected]:geoHeil/stacked_generalization.git@fixImportUtil#egg=stacked-generalization

I fixed the failing import of utils

opened by geoHeil 1
ModuleNotFoundError and ImportError due to deprecated sklearn sub-modules
Hi,

I was trying to rerun the example, after doing:

from stacked_generalization.lib.stacking import FWLSRegressor

I got the following errors, for stacking.py and util.py:

ModuleNotFoundError: No module named 'sklearn.cross_validation'
ImportError: cannot import name 'joblib' from 'sklearn.externals'

According to https://stackoverflow.com/questions/30667525/importerror-no-module-named-sklearn-cross-validation and https://stackoverflow.com/questions/61893719/importerror-cannot-import-name-joblib-from-sklearn-externals, it seems to be related to the deprecation of sub-modules. It worked for me to replace all the:

from sklearn.cross_validation import StratifiedKFold, KFold
from sklearn.externals import joblib

by

from sklearn.model_selection import StratifiedKFold, KFold
import joblib

Thanks.
opened by ShaunFChen 0
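For code that has to run against both old and new scikit-learn releases, the replacement described in the issue above could also be written as a fallback import. This is a minimal sketch and not part of the library.

```python
# Prefer the current module locations and fall back to the legacy ones
# only when the current ones are unavailable.
try:
    from sklearn.model_selection import StratifiedKFold, KFold
except ImportError:
    from sklearn.cross_validation import StratifiedKFold, KFold

try:
    import joblib
except ImportError:
    from sklearn.externals import joblib
```

Note that the import fix alone may not be sufficient: the splitter classes in sklearn.model_selection take n_splits in the constructor and the data in split(X, y), whereas the legacy classes took the labels or sample count directly in the constructor, so call sites may need updating as well.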
Improve memory management.

The current version of stacked_generalization continues to hold all stage 0 models after fitting. With a huge stacking model (~100 models), stacked_generalization needs a huge amount of memory.

I want to improve memory management by adding lazy evaluation, i.e. hold a stage 0 model only while predicting and then release the memory. (Save the model to the joblib cache and load it only when needed.)

opened by fukatani 0
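One way the lazy evaluation proposed above could look is sketched below. It is an illustration of the idea only, not code from the library: the LazyModel class, its attributes, and the file naming are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn import datasets
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier


class LazyModel:
    """Hypothetical wrapper that keeps the fitted model on disk, not in memory."""

    def __init__(self, prototype, path):
        self.prototype = prototype  # unfitted estimator, cheap to hold
        self.path = path            # where the fitted model is stored

    def fit(self, X, y):
        model = clone(self.prototype).fit(X, y)
        joblib.dump(model, self.path)
        return self  # the fitted model goes out of scope here

    def predict_proba(self, X):
        model = joblib.load(self.path)  # loaded only for this call
        return model.predict_proba(X)


X, y = datasets.load_iris(return_X_y=True)
cache_dir = tempfile.mkdtemp()

stage0 = [LazyModel(RandomForestClassifier(n_estimators=10, random_state=i),
                    os.path.join(cache_dir, "stage0_%d.pkl" % i))
          for i in range(3)]

for m in stage0:
    m.fit(X, y)

# Only one fitted model is resident in memory at a time while predicting.
probas = [m.predict_proba(X) for m in stage0]
print(len(probas), probas[0].shape)
```

The trade-off is extra disk I/O on every prediction call in exchange for a memory footprint that no longer grows with the number of stage 0 models.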
