Stacked Generalization (Ensemble Learning)

Overview

Stacking (stacked generalization)

ikki407/stacking - Simple and useful stacking library, written in Python.

Models from scikit-learn, XGBoost, and Keras can be used for stacking.
All out-of-fold predictions can be saved for further analysis after training.

Description

Stacking (sometimes called stacked generalization) involves training a learning algorithm to combine the predictions of several other learning algorithms. The basic idea is to train a pool of base classifiers and then use another classifier to combine their predictions, with the aim of reducing the generalization error.
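As a rough illustration of this idea (not this library's implementation), the sketch below builds out-of-fold predictions from two toy base models and fits a second-stage model on them. The classes MeanModel, SlopeModel, and BlendModel, the out_of_fold helper, and the toy data are all hypothetical, and only the standard library is used.

```python
# Minimal stacking sketch with toy regressors (standard library only).
# All names are illustrative; none of them come from the stacking library.

class MeanModel:
    """Base model: predicts the training mean for every row."""
    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class SlopeModel:
    """Base model: least-squares line through the origin on feature 0."""
    def fit(self, X, y):
        num = sum(row[0] * t for row, t in zip(X, y))
        den = sum(row[0] ** 2 for row in X)
        self.w_ = num / den if den else 0.0
        return self

    def predict(self, X):
        return [self.w_ * row[0] for row in X]


class BlendModel:
    """Second-stage model: blend a * p0 + (1 - a) * p1, a chosen on a grid."""
    def fit(self, X, y):
        def sse(a):
            return sum((a * r[0] + (1 - a) * r[1] - t) ** 2 for r, t in zip(X, y))
        self.a_ = min((i / 10 for i in range(11)), key=sse)
        return self

    def predict(self, X):
        return [self.a_ * r[0] + (1 - self.a_) * r[1] for r in X]


def out_of_fold(model_cls, X, y, fold_ids):
    """Predict each row with a model trained on the other folds only."""
    oof = [None] * len(X)
    for k in sorted(set(fold_ids)):
        train = [i for i, f in enumerate(fold_ids) if f != k]
        valid = [i for i, f in enumerate(fold_ids) if f == k]
        model = model_cls().fit([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(valid, model.predict([X[i] for i in valid])):
            oof[i] = p
    return oof


X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
fold_ids = [0, 0, 1, 1, 2, 2]

# Stage 1: one out-of-fold prediction column per base model.
stage1 = [out_of_fold(m, X, y, fold_ids) for m in (MeanModel, SlopeModel)]

# Stage 2: the base-model predictions become the input features.
X_stage2 = [list(row) for row in zip(*stage1)]
meta = BlendModel().fit(X_stage2, y)
print(meta.a_, meta.predict(X_stage2))
```

Because every second-stage training row was predicted by a model that never saw that row, the second-stage model is fitted on predictions that resemble what the base models produce on unseen data.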

This blog is very helpful to understand stacking and ensemble learning.

Usage

See working example:

To run these examples, run sh run.sh. Note that:

  1. The train and test datasets go under data/input.

  2. Features created from the original dataset need to be under data/output/features.

  3. Models for stacking are defined in scripts.py under the scripts folder.

  4. The created features need to be listed in that script.

  5. Then run sh run.sh (python scripts/XXX.py).

Detailed Usage

  1. Specify the train dataset, its target data, and the test dataset.

    FEATURE_LIST_stage1 = {
                    'train':(
                             INPUT_PATH + 'train.csv',
                             FEATURES_PATH + 'train_log.csv',
                            ),
    
                    'target':(
                             INPUT_PATH + 'target.csv',
                            ),
    
                    'test':(
                             INPUT_PATH + 'test.csv',
                             FEATURES_PATH + 'test_log.csv',
                            ),
                    }
  2. Define model classes that inherit from the BaseModel class; these are used in Stage 1, Stage 2, ..., Stage N.

    # For Stage 1
    PARAMS_V1 = {
            'colsample_bytree':0.80,
            'learning_rate':0.1,"eval_metric":"auc",
            'max_depth':5, 'min_child_weight':1,
            'nthread':4,
            'objective':'binary:logistic','seed':407,
            'silent':1, 'subsample':0.60,
            }
    
    class ModelV1(BaseModel):
            def build_model(self):
                return XGBClassifier(params=self.params, num_round=10)
    
    ...
    
    # For Stage 2
    PARAMS_V1_stage2 = {
                        'penalty':'l2',
                        'tol':0.0001, 
                        'C':1.0, 
                        'random_state':None, 
                        'verbose':0, 
                        'n_jobs':8
                        }
    
    class ModelV1_stage2(BaseModel):
            def build_model(self):
                return LR(**self.params)
  3. Train each model of Stage 1 for stacking.

    m = ModelV1(name="v1_stage1",
                flist=FEATURE_LIST_stage1,
                params = PARAMS_V1,
                kind = 'st'
                )
    m.run()
    
    ...
  4. Train the model(s) of Stage 2 using the predictions of the Stage 1 models.

    FEATURE_LIST_stage2 = {
                'train': (
                         TEMP_PATH + 'v1_stage1_all_fold.csv',
                         TEMP_PATH + 'v2_stage1_all_fold.csv',
                         TEMP_PATH + 'v3_stage1_all_fold.csv',
                         TEMP_PATH + 'v4_stage1_all_fold.csv',
                         ...
                         ),
    
                'target':(
                         INPUT_PATH + 'target.csv',
                         ),
    
                'test': (
                        TEMP_PATH + 'v1_stage1_test.csv',
                        TEMP_PATH + 'v2_stage1_test.csv',
                        TEMP_PATH + 'v3_stage1_test.csv',
                        TEMP_PATH + 'v4_stage1_test.csv',
                        ...                     
                        ),
                }
    
    # Models
    m = ModelV1_stage2(name="v1_stage2",
                    flist=FEATURE_LIST_stage2,
                    params = PARAMS_V1_stage2,
                    kind = 'st',
                    )
    m.run()
  5. The final result is saved as v1_stage2_TestInAllTrainingData.csv.

Prerequisite

  • (macOS) Install xgboost manually first: pip install xgboost
  • (Optional) Install paratext, a fast CSV loading library

Installation

To install stacking, cd to the stacking folder and run the install command (up-to-date version, recommended):

sudo python setup.py install

You can also install stacking from PyPI:

pip install stacking

Files

Details of scripts

  • base.py:
    • Base models for stacking are defined here (using sklearn.base.BaseEstimator).
    • Some models are defined here. e.g., XGBoost, Keras, Vowpal Wabbit.
    • These models are wrapped to be scikit-learn-like (using sklearn.base.ClassifierMixin, sklearn.base.RegressorMixin).
    • That is, each model class provides the methods fit(), predict_proba(), and predict().

New user-defined models can be added here.

Scikit-learn models can be used.

BaseModel takes a kind argument with the following values:

  • 's': Stacking. Saves the out-of-fold (oof) predictions ({model_name}_all_fold.csv) and the average of the test predictions from the models trained on each fold ({model_name}_test.csv). These files are used for the next level of stacking.

  • 't': Training with all data and predicting the test set ({model_name}_TestInAllTrainingData.csv). No validation data are used in this training.

  • 'st': Stacking, then training with all data and predicting the test set ('s' and 't').

  • 'cv': Cross-validation only, without saving the predictions.
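The modes above can be summarized by which prediction files each one is described as saving. The helper below is a hypothetical sketch, not part of the library; only the file name patterns are taken from the list above.

```python
# Hypothetical helper (not a library function): maps each `kind` value to
# the prediction files the list above says it saves.

def expected_outputs(model_name, kind):
    stacking = [f"{model_name}_all_fold.csv", f"{model_name}_test.csv"]
    full_train = [f"{model_name}_TestInAllTrainingData.csv"]
    outputs = {
        "s": stacking,
        "t": full_train,
        "st": stacking + full_train,
        "cv": [],  # cross-validation only, no predictions are saved
    }
    if kind not in outputs:
        raise ValueError(f"unknown kind: {kind!r}")
    return outputs[kind]


print(expected_outputs("v1_stage1", "st"))
```

For a Stage 1 model, the first two files are the ones a Stage 2 feature list would point to.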

Define several models and their parameters used for stacking. Define the task details at the top of the script. The train and test feature sets are defined here. The CV-fold index also needs to be defined.

Any level stacking can be defined.

PredictionFiles

Reference

[1] Wolpert, David H. (1992). Stacked generalization. Neural Networks, 5(2), 241-259.

[2] Ensemble learning(Stacking)

[3] KAGGLE ENSEMBLING GUIDE

Comments
  • Test

    The following tests need to be carried out:

    • [x] binary classification
    • [x] multi-class classification
    • [x] regression

    First, we need to create datasets (train and test) for the above problems.

    Next, define several models.

    Finally, do stacking.

    opened by ikki407 4
  • About EXAMPLE

    Hi, I found a few things you may not have noticed:

    1. Running 'pip install stacking' does not work on macOS because the xgboost package cannot be installed automatically, so xgboost needs to be installed manually first.
    2. In "examples/multi_class/scripts/multiclass.py", line 25 is wrong; it should be "from keras.regularizers import l1, l2". It is a small mistake, but it keeps 'bash run.sh' from working.
    3. 'No module named paratext' is another problem. The solution is to get paratext from https://github.com/wiseio/paratext, and don't forget to install swig with 'brew install swig' if brew is installed.

    That's it; now I can run the EXAMPLE. I suggest describing the environment in README.md just in case. BTW, the project is wonderful!

    enhancement 
    opened by MrLevo520 2
  • Added validation in training

    In stacking (i.e., cross-validation), validation data can now be used to monitor the training process. This is done for XGBoost and Keras because these models report evaluation metrics during training. ( #8 )

    before(XGB):

    Fold 0
    [0] train-mlogloss:1.963271
    [1] train-mlogloss:1.716260
    [2] train-mlogloss:1.519960
    [3] train-mlogloss:1.358485
    [4] train-mlogloss:1.224851
    [5] train-mlogloss:1.111999
    [6] train-mlogloss:1.012638
    [7] train-mlogloss:0.925325
    [8] train-mlogloss:0.846453
    [9] train-mlogloss:0.778393
    logloss:  0.947207616242
    

    after(XGB):

    Fold 0
    [0] train-mlogloss:1.963271 validation-mlogloss:2.006139
    [1] train-mlogloss:1.716260 validation-mlogloss:1.782591
    [2] train-mlogloss:1.519960 validation-mlogloss:1.601844
    [3] train-mlogloss:1.358485 validation-mlogloss:1.462138
    [4] train-mlogloss:1.224851 validation-mlogloss:1.343135
    [5] train-mlogloss:1.111999 validation-mlogloss:1.242986
    [6] train-mlogloss:1.012638 validation-mlogloss:1.152776
    [7] train-mlogloss:0.925325 validation-mlogloss:1.078933
    [8] train-mlogloss:0.846453 validation-mlogloss:1.009647
    [9] train-mlogloss:0.778393 validation-mlogloss:0.947208
    logloss:  0.947207616242
    

    before(Keras):

    Fold 0
    Train on 1072 samples, validate on 275 samples
    Epoch 1/15
    1072/1072 [==============================] - 0s - loss: 0.9323 - acc: 0.7015
    Epoch 2/15
    1072/1072 [==============================] - 0s - loss: 0.3831 - acc: 0.8843
    Epoch 3/15
    1072/1072 [==============================] - 0s - loss: 0.3377 - acc: 0.8909
    Epoch 4/15
    1072/1072 [==============================] - 0s - loss: 0.2932 - acc: 0.9002
    Epoch 5/15
    1072/1072 [==============================] - 0s - loss: 0.2936 - acc: 0.9198
    Epoch 6/15
    1072/1072 [==============================] - 0s - loss: 0.3889 - acc: 0.8983
    Epoch 7/15
    1072/1072 [==============================] - 0s - loss: 0.3491 - acc: 0.9179
    Epoch 8/15
    1072/1072 [==============================] - 0s - loss: 0.3642 - acc: 0.9114
    Epoch 9/15
    1072/1072 [==============================] - 0s - loss: 0.3045 - acc: 0.9216
    Epoch 10/15
    1072/1072 [==============================] - 0s - loss: 0.4251 - acc: 0.9114
    mlogloss = 0.4170
    

    after(Keras):

    Fold 0
    Train on 1072 samples, validate on 275 samples
    Epoch 1/15
    1072/1072 [==============================] - 0s - loss: 0.9323 - acc: 0.7015 - val_loss: 0.2980 - val_acc: 0.9018
    Epoch 2/15
    1072/1072 [==============================] - 0s - loss: 0.3831 - acc: 0.8843 - val_loss: 0.1917 - val_acc: 0.9309
    Epoch 3/15
    1072/1072 [==============================] - 0s - loss: 0.3377 - acc: 0.8909 - val_loss: 0.2097 - val_acc: 0.9309
    Epoch 4/15
    1072/1072 [==============================] - 0s - loss: 0.2932 - acc: 0.9002 - val_loss: 0.1793 - val_acc: 0.9382
    Epoch 5/15
    1072/1072 [==============================] - 0s - loss: 0.2936 - acc: 0.9198 - val_loss: 0.2713 - val_acc: 0.9309
    Epoch 6/15
    1072/1072 [==============================] - 0s - loss: 0.3889 - acc: 0.8983 - val_loss: 0.2549 - val_acc: 0.9382
    Epoch 7/15
    1072/1072 [==============================] - 0s - loss: 0.3491 - acc: 0.9179 - val_loss: 0.3072 - val_acc: 0.9418
    Epoch 8/15
    1072/1072 [==============================] - 0s - loss: 0.3642 - acc: 0.9114 - val_loss: 0.2830 - val_acc: 0.9418
    Epoch 9/15
    1072/1072 [==============================] - 0s - loss: 0.3045 - acc: 0.9216 - val_loss: 0.3789 - val_acc: 0.9236
    Epoch 10/15
    1072/1072 [==============================] - 0s - loss: 0.4251 - acc: 0.9114 - val_loss: 0.4170 - val_acc: 0.9273
    mlogloss = 0.4170
    
    opened by ikki407 1
  • Multi class

    Added multi-class classification methods in base_fixed_fold.py and modified the way predictions are held. ( #3 )

    Tested each function for multi-class classification. ( #9 ) Binary classification and regression problems were also tested.

    Scikit-learn methods are not tested yet, so they still need to be tested.

    opened by ikki407 1
  • Added reg model & some fixed

    Added a regression test under test/regression/. ( #9 ) To run it: sh run.sh

    Fixed normalization in KerasRegressor.

    Modified the way the problem type is set.

    It needs to be set at the top of your model-definition script, as follows.

    # ----- Set problem type!! -----
    problem_type = 'regression'
    classification_type = ''
    eval_type = 'rmse'
    
    BaseModel.set_prob_type(problem_type, classification_type, eval_type)
    

    I used class variables like BaseModel.problem_type in base_fixed_fold.py instead of global variables. This needs discussion.

    opened by ikki407 1
  • Added binary test & some fixed

    Tested binary prediction under test/binary/. ( #9 done binary classification) To run it: sh run.sh
    XGBoost and Keras are tested.

    Also modified base_fixed_fold.py:

    • Modified PATH variables
    • Fixed how the CV-fold index file is created ( #4, #6 )
      • Users need to define their own create_cv_id()
    • Deleted Keras arguments (i.e., show_accuracy)
    • Deleted save_pred_as_submit_format() ( #5, #6 )
      • Users need to create the submission-format file themselves (i.e., adding prediction ID, name...)
    opened by ikki407 1
  • validation in training of stacking

    In stacking, the out-of-fold data is not passed to the models (XGBoost and NN) as validation data. That is, validation scores are calculated only after model training is done. It would be convenient to check the validation score every epoch, so the out-of-fold data also needs to be passed in during model training.

    opened by ikki407 1
  • load_data() bug

    The current load_data() assumes the same features for train and test. (related #2 ) And the target is included in the train dataset.

    FEATURE_LIST_stage1 = {
                    'train':(INPUT_PATH + 'train.csv',
                             FEATURES_PATH + 'train_log.csv',
    
                            ),#target is in 'train'
                    'test':(INPUT_PATH + 'test.csv',
                            FEATURES_PATH + 'test_log.csv',
                            ),
                    }
    

    However, in level 2 stacking, the following code causes an error if the target data is passed only in the train dataset. The error occurs because the train and test lists have different lengths.

        FEATURE_LIST_stage2 = {
                    'train':(INPUT_PATH + 'target.csv',
    
                             TEMP_PATH + 'v1_stage1_all_fold.csv',
                             TEMP_PATH + 'v2_stage1_all_fold.csv',
                             TEMP_PATH + 'v3_stage1_all_fold.csv',
                             TEMP_PATH + 'v4_stage1_all_fold.csv',
                             TEMP_PATH + 'v5_stage1_all_fold.csv',
                             TEMP_PATH + 'v6_stage1_all_fold.csv',
    
                            ),#target is in 'train'
                    'test':(
                             TEMP_PATH + 'v1_stage1_test.csv',
                             TEMP_PATH + 'v2_stage1_test.csv',
                             TEMP_PATH + 'v3_stage1_test.csv',
                             TEMP_PATH + 'v4_stage1_test.csv',
                             TEMP_PATH + 'v5_stage1_test.csv',
                             TEMP_PATH + 'v6_stage1_test.csv',                       
                            ),
                    }
    

    This bug is fixed by using the length of each list when reading. But for a more general library, a new key in the feature dictionary, i.e., target, should be added like:

        FEATURE_LIST_stage2 = {
                    'train':(
                             TEMP_PATH + 'v1_stage1_all_fold.csv',
                             TEMP_PATH + 'v2_stage1_all_fold.csv',
                             TEMP_PATH + 'v3_stage1_all_fold.csv',
                            ),#target is not in 'train'
    
                    'target':(
                             INPUT_PATH + 'target.csv',
                            ),#target is in 'target'
    
                    'test':(
                             TEMP_PATH + 'v1_stage1_test.csv',
                             TEMP_PATH + 'v2_stage1_test.csv',
                             TEMP_PATH + 'v3_stage1_test.csv',
                            ),
                    }
    

    This is very reasonable.

    bug 
    opened by ikki407 0
  • Checking function when making directory

    Currently the data directories (i.e., data/input(output)/...) are created when stacking is imported if they do not exist. With this simple implementation, directories can be created without the user intending it. It would be useful to let the user choose whether to create them on the first import.

    For example:

    Can new directories for input and output data be created? [Y/n]
    Y(N) <---- input
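
One way such a confirmation could look is sketched below. The function name ensure_dirs and its ask parameter are hypothetical and not part of the library; only os.path.isdir and os.makedirs from the standard library are used.

```python
import os


def ensure_dirs(paths, ask=input):
    """Create the missing directories only after the user confirms.

    Returns the list of directories that were created.
    """
    missing = [p for p in paths if not os.path.isdir(p)]
    if not missing:
        return []
    answer = ask("Can new directories for input and output data be created? [Y/n] ")
    # An empty answer counts as "yes", matching the [Y/n] default.
    if answer.strip().lower() not in ("", "y", "yes"):
        return []
    for p in missing:
        os.makedirs(p, exist_ok=True)
    return missing


# Example call (prompts on the console when a directory is missing):
# ensure_dirs(["data/input", "data/output/features"])
```

Passing the prompt function as a parameter keeps the check easy to test and makes it possible to skip the prompt in non-interactive runs.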
    
    opened by ikki407 0
  • Task-dependent functions

    Should task-dependent functions be changed to virtual functions?

    Users would need to define such functions themselves (e.g., CV-fold index, save_pred_as_submit_format).
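
In Python, one way to express such virtual functions is with abstract methods from the abc module. The sketch below is only an assumption about how this could look: the class names TaskHooks and MyTask are hypothetical, the method names are borrowed from functions mentioned in this project, and the submission method returns rows instead of writing a file to keep the example small.

```python
from abc import ABC, abstractmethod


class TaskHooks(ABC):
    """Hypothetical base class: task-specific steps the user must supply."""

    @abstractmethod
    def create_cv_id(self, n_rows):
        """Return one fold ID per training row."""

    @abstractmethod
    def save_pred_as_submit_format(self, predictions):
        """Return the predictions as rows in the task's submission format."""


class MyTask(TaskHooks):
    def create_cv_id(self, n_rows):
        # Assign rows to 5 folds in round-robin order.
        return [i % 5 for i in range(n_rows)]

    def save_pred_as_submit_format(self, predictions):
        return [{"id": i, "prediction": p} for i, p in enumerate(predictions)]


task = MyTask()
print(task.create_cv_id(7))
```

Instantiating TaskHooks directly, or a subclass that leaves one of the methods undefined, raises TypeError, so a missing task-specific function is caught before training starts.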

    opened by ikki407 0
  • How to create CV-fold index

    • Previous version of the CV-fold file, using row indices.

    train:

    [2,3,4,5,6,7,8,9]

    [0,1,4,5,6,7,8,9]

    [0,1,2,3,6,7,8,9]

    [0,1,2,3,4,5,8,9]

    [0,1,2,3,4,5,6,7]

    test:

    [0,1]

    [2,3]

    [4,5]

    [6,7]

    [8,9]

    • Current version of the CV-fold file (better than the previous one), using fold IDs.

    [0,0,1,1,2,2,3,3,4,4]

    But the current BaseModel still uses the previous architecture: if the current format is given, it is converted to the previous format. This needs to be changed so that the format is used as given. Also, .ix needs to be changed to .iloc for stable behavior.

    If a new CV-fold file is created, the global CV-fold file name needs to be replaced with the new CV-fold file name.
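
The relation between the two formats can be sketched as follows. The function fold_ids_to_indices is a hypothetical illustration, not the library's code; it converts one fold ID per row into the per-fold train and test index lists shown above.

```python
def fold_ids_to_indices(fold_ids):
    """Convert one fold ID per row into (train, test) index lists per fold."""
    splits = []
    for k in sorted(set(fold_ids)):
        test = [i for i, f in enumerate(fold_ids) if f == k]
        train = [i for i, f in enumerate(fold_ids) if f != k]
        splits.append((train, test))
    return splits


splits = fold_ids_to_indices([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
for train, test in splits:
    print(train, test)
```

The fold-ID format stores a single list with one entry per row, while the index format needs two lists per fold, which is one reason the former is easier to keep in a file.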

    opened by ikki407 0
  • Efficient PATH setting problem

    Paths are currently built by string concatenation with a slash /, like

    PATH = "foo/"
    PATH + 'hoge'
    

    , but it seems better to use os.path.join, like

    import os
    
    PATH = "foo"
    os.path.join(PATH, 'hoge')
    
    opened by ikki407 0
  • Large data loading problem

    Training and test data are currently loaded at the beginning of model building.

    However, this is not time-efficient if the data is very large.

    So, if the data does not change throughout the stacking script, it should be cached somewhere in base.py.

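    STORED_X = None  # module-level cache; None until the first load

    def load_data(stored=True):
        global STORED_X
        # `if STORED_X:` raises ValueError on a DataFrame, so test for None
        if STORED_X is not None:
            return STORED_X

        X = ...  # load the data here (elided in the original)

        if stored:
            STORED_X = X.copy()
            return STORED_X
        return X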
    
    opened by ikki407 0
  • Version update stability

    Refactor hard-coded parts for stability in preparation for version upgrades of libraries such as sklearn, Keras, and XGBoost, specifically the initialization of the BaseModel class.

    opened by ikki407 0
  • Reinitialize weights of neural net

    In the Keras-implemented neural net, to avoid recompiling, the initial weights after compilation are saved and reused at the start of each training run in cross-validation. However, this means the initial weights are the same in every fold. The initial weights should instead differ for each training run.

    A possible solution is to pass the compilation arguments (e.g., optimizer, loss, and metrics). In binary_class.py,

    class ModelV2(BaseModel):
            def build_model(self):
                model = Sequential()
                model.add(Dense(64, input_shape=nn_input_dim_NN, init='he_normal'))
                model.add(LeakyReLU(alpha=.00001))
                model.add(Dropout(0.5))
    
                model.add(Dense(output_dim, init='he_normal'))
                model.add(Activation('softmax'))
                sgd = SGD(lr=0.1, decay=1e-5, momentum=0.9, nesterov=True)
    
                compile_options = {
                                               'optimizer': sgd, 
                                               'loss': 'categorical_crossentropy', 
                                               'metrics': ["accuracy"]
                                               }
    
                return KerasClassifier(nn=model, compile_options=compile_options, **self.params)
    

    In base.py,

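    import copy
    from sklearn.base import BaseEstimator, ClassifierMixin

    class KerasClassifier(BaseEstimator, ClassifierMixin):
        def __init__(self, nn, compile_options=None, **params):
            self.nn = nn
            self.compile_options = compile_options or {}
            self.params = params  # extra keyword arguments, kept as given

        def fit(self, X, y, X_test=None, y_test=None):
            # Copy the uncompiled net and compile the copy for this fit
            self.compiled_nn = copy.copy(self.nn)
            self.compiled_nn.compile(**self.compile_options)

            return self.compiled_nn.fit(X, y)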
    

    But this approach increases memory consumption...

    opened by ikki407 0
Owner
Ikki Tanaka
Data Scientist, Machine Learning/Reinforcement Learning Engineer. Kaggle Master.