Python package for stacking (machine learning technique)

Overview

vecstack

Python package for stacking (stacked generalization) featuring lightweight functional API and fully compatible scikit-learn API
Convenient way to automate OOF computation, prediction and bagging using any number of models

Get started

Installation

Note: Python 3.5 or higher is required. If you’re still using Python 2.7 or 3.4 see installation details here

  • Classic 1st time installation (recommended):
    • pip install vecstack
  • Install for current user only (if you have trouble with write permissions):
    • pip install --user vecstack
  • If your PATH doesn't work:
    • /usr/bin/python -m pip install vecstack
    • C:/Python36/python -m pip install vecstack
  • Upgrade vecstack and all dependencies:
    • pip install --upgrade vecstack
  • Upgrade vecstack WITHOUT upgrading dependencies:
    • pip install --upgrade --no-deps vecstack
  • Upgrade directly from GitHub WITHOUT upgrading dependencies:
    • pip install --upgrade --no-deps https://github.com/vecxoz/vecstack/archive/master.zip
  • Uninstall
    • pip uninstall vecstack

Usage. Functional API

from sklearn.linear_model import LinearRegression, Ridge
from vecstack import stacking

# Get your data

# Initialize 1st level estimators
models = [LinearRegression(),
          Ridge(random_state=0)]

# Get your stacked features in a single line
S_train, S_test = stacking(models, X_train, y_train, X_test, regression=True, verbose=2)

# Use 2nd level estimator with stacked features

Usage. Scikit-learn API

from sklearn.linear_model import LinearRegression, Ridge
from vecstack import StackingTransformer

# Get your data

# Initialize 1st level estimators
estimators = [('lr', LinearRegression()),
              ('ridge', Ridge(random_state=0))]
              
# Initialize StackingTransformer
stack = StackingTransformer(estimators, regression=True, verbose=2)

# Fit
stack = stack.fit(X_train, y_train)

# Get your stacked features
S_train = stack.transform(X_train)
S_test = stack.transform(X_test)

# Use 2nd level estimator with stacked features

Stacking FAQ

  1. How can I report an issue? How can I ask a question about stacking or vecstack package?
  2. How can I say thanks?
  3. How to cite vecstack?
  4. What is stacking?
  5. What about stacking name?
  6. Do I need stacking at all?
  7. Can you explain stacking (stacked generalization) in 10 lines of code?
  8. Why do I need complicated inner procedure for stacking?
  9. I want to implement stacking (stacked generalization) from scratch. Can you help me?
  10. What is OOF?
  11. What are estimator, learner, model?
  12. What is blending? How is it related to stacking?
  13. How to optimize weights for weighted average?
  14. What is better: weighted average for current level or additional level?
  15. What is bagging? How is it related to stacking?
  16. How many models should I use on a given stacking level?
  17. How many stacking levels should I use?
  18. How do I choose models for stacking?
  19. I am trying hard but still can't beat my best single model with stacking. What is wrong?
  20. What should I choose: functional API (stacking function) or Scikit-learn API (StackingTransformer)?
  21. How do parameters of stacking function and StackingTransformer correspond?
  22. Why Scikit-learn API was implemented as transformer and not predictor?
  23. How to estimate stacking training time and number of models which will be built?
  24. Which stacking variant should I use: 'A' ('oof_pred_bag') or 'B' ('oof_pred')?
  25. How to choose number of folds?
  26. When I transform train set I see 'Train set was detected'. What does it mean?
  27. How is the very first stacking level called: L0 or L1? Where does counting start?
  28. Can I use (Randomized)GridSearchCV to tune the whole stacking Pipeline?
  29. How to define custom metric, especially AUC?
  30. Do folds (splits) have to be the same across estimators and stacking levels? How does random_state work?

1. How can I report an issue? How can I ask a question about stacking or vecstack package?

Just open an issue here.
Ask me anything on the topic.
I'm a bit busy, so typically I answer on the next day.

2. How can I say thanks?

Just give me a star in the top right corner of the repository page.

3. How to cite vecstack?

@misc{vecstack2016,
       author = {Igor Ivanov},
       title = {Vecstack},
       year = {2016},
       publisher = {GitHub},
       howpublished = {\url{https://github.com/vecxoz/vecstack}},
}

4. What is stacking?

Stacking (stacked generalization) is a machine learning ensembling technique.
The main idea is to use predictions as features.
More specifically, we predict the train set (in a CV-like fashion) and the test set using some 1st level model(s), and then use these predictions as features for a 2nd level model. You can find more details (concept, pictures, code) in stacking tutorial.
Also make sure to check out:

5. What about stacking name?

Often it is also called stacked generalization. The term is derived from the verb to stack (to put together, to put on top of each other). It implies that we put some models on top of other models, i.e. train some models on predictions of other models. From another point of view we can say that we stack predictions in order to use them as features.

6. Do I need stacking at all?

It depends on the specific business case. The main thing to know about stacking is that it requires significant computing resources. The No Free Lunch Theorem applies as always. Stacking can give you an improvement, but at a certain price (deployment, computation, maintenance). Only an experiment for the given business case will tell you whether it is worth the effort and money.

At this point a large part of stacking users are participants of machine learning competitions. On Kaggle you can't go too far without ensembling. I can secretly tell you that at least the top half of the leaderboard in pretty much any competition uses ensembling (stacking) in some way. Stacking is less popular in production due to time and resource constraints, but I think it is gaining popularity.

7. Can you explain stacking (stacked generalization) in 10 lines of code?

Of course
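As a rough illustration (a minimal sketch, not the author's linked example), the whole idea fits in a few lines if we let scikit-learn's cross_val_predict produce the OOF predictions. The synthetic data, the choice of models and the 4 folds are assumptions made only for this sketch; vecstack itself is not used here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic data stands in for "your data" (assumption for this sketch)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = [LinearRegression(), Ridge(random_state=0)]

# OOF predictions for the train set: one column per 1st level model
S_train = np.column_stack(
    [cross_val_predict(m, X_train, y_train, cv=4) for m in models])

# Test set predictions from models fitted on the full train set
S_test = np.column_stack(
    [m.fit(X_train, y_train).predict(X_test) for m in models])

# The 2nd level model is trained on the stacked features
final_prediction = LinearRegression().fit(S_train, y_train).predict(S_test)
```

Here the test set features come from a single fit on the full train set, which resembles variant B described in the Stacking concept section.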

8. Why do I need complicated inner procedure for stacking?

I can just do the following. Why not?

model_L1 = XGBRegressor()
model_L1 = model_L1.fit(X_train, y_train)
S_train = model_L1.predict(X_train).reshape(-1, 1)  # <- DOES NOT work due to overfitting. Must be CV
S_test = model_L1.predict(X_test).reshape(-1, 1)
model_L2 = LinearRegression()
model_L2 = model_L2.fit(S_train, y_train)
final_prediction = model_L2.predict(S_test)

The code above will give a meaningless result. If we fit on X_train we can’t just predict X_train, because our 1st level model has already seen X_train, and its prediction will be overfitted. To avoid overfitting we perform a cross-validation procedure and in each fold we predict the out-of-fold (OOF) part of X_train. You can find more details (concept, pictures, code) in stacking tutorial.
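To see the effect in numbers, here is a small sketch comparing in-sample predictions with OOF predictions for the same model. The decision tree and the synthetic data are assumptions chosen because an unpruned tree memorizes its training data, which makes the gap easy to see.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression data (assumption for this sketch)
X_train, y_train = make_regression(n_samples=300, n_features=10,
                                   noise=20.0, random_state=0)

# Wrong way: predict the same data the model was fitted on
model = DecisionTreeRegressor(random_state=0)
in_sample_pred = model.fit(X_train, y_train).predict(X_train)

# Right way: each row is predicted by a model that has not seen it
oof_pred = cross_val_predict(DecisionTreeRegressor(random_state=0),
                             X_train, y_train, cv=4)

mse_in_sample = mean_squared_error(y_train, in_sample_pred)
mse_oof = mean_squared_error(y_train, oof_pred)
print('in-sample MSE:', mse_in_sample)
print('OOF MSE:      ', mse_oof)
```

The in-sample error looks far better than the model deserves, so a 2nd level model trained on such a feature would learn to trust it too much.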

9. I want to implement stacking (stacked generalization) from scratch. Can you help me?

Not a problem

10. What is OOF?

OOF is an abbreviation for out-of-fold prediction. It's also known as OOF features, stacked features, stacking features, etc. Basically it means predictions for the part of the train data that the model hasn't seen during training.

11. What are estimator, learner, model?

Basically they are the same thing: a machine learning algorithm. Often these terms are used interchangeably.
Speaking about inner stacking mechanics, you should remember that when you have a single 1st level model, at least n_folds separate models will be trained, one in each CV fold on a different subset of data. See Q23 for more details.

12. What is blending? How is it related to stacking?

Basically it is the same thing. Both approaches use predictions as features.
Often these terms are used interchangeably.
The difference is how we generate features (predictions) for the next level:

  • stacking: perform cross-validation procedure and predict each part of train set (OOF)
  • blending: predict fixed holdout set

vecstack package supports only stacking i.e. cross-validation approach. For given random_state value (e.g. 42) folds (splits) will be the same across all estimators. See also Q30.
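For comparison, blending with a fixed holdout set could look roughly like this. This is only a sketch with plain scikit-learn; the 30% holdout size, the models and the synthetic data are assumptions, and vecstack does not provide this mode.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic data (assumption for this sketch)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fixed holdout: 1st level models never see it during fitting
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.3, random_state=0)

models = [LinearRegression(), Ridge(random_state=0)]
for m in models:
    m.fit(X_fit, y_fit)

# Features for the next level: predictions for the holdout and the test set
S_hold = np.column_stack([m.predict(X_hold) for m in models])
S_test = np.column_stack([m.predict(X_test) for m in models])

# The 2nd level model sees only the holdout part of the train set
final_prediction = LinearRegression().fit(S_hold, y_hold).predict(S_test)
```

Note the trade-off: the 2nd level model is trained on the holdout rows only, while stacking gives OOF predictions for every row of the train set.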

13. How to optimize weights for weighted average?

You can use for example:

  • scipy.optimize.minimize
  • scipy.optimize.differential_evolution
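A minimal sketch with scipy.optimize.minimize is shown below. The OOF predictions are simulated (target plus noise), and the choice of MSE as the objective, the normalization trick and the Nelder-Mead method are assumptions for illustration; in practice you would plug in your own S_train, y_train and metric.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(0)
y_true = rng.normal(size=200)

# Two simulated OOF prediction columns: the first model is less noisy
S_train = np.column_stack([y_true + rng.normal(scale=0.5, size=200),
                           y_true + rng.normal(scale=1.0, size=200)])

def normalize(w):
    """Make weights non-negative and sum to 1."""
    w = np.abs(w)
    return w / w.sum()

def mse_of_weights(w):
    """MSE of the weighted average of OOF predictions."""
    return np.mean((y_true - S_train @ normalize(w)) ** 2)

# Start from equal weights
w0 = np.full(S_train.shape[1], 1.0 / S_train.shape[1])
result = minimize(mse_of_weights, w0, method='Nelder-Mead')
weights = normalize(result.x)
print('weights:', weights)
```

Optimize the weights on OOF predictions (not on in-sample predictions), then apply the same weights to the test set predictions.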

14. What is better: weighted average for current level or additional level?

By default you can start with a weighted average. It is easier to apply and more likely to give a good result. Then you can try an additional level, which can potentially outperform the weighted average (but not always and not in an easy way). Experiment is your friend.

15. What is bagging? How is it related to stacking?

Bagging, or bootstrap aggregating, works as follows: generate subsets of the training set, train models on these subsets, and then average the predictions.
The term bagging is also often used to describe the following approaches:

  • train several different models on the same data and average predictions
  • train same model with different random seeds on the same data and average predictions

So if we run stacking and just average predictions - it is bagging.

16. How many models should I use on a given stacking level?

Note 1: The best architecture can be found only by experiment.
Note 2: Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.

It depends on many factors like type of problem, type of data, quality of models, correlation of models, expected result, etc.
Some example configurations are listed below.

  • Reasonable starting point:
    • L1: 2-10 models -> L2: weighted (rank) average or single model
  • Then try to add more 1st level models and additional level:
    • L1: 10-50 models -> L2: 2-10 models -> L3: weighted (rank) average
  • If you're crunching numbers at Kaggle and decided to go wild:
    • L1: 100-inf models -> L2: 10-50 models -> L3: 2-10 models -> L4: weighted (rank) average

You can also find some winning stacking architectures on Kaggle blog, e.g.: 1st place in Homesite Quote Conversion.

17. How many stacking levels should I use?

Note 1: The best architecture can be found only by experiment.
Note 2: Always remember that higher number of levels or models does NOT guarantee better result. The key to success in stacking (and ensembling in general) is diversity - low correlation between models.

For some example configurations see Q16.

18. How do I choose models for stacking?

Based on experiments and correlation (e.g. Pearson). Less correlated models give a better result. It means that we should never judge our models by accuracy only. We should also consider correlation (how different a given model is from the others). Sometimes an inaccurate but very different model can add substantial value to the resulting ensemble.
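Correlation between models is easy to check once you have OOF predictions. The sketch below uses simulated predictions (an assumption for illustration): models 1 and 2 make independent errors, while model 3 is almost a copy of model 1 and therefore adds little diversity.

```python
import numpy as np

rng = np.random.RandomState(0)
y = rng.normal(size=500)

# Simulated OOF predictions, one column per model
pred_1 = y + rng.normal(scale=0.3, size=500)
pred_2 = y + rng.normal(scale=0.3, size=500)
pred_3 = pred_1 + rng.normal(scale=0.01, size=500)  # near-duplicate of model 1
S_train = np.column_stack([pred_1, pred_2, pred_3])

# Pearson correlation matrix between columns (models)
corr = np.corrcoef(S_train, rowvar=False)
print(np.round(corr, 3))
```

With real models you would pass the S_train returned by stacking and look for pairs with noticeably lower correlation than the rest.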

19. I am trying hard but still can't beat my best single model with stacking. What is wrong?

Nothing is wrong. Stacking is an advanced and complicated technique. It's hard to make it work. Solution: make sure to try a weighted (rank) average first instead of an additional level with some advanced models. An average is much easier to apply and in most cases it will outperform your best model. If still no luck, then your models are probably highly correlated.

20. What should I choose: functional API (stacking function) or Scikit-learn API (StackingTransformer)?

Quick guide:

  • By default start from StackingTransformer with familiar scikit-learn interface and logic
  • If you need low RAM consumption try stacking function but remember that it does not store models and does not have scikit-learn capabilities

Stacking API comparison:

| Property | stacking function | StackingTransformer |
|---|---|---|
| Execution time | Same | Same |
| RAM | Consumes the smallest possible amount of RAM. Does not store models. At any point in time only one model is alive. Logic: train model -> predict -> delete -> etc. When execution ends all RAM is released. | Consumes much more RAM. It stores all models built in each fold. This price is paid for standard scikit-learn capabilities like Pipeline and FeatureUnion. |
| Access to models after training | No | Yes |
| Compatibility with Pipeline and FeatureUnion | No | Yes |
| Estimator implementation restrictions | Must have only fit and predict (predict_proba) methods | Must be fully scikit-learn compatible |
| NaN and inf in input data | Allowed | Not allowed |
| Can automatically save OOF and log in files | Yes | No |
| Input dimensionality (X_train, X_test) | Arbitrary | 2-D |

21. How do parameters of stacking function and StackingTransformer correspond?

| stacking function | StackingTransformer |
|---|---|
| models=[Ridge()] | estimators=[('ridge', Ridge())] |
| mode='oof_pred_bag' (alias 'A') | variant='A' |
| mode='oof_pred' (alias 'B') | variant='B' |

22. Why Scikit-learn API was implemented as transformer and not predictor?

  • By nature stacking procedure is predictor, but by application it is definitely transformer.
  • Transformer architecture was chosen because first of all user needs direct access to OOF. I.e. the ability to compute correlations, weighted average, etc.
  • If you need predictor based on StackingTransformer you can easily create it via Pipeline by adding on the top of StackingTransformer some regressor or classifier.
  • Transformer makes it easy to create any number of stacking levels. Using Pipeline we can easily create multilevel stacking by just adding several StackingTransformer's on top of each other and then some final regressor or classifier.

23. How to estimate stacking training time and number of models which will be built?

Note: Stacking usually takes a long time. This is expected (inevitable) behavior.

We can compute total number of models which will be built during stacking procedure using following formulas:

  • Variant A: n_models_total = n_estimators * n_folds
  • Variant B: n_models_total = n_estimators * n_folds + n_estimators

Let's look at an example. Say we define our stacking procedure as follows:

estimators_L1 = [('lr', LinearRegression()),
                 ('ridge', Ridge())]
stack = StackingTransformer(estimators_L1, n_folds=4)

So we have two 1st level estimators and 4 folds. It means stacking procedure will build the following number of models:

  • Variant A: 8 models total. Each model is trained on 3/4 of X_train.
  • Variant B: 10 models total. 8 models are trained on 3/4 of X_train and 2 models on full X_train.

Compute time:

  • If estimators have relatively similar training time, we can roughly compute total training time as: time_total = n_models_total * time_of_one_model
  • If estimators have different training time, we should compute number of models and time for each estimator separately (set n_estimators=1 in formulas above) and then sum up times.
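The formulas above are simple enough to wrap in two helpers. The function names and the per-model times below are hypothetical and exist only for this sketch; the time estimate is rough because it treats a fit on the full train set as taking the same time as a fit inside a fold.

```python
def n_models_total(n_estimators, n_folds, variant='A'):
    """Number of models built during one stacking run."""
    if variant == 'A':
        return n_estimators * n_folds
    if variant == 'B':
        # One extra fit on the full train set per estimator
        return n_estimators * n_folds + n_estimators
    raise ValueError("variant must be 'A' or 'B'")


def estimate_time(times_per_model, n_folds, variant='A'):
    """Rough total time when estimators have different training times.

    times_per_model: time of a single fit for each estimator (any unit).
    """
    return sum(n_models_total(1, n_folds, variant) * t
               for t in times_per_model)


# Two estimators and 4 folds, as in the example above
print(n_models_total(2, 4, 'A'))            # 8
print(n_models_total(2, 4, 'B'))            # 10
# Hypothetical single-fit times: 1 and 3 minutes
print(estimate_time([1.0, 3.0], 4, 'B'))    # 20.0
```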

24. Which stacking variant should I use: 'A' ('oof_pred_bag') or 'B' ('oof_pred')?

You can find out only by experiment. Default choice is variant A, because it takes less time and there should be no significant difference in result. But of course you may also try variant B. For more details see stacking tutorial.

25. How to choose number of folds?

Note: Remember that a higher number of folds substantially increases training time (and RAM consumption for StackingTransformer). See Q23.

  • Standard approach: 4 or 5 folds.
  • If data is big: 3 folds.
  • If data is small: you can try more folds like 10 or so.

26. When I transform train set I see 'Train set was detected'. What does it mean?

Note 1: It is NOT allowed to change train set between calls to fit and transform methods. Due to stacking nature transformation is different for train set and any other set. If train set is changed after training, stacking procedure will not be able to correctly identify it and transformation will be wrong.

Note 2: To be correctly detected train set does not necessarily have to be identical (exactly the same). It must have the same shape and all values must be close (np.isclose is used for checking). So if you somehow regenerate your train set you should not worry about numerical precision.

If you transform X_train and see 'Train set was detected' everything is OK. If you transform X_train but you don't see this message then something went wrong. Probably your train set was changed (it is not allowed). In this case you have to retrain StackingTransformer. For more details see stacking tutorial or Q8.
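The rule from Note 2 (same shape, all values close) can be illustrated with a few lines of NumPy. This is a simplified illustration of the idea only, not vecstack's actual implementation, and the helper name is hypothetical.

```python
import numpy as np


def looks_like_train_set(X, X_train_reference):
    """Return True if X has the same shape as the reference and all values are close."""
    X = np.asarray(X)
    ref = np.asarray(X_train_reference)
    return X.shape == ref.shape and bool(np.isclose(X, ref).all())


X_train = np.arange(6, dtype=float).reshape(3, 2)

# Regenerated train set with tiny numerical differences: still detected
print(looks_like_train_set(X_train + 1e-12, X_train))

# A changed value or a different shape: not detected as the train set
X_changed = X_train.copy()
X_changed[0, 0] = 100.0
print(looks_like_train_set(X_changed, X_train))
print(looks_like_train_set(X_train[:2], X_train))
```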

27. How is the very first stacking level called: L0 or L1? Where does counting start?

Common convention: The very first bunch of models which are trained on initial raw data are called L1. On top of L1 we have so called stacker level or meta level or L2 i.e. models which are trained on predictions of L1 models. Count continues in the same fashion up to arbitrary number of levels.

I use this convention in my code and docs. But of course your Kaggle teammates may use some other naming approach, so you should clarify this for your specific case.

28. Can I use (Randomized)GridSearchCV to tune the whole stacking Pipeline?

Yes, technically you can, but it is not recommended because this approach will lead to redundant computations. General practical advice is to tune each estimator separately and then use tuned estimators on the 1st level of stacking. Higher level estimators should be tuned in the same fashion using OOF from previous level. For manual tuning you can use stacking function or StackingTransformer with a single 1st level estimator.

29. How to define custom metric, especially AUC?

from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OneHotEncoder

def auc(y_true, y_pred):
    """ROC AUC metric for both binary and multiclass classification.
    
    Parameters
    ----------
    y_true : 1d numpy array
        True class labels
    y_pred : 2d numpy array
        Predicted probabilities for each class
    """
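    # OneHotEncoder returns a sparse matrix by default, so convert it to a dense array.
    # (The `sparse=False` argument was deprecated in scikit-learn 1.2 and later removed.)
    ohe = OneHotEncoder()
    y_true = ohe.fit_transform(y_true.reshape(-1, 1)).toarray()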
    auc_score = roc_auc_score(y_true, y_pred)
    return auc_score

30. Do folds (splits) have to be the same across estimators and stacking levels? How does random_state work?

To ensure a better result, folds (splits) have to be the same across all estimators and all stacking levels. It means that random_state has to be the same in every call to the stacking function or StackingTransformer. This is the default behavior of the stacking function and StackingTransformer (by default random_state=0). If you want to try different folds (splits), set different random_state values.

Stacking concept

  1. We want to predict train set and test set with some 1st level model(s), and then use these predictions as features for 2nd level model(s).
  2. Any model can be used as 1st level model or 2nd level model.
  3. To avoid overfitting (for train set) we use cross-validation technique and in each fold we predict out-of-fold (OOF) part of train set.
  4. The common practice is to use from 3 to 10 folds.
  5. Predict test set:
    • Variant A: In each fold we predict test set, so after completion of all folds we need to find mean (mode) of all temporary test set predictions made in each fold.
    • Variant B: We do not predict test set during cross-validation cycle. After completion of all folds we perform additional step: fit model on full train set and predict test set once. This approach takes more time because we need to perform one additional fitting.
  6. As an example we look at stacking implemented with single 1st level model and 3-fold cross-validation.
  7. Pictures:
    • Variant A: Three pictures describe three folds of cross-validation. After completion of all three folds we get single train feature and single test feature to use with 2nd level model.
    • Variant B: First three pictures describe three folds of cross-validation (like in Variant A) to get single train feature and fourth picture describes additional step to get single test feature.
  8. We can repeat this cycle using other 1st level models to get more features for 2nd level model.
  9. You can also look at animation of Variant A and Variant B.
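The steps above can be sketched from scratch for a single 1st level model and 3 folds. This is a minimal illustration with plain scikit-learn building blocks, not vecstack's internal code; the Ridge model and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

# Synthetic data (assumption for this sketch)
X, y = make_regression(n_samples=120, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = Ridge()
n_folds = 3
kf = KFold(n_splits=n_folds)

S_train = np.zeros(len(X_train))                  # OOF feature for the train set
S_test_folds = np.zeros((len(X_test), n_folds))   # temporary test predictions

for fold, (fit_idx, oof_idx) in enumerate(kf.split(X_train)):
    # Fit a fresh copy of the model on the in-fold part
    fold_model = clone(model).fit(X_train[fit_idx], y_train[fit_idx])
    # Predict the out-of-fold part of the train set
    S_train[oof_idx] = fold_model.predict(X_train[oof_idx])
    # Variant A: also predict the test set in each fold
    S_test_folds[:, fold] = fold_model.predict(X_test)

# Variant A: mean of the test predictions made in each fold
S_test_A = S_test_folds.mean(axis=1)

# Variant B: one additional fit on the full train set, predict test set once
S_test_B = clone(model).fit(X_train, y_train).predict(X_test)
```

For classification with class labels, variant A would take the mode instead of the mean, as noted in step 5.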

Variant A

Fold 1 of 3


Fold 2 of 3


Fold 3 of 3

Variant A. Animation

Variant B

Step 1 of 4


Step 2 of 4


Step 3 of 4


Step 4 of 4

Variant B. Animation

Comments
  • How to save model

    when i trained a stacking regression model that has two levels , how can i save model to predict new data like RandomForest which i can use joblib to save a model to predict new data? can i save 1st model and 2 nd model Respectively ?

    opened by zhaobin19941008 14
  • Would it be possible to use Vecstack with a Neural Network?

    Hi,

    I used Vecstack to perform a regression with 12 regressors and get a pretty good prediction, after performing an exhausting tuning of each of the 12 estimators. However, I reached a point that adding a 13th estimator starts to denigrate the score (might be over fitting at this point).

    I was able to run a kerras neural network on the same data, but it is not performing very well and my predictions are not very accurate.

    So, I was wondering, if I could now add a kerras neural network into the mix to see if I can increase the accuracy of the predictions for a Housing Pricing dataset from Kaggle. If that is possible, how would I go about it?

    opened by webzest 8
  • Question about usage...

    I am trying to predict Housing prices, where I have a train data set and a test data set. the train data has a label and I need to train on it to later use this trained model to predict the label for the test data, which do not have a label. Aso, I followed your process on my train data set and performed the stacking, and applied the second level to the S_train and S_test variables as indicated in your instructions.
    Now that i have done that, how do I proceed to predict the label on the test (unknown) dataset?

    opened by webzest 7
  • Allow user to pass custom folds (GroupKFold)

    As far as I can tell, in the current implementation, you can only pass in the number of folds. What if the user wants to pass in a custom folds object (e.g. sklearn.model_selection.GroupKFold)?

    If this is of interest, I can submit a pull request.

    opened by MarcoGorelli 5
  • Error in `python': free(): invalid next size (normal)

    Using any model except GaussianNB causes an error in stacking():

    task: [classification]
    n_classes: [2]
    metric: [log_loss]
    mode: [oof_pred_bag]
    n_models: [1]

    model 0: [LogisticRegression]
    /opt/conda/lib/python3.6/site-packages/sklearn/linear_model/base.py:297: RuntimeWarning: overflow encountered in exp
      np.exp(prob, prob)
        ----
        MEAN: [0.56676799] + [0.01295934]
        FULL: [0.56677227]

    *** Error in `python': free(): invalid next size (normal): 0x0000564aaa718ea0 ***

    How to debug it to find the reason of error?

    opened by lukyanenkomax 5
  • Ability to use different features in each model.

    I have a model whose most predictive features are the most noisy. To compensate, I train 1 model on those features, and a separate model on all the other features. By combining these models, I can quickly and easily prevent strange outlier predictions.

    Simple stacking / voting is okay, but I imagine the model would generalize better were I to implement vecstack instead. Is there any feasible way we could add different X (column-wise) per model to vecstack? I.e. multiple X that are the same length, but have different widths.

    Thank you for your time! -Nathan

    opened by nathanwalker-sp 4
  • Using different data transformations and fit parameters for different models

    Hi Igor,

    Congratulations for your package. I've been searching for a stacking package and this nails it (both for simplicity and efectiveness). Thanks for your contribution

    Is there any possibility to stack already trained models with your package? There are 2 reasons for this:

      • People might want to set fit arguments to the models (currently not available as the stacking function will actually train the models)
      • We might want to use different data scaling and preprocessing techniques for different algorithms (label encoding for tree-based methods and one hot for linear)

    For example, H2O stacking allows users to stack already trained models: http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html

    I would love to contribute to your package but unfortunately my technical level would be too dangerous for your code :P

    opened by davidolmo 4
  • Using the functional API for training only

    There doesn't seem to be a way to use the functional API just for training a model - since X_test= is a required argument. However, if I've already tested my 2nd level model, I think I should be able to train a model on the full data set.

    To be clear, I would like to be able to just do the following:

    from vecstack import stacking
    
    # Get your data
    
    # Initialize 1st level estimators
    models = [LinearRegression(),
              Ridge(random_state=0)]
    
    # Get your stacked features in a single line
    S_train = stacking(models, X_train, y_train, regression=True, verbose=2)
    
    # Use 2nd level estimator with stacked features
    

    Am I missing something?

    opened by epetrovski 3
  • Memory error in footprint for sparce matrix

    X is <239761x68891 sparse matrix of type '<class 'numpy.float64'>' with 8726453 stored elements in Compressed Sparse Row format>

    Specifically, choice function is crashing, because n==16517375051 Error is:

    ~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
        863             # np.random.seed(0) # for development
    --> 864             ids = np.random.choice(n, min(n_items, n), replace=False)
        865 
    
    mtrand.pyx in mtrand.RandomState.choice()
    
    mtrand.pyx in mtrand.RandomState.permutation()
    
    MemoryError: 
    
    During handling of the above exception, another exception occurred:
    
    ValueError                                Traceback (most recent call last)
    <ipython-input-24-4be4f86278e7> in <module>()
          8                             verbose=2)             
          9 t = targets[0]
    ---> 10 stack = stack.fit(X, y[t])
    
    ~/.local/lib/python3.5/site-packages/vecstack/coresk.py in fit(self, X, y, sample_weight)
        393             self.n_classes_ = None
        394         self.n_estimators_ = len(self.estimators_)
    --> 395         self.train_footprint_ = self._get_footprint(X)
        396 
        397         # ---------------------------------------------------------------------
    
    ~/.local/lib/python3.5/site-packages/vecstack/coresk.py in _get_footprint(self, X, n_items)
        872 
        873         except Exception:
    --> 874             raise ValueError('Internal error. '
        875                              'Please save traceback and inform developers.')
        876 
    
    ValueError: Internal error. Please save traceback and inform developers.
    opened by dremovd 3
  • Support for custom Cross Validation strategies

    The package looks amazing, but from what I saw, one can not pass a cross-validation sklearn object, only the number of folds, and enable/disable shuffling and stratification. This is an issue when trying to work with time series data, and using TimeSeriesSplit from sklearn. Would you consider adding maybe another toggle, like time_series={True, False} or even changing the API a bit, and instead of passing the number of folds and shuffle and stratified to have only one argument, like cv and pass a separate object from sklearn in there?

    opened by AlexandruBurlacu 3
  • How to combine early stopping?

    Thanks for your contribution. I was looking for a great api for stacking then found your good package .

    I am wondering that is it possible to combine the early_stopping in lightgbm or EarlyStopping in keras with VECSTACK (because I don't know how to do it) ?

    opened by ZeroAlcoholic 3
Releases(v0.4.0)
Owner
Igor Ivanov
Deep Learning Engineer, MSc, Kaggle Grandmaster