scikit-learn cross validators for iterative stratification of multilabel data

Overview


iterative-stratification

iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilabel data.

Presently, scikit-learn provides several cross validators with stratification. However, these cross validators do not support multilabel data. The iterative-stratification project provides implementations of MultilabelStratifiedKFold, RepeatedMultilabelStratifiedKFold, and MultilabelStratifiedShuffleSplit, based on the algorithm for stratifying multilabel data described in the following paper:

Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.

Requirements

iterative-stratification has been tested under Python 3.4 through 3.8 with the following dependencies:

  • scipy (>=0.13.3)
  • numpy (>=1.8.2)
  • scikit-learn (>=0.19.0)

Installation

iterative-stratification is currently available on the PyPI repository and can be installed via pip:

pip install iterative-stratification


The package is also installable from the Anaconda Cloud platform:

conda install -c trent-b iterative-stratification

Toy Examples

The multilabel cross validators that this package provides can be used with the scikit-learn API in the same manner as any other cross validator. For example, they may be passed to cross_val_score or cross_val_predict, as in the sketch below; after it are some toy examples of direct use of the multilabel cross validators.
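
The snippet below is a minimal sketch (not one of the original toy examples) of passing one of these cross validators to cross_val_score. RandomForestClassifier is only an assumption here, chosen because it accepts multilabel targets; any estimator that supports multilabel y would do.

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)
clf = RandomForestClassifier(random_state=0)

# Each entry is the subset accuracy on one stratified held-out fold.
scores = cross_val_score(clf, X, y, cv=mskf)
print(scores)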

MultilabelStratifiedKFold

from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]

RepeatedMultilabelStratifiedKFold

from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)

for train_index, test_index in rmskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [0 1 4 5] TEST: [2 3 6 7]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]

MultilabelStratifiedShuffleSplit

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

for train_index, test_index in msss.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]

Output:

TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]
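
MultilabelStratifiedShuffleSplit with n_splits=1 can also serve as a single stratified train/test split for multilabel data. A minimal sketch (not one of the original toy examples), reusing the same X and y:

from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)

# next() pulls the single (train, test) index pair from the split generator.
train_index, test_index = next(msss.split(X, y))
X_train, X_test = X[train_index], X[test_index]
y_train, y_test = y[train_index], y[test_index]
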
Comments
  • Adjusting test_size doesn't actually change test_size


    Hello! I'm trying to use this code for a project; however, I don't want my test size to be 0.5. When I try to adjust it, nothing changes:

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
    msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
    
    for train_index, test_index in msss.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        print(len(train_index))
        print(len(test_index))
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
    

    outputs:

    ('TRAIN:', array([1, 2, 4, 7]), 'TEST:', array([0, 3, 5, 6]))
    4
    4
    ('TRAIN:', array([2, 3, 6, 7]), 'TEST:', array([0, 1, 4, 5]))
    4
    4
    ('TRAIN:', array([0, 2, 4, 6]), 'TEST:', array([1, 3, 5, 7]))
    4
    4
    

    Kudos on putting this out there!

    opened by tyler-lanigan-hs 9
  • [MOD] Bug Fix for sklearn 1.0~


    scikit-learn has been updated to 1.0.0. As a result, some functions no longer work properly and raise errors like the one below:

    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given.
    

    To fix this problem, I added * to the __init__ parameters to make them keyword-only, per PEP 3102 (https://www.python.org/dev/peps/pep-3102/).

    opened by CryptoSalamander 4
  • Incompatibility with scikit-learn 1.0 in latest release


    As of scikit-learn 1.0, the deprecation warning fixed in 0a108bc2062fd32f98c9a6305508ea213292ba08 has become a hard error. Could a new release be pushed to PyPI in order to remain compatible with the latest scikit-learn?

    For other users experiencing this issue (it will look something like

    , in __init__
        super(MultilabelStratifiedShuffleSplit, self).__init__(
    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given
    

    ^this) the workaround is to use the latest master of this package.

    opened by lunik1 4
  • Error using MultilabelStratifiedKFold


    Hi Trent! First, thanks for this repository; it has helped me a lot.

    I have a question. I have been using MultilabelStratifiedKFold in a machine learning model, but since last week it has been giving me an error. I haven't changed anything in it, so I don't know what could be happening.

    The error I'm having is in this line of code:

    mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)

    And the error that it throws is this:

    Input In [13], in <cell line: 6>()
          3 oof_preds["fold_idx"] = -1
          4 oof_preds["oof_pred"] = -1
    ----> 6 mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)
          7 mskf_split = mskf.split(dataset, dataset[["rvm_tipo_enc","rvm_marca_enc","rvm_antiguedad","converted"]])
          9 for fold,(train_idx,valid_idx) in enumerate(mskf_split):
    
    File ~\Anaconda3\envs\JARVIS\lib\site-packages\iterstrat\ml_stratifiers.py:157, in MultilabelStratifiedKFold.__init__(self, n_splits, shuffle, random_state)
        156 def __init__(self, n_splits=3, shuffle=False, random_state=None):
    --> 157     super(MultilabelStratifiedKFold, self).__init__(n_splits, shuffle, random_state)
    
    TypeError: __init__() takes 2 positional arguments but 4 were given
    
    
    
    What could be happening here? Thanks a lot!
    opened by robertogarces 3
  • Ability to set custom fold proportions for MultilabelStratifiedKFold (pass "r" to IterativeStratification)

    For us it's useful to be able to set custom fold proportions when using MultilabelStratifiedKFold (essentially passing a custom r to IterativeStratification). It's easy enough to extend outside of the lib (only _make_test_folds needs to be copied), but I wonder whether such a feature could be useful in the library itself. What do you think?

    And thanks for a great library!

    opened by lopuhin 3
  • Balanced sample with low number of one of the classes


    I'm working with an extremely large multilabel problem in which some classes are rare. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example shows the problem:

    >>> import numpy as np
    >>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    >>> X = np.arange(10)
    >>> X
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
    >>> y
    array([[1, 1, 0],
           [0, 1, 0],
           [1, 0, 0],
           [1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 1, 0],
           [1, 1, 0],
           [0, 1, 1],
           [1, 0, 1]])
    >>> temp = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
    >>> train, test = list(temp.split(X, y))[0]
    >>> train
    array([1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> test
    array([0])
    

    The train set contains both samples 8 and 9, which are the only ones that have the class with index 2. How can I make sure that all splits have at least one sample per class?

    opened by miguelwon 3
  • Getting started help


    Hello and thank you for this project.

    I am new to machine learning and am having a bit of trouble getting started with this.

    If I understood correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to obtain evenly distributed splits.

    To test this I used one of the toy examples and changed it a little, so that I have an uneven distribution over 3 classes.

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    from matplotlib import pyplot as plt
    
    
    AMOUNT_OF_CLASSES = 3
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])
    

    If I take a look at the distribution at the beginning it will look like the following:

    dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for i in range(0,AMOUNT_OF_CLASSES):
        dis[i] = y[:,i].sum()
    
    # Show original distribution
    plt.figure(0)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
    

    [image: bar chart of the original label counts per class]

    If I now do the stratification like this:

    # now do the stratification
    msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
    
    cnt = 1
    # distribution over all iterations
    all_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for train_index, test_index in msss.split(X, y):
        iter_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        
        
        for i in range(0,AMOUNT_OF_CLASSES):
            iter_dis[i] = y_train[:,i].sum()
            
        all_dis += iter_dis
        # Show new distribution (for the latest one at first)
        plt.figure(cnt)
        plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],iter_dis)
        
        
        
        cnt += 1
    

    and look at the distribution at the end:

    
    plt.figure(cnt+1)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],all_dis)    
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
    plt.title("Distribution after Stratification")
    plt.legend(['Distribution after stratification','original distribution'])
    

    I will get the following:

    [image: bar chart comparing the distribution after stratification with the original distribution]

    So it still looks like I do not have an even distribution among the classes.

    Is this not what this is used for? How could I achieve an even distribution of every class over the data? Thank you very much.

    opened by kevinkit 3
  • Possibility to do stratification with multi-output multi-class (multi-target) data


    Hi, I have a multi-output multi-class (multi-target) dataset and would like to do data stratification before applying a learning algorithm. I am currently using iterative_train_test_split from the skmultilearn library:

    from skmultilearn.model_selection import iterative_train_test_split
    x_train, y_train, x_test, y_test = iterative_train_test_split(x, y, test_size=0.1)

    Is something similar possible with this package for multi-target data?

    Thank you.
    opened by bundit786 2
  • Do we need X for `split`


    Forgive me if this is a dumb question, but if I understand this library correctly, the main aim is to look at correlations between the y's and somehow accommodate for that when stratifying. Is there any reason why we need the X variable when splitting? E.g. to get the indices we always have to do: train_index, test_index = next(iter(msss.split(X, y))).

    Thanks in advance.
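
    A minimal sketch of a possible workaround, assuming split() only consults X for the number of samples (the zeros placeholder is an assumption, not an official API; another issue in this thread passes np.ones_like(labels) the same way):

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

    y = np.array([[0,0], [0,1], [1,1], [1,0], [0,1], [1,0]])
    X_placeholder = np.zeros((len(y), 1))  # only its length matters here

    msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=0)
    train_index, test_index = next(msss.split(X_placeholder, y))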

    opened by sachinruk 2
  • Is it possible to calculate the total number of all possible splits?


    Love this repo; it spares me a lot of effort.

    Here is my question (or concern).

    When we don't enforce any constraint when generating KFold, the number of all possible splits is the largest and simple to calculate.

    When we only have one label and enforce the splits to be stratified, i.e. StratifiedKFold, this number drops, but normally will still be large enough to generate a diverse set of splits. Again, this number can be calculated with some simple combinatorics.

    However, when stratification on multiple labels is enforced (the goal of this repo), things become more complicated, and I am worried that if there are too many labels, say hundreds of them, there won't be many possible splits that can satisfy the stratification constraint😟.

    So my question is,

    • Does my concern make sense?
    • Can we calculate the total number of possibilities?

    Looking forward to your reply.

    opened by whatever60 2
  • Different percentage of samples for each label after using MultilabelStratifiedKFold


    Hi trent-b:

    Thanks for this nice repository, hope you can reply these questions below:

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedKFold

    def multi2single_labels(y):
        d = {}
        for yy in y:
            d[str(yy)] = d.get(str(yy), 0) + 1
        return d
    yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+\
                  [[1,0,0,1]]*81+[[0,1,0,0]]*151+[[0,1,1,0]]*33+[[0,0,1,1]]*27+\
                  [[0,0,0,1]]*54+[[0,1,1,1]]*21+[[1,1,0,0]]*11+[[1,1,0,1]]*7+[[1,0,1,0]]*2)
    xx = np.zeros((yy.shape[0],))
    kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
    for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
        print(f'Now in {idx_fold}th fold')
        y_valid = yy[idx_valid]
        d_y = multi2single_labels(y_valid)
        print(f'labels of y: {d_y}')
    

    Using the code above (the simplest 2-fold case) gives this result:

    Now in 0th fold
    labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
    Now in 1th fold
    labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}

    Q1: Why do both '[1 0 1 0]' samples end up in fold 1, instead of one in each fold?
    Q2: Why do the counts of some labels differ so much between the folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?

    Thanks!

    opened by Lance0218 2
  • Unable to create a small sample of 1000 train and 100 using MultilabelStratifiedShuffleSplit


    Hi trent-b:

    Thanks for this repository; hope you can help with my issue. I have a large JSON dataset from which I want to create a smaller sample set using MultilabelStratifiedShuffleSplit.

    import warnings

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

    def mlb_train_test_split(labels, test_size, train_size, random_state=0):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FutureWarning)
            msss = MultilabelStratifiedShuffleSplit(
                test_size=test_size, train_size=train_size, random_state=random_state
            )
        train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
        return train_idx, test_idx
    

    I then call the function as:

    train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

    When I look at the numbers, I'm seeing way more than 200 rows. Is there a limitation? The labels array has approximately 500,000 entries.

    opened by meltedhead 1
Releases: 0.1.7