scikit-learn cross validators for iterative stratification of multilabel data


iterative-stratification is a project that provides scikit-learn compatible cross validators with stratification for multilabel data.

scikit-learn currently provides several cross validators with stratification, but none of them can stratify multilabel data. This iterative-stratification project offers implementations of MultilabelStratifiedKFold, RepeatedMultilabelStratifiedKFold, and MultilabelStratifiedShuffleSplit, built on a base algorithm for stratifying multilabel data described in the following paper:

Sechidis K., Tsoumakas G., Vlahavas I. (2011) On the Stratification of Multi-Label Data. In: Gunopulos D., Hofmann T., Malerba D., Vazirgiannis M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6913. Springer, Berlin, Heidelberg.


iterative-stratification has been tested under Python 3.4 through 3.8 with the following dependencies:

  • scipy(>=0.13.3)
  • numpy(>=1.8.2)
  • scikit-learn(>=0.19.0)


iterative-stratification is currently available on PyPI and can be installed via pip:

pip install iterative-stratification

The package is also installable from the Anaconda Cloud platform:

conda install -c trent-b iterative-stratification

Toy Examples

The multilabel cross validators that this package provides may be used with the scikit-learn API in the same manner as any other cross validators. For example, these cross validators may be passed to cross_val_score or cross_val_predict. Below are some toy examples of the direct use of the multilabel cross validators.


from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

mskf = MultilabelStratifiedKFold(n_splits=2, shuffle=True, random_state=0)

for train_index, test_index in mskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]


TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]


from iterstrat.ml_stratifiers import RepeatedMultilabelStratifiedKFold
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

rmskf = RepeatedMultilabelStratifiedKFold(n_splits=2, n_repeats=2, random_state=0)

for train_index, test_index in rmskf.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]


TRAIN: [0 3 4 6] TEST: [1 2 5 7]
TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [0 1 4 5] TEST: [2 3 6 7]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]


from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
import numpy as np

X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])

msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)

for train_index, test_index in msss.split(X, y):
   print("TRAIN:", train_index, "TEST:", test_index)
   X_train, X_test = X[train_index], X[test_index]
   y_train, y_test = y[train_index], y[test_index]


TRAIN: [1 2 5 7] TEST: [0 3 4 6]
TRAIN: [2 3 6 7] TEST: [0 1 4 5]
TRAIN: [1 2 5 6] TEST: [0 3 4 7]

    Hello! I'm trying to use this code for a project; however, I don't want my test size to be 0.5. When I try to adjust it, the split sizes don't change:

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[0,0], [0,0], [0,1], [0,1], [1,1], [1,1], [1,0], [1,0]])
    msss = MultilabelStratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=42)
    for train_index, test_index in msss.split(X, y):
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]


    ('TRAIN:', array([1, 2, 4, 7]), 'TEST:', array([0, 3, 5, 6]))
    ('TRAIN:', array([2, 3, 6, 7]), 'TEST:', array([0, 1, 4, 5]))
    ('TRAIN:', array([0, 2, 4, 6]), 'TEST:', array([1, 3, 5, 7]))

    Kudos on putting this out there!

    scikit-learn has been updated to 1.0.0. As a result, some functions no longer work properly and raise errors like the one below:

    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given.

    To fix this problem, I added * to the __init__ parameters, as described in PEP 3102.

    As of scikit-learn 1.0 the deprecation warning fixed in 0a108bc2062fd32f98c9a6305508ea213292ba08 has become a hard error. Could a new release be pushed to pypi in order to remain compatible with the latest scikit-learn?

    For other users experiencing this issue, the workaround is to use the latest master of this package. The error will look something like this:

    , in __init__
        super(MultilabelStratifiedShuffleSplit, self).__init__(
    TypeError: __init__() takes from 1 to 2 positional arguments but 5 were given

    Hi Trent! First, thanks for this repository; it has helped me a lot.

    I have a question. I use MultilabelStratifiedKFold for a machine learning model, but since last week it has been giving me an error. I haven't changed anything in my code, so I don't know what could be happening.

    The error I'm having is in this line of code:

    mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)

    And the error that it throws is this:

    Input In [13], in <cell line: 6>()
          3 oof_preds["fold_idx"] = -1
          4 oof_preds["oof_pred"] = -1
    ----> 6 mskf = MultilabelStratifiedKFold(n_splits=3, shuffle=True, random_state=42)
          7 mskf_split = mskf.split(dataset, dataset[["rvm_tipo_enc","rvm_marca_enc","rvm_antiguedad","converted"]])
          9 for fold,(train_idx,valid_idx) in enumerate(mskf_split):
    File ~\Anaconda3\envs\JARVIS\lib\site-packages\iterstrat\, in MultilabelStratifiedKFold.__init__(self, n_splits, shuffle, random_state)
        156 def __init__(self, n_splits=3, shuffle=False, random_state=None):
    --> 157     super(MultilabelStratifiedKFold, self).__init__(n_splits, shuffle, random_state)
    TypeError: __init__() takes 2 positional arguments but 4 were given

    What could be happening here? Thanks a lot!
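
The `TypeError` in the posts above is what Python raises when a subclass forwards arguments positionally to a parent `__init__` whose options are keyword-only (the bare `*` of PEP 3102). Below is a minimal sketch of that mechanism. `Base`, `PositionalChild`, `KeywordChild`, and `construction_error` are hypothetical stand-ins for illustration, not the actual scikit-learn or iterstrat classes.

```python
class Base:
    """Hypothetical parent whose options after `*` are keyword-only."""

    def __init__(self, n_splits=5, *, shuffle=False, random_state=None):
        self.n_splits = n_splits
        self.shuffle = shuffle
        self.random_state = random_state


class PositionalChild(Base):
    """Forwards everything positionally, which the bare `*` forbids."""

    def __init__(self, n_splits=3, shuffle=False, random_state=None):
        super().__init__(n_splits, shuffle, random_state)


class KeywordChild(Base):
    """Forwards the keyword-only options by name."""

    def __init__(self, n_splits=3, *, shuffle=False, random_state=None):
        super().__init__(n_splits, shuffle=shuffle, random_state=random_state)


def construction_error(cls, **kwargs):
    """Return the TypeError raised while building cls, or None on success."""
    try:
        cls(**kwargs)
    except TypeError as exc:
        return exc
    return None


# The first call reports a "positional arguments" TypeError; the second succeeds.
print(construction_error(PositionalChild, n_splits=3, shuffle=True, random_state=42))
print(construction_error(KeywordChild, n_splits=3, shuffle=True, random_state=42))
```

If this is indeed the cause, the fix belongs in the subclass (forwarding by keyword), which matches the workaround of installing a version of the package that already does so.
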
    For us it's useful to be able to set custom fold proportions when using MultilabelStratifiedKFold (essentially passing custom r to IterativeStratification). It's easy enough to extend outside of the lib (only _make_test_folds needs to be copied), but I wonder if such a feature could be useful in the library itself, what do you think?

    And thanks for a great library!

    I'm working with an extremely large multilabel problem that has some rare classes. I was trying to use your package to balance my train/test split and noticed that it does not guarantee at least one sample of each class in each set. The following example shows the problem:

    >>> import numpy as np
    >>> from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    >>> X = np.arange(10)
    >>> X
    array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> y = np.array([[1,1,0],[0,1,0],[1,0,0],[1,0,0],[0,1,0],[0,1,0],[0,1,0],[1,1,0],[0,1,1],[1,0,1]])
    >>> y
    array([[1, 1, 0],
           [0, 1, 0],
           [1, 0, 0],
           [1, 0, 0],
           [0, 1, 0],
           [0, 1, 0],
           [0, 1, 0],
           [1, 1, 0],
           [0, 1, 1],
           [1, 0, 1]])
    >>> temp = MultilabelStratifiedShuffleSplit(n_splits = 1,test_size =.2,random_state = 0)
    >>> train, test  = list(temp.split(X, y))[0]
    >>> train
    array([1, 2, 3, 4, 5, 6, 7, 8, 9])
    >>> test

    The train set contains both samples 8 and 9, which are the only ones that have the class with index 2. How can I make sure that all splits have at least one sample per class?
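
One way to see the problem described above is to list which label columns have no positive sample on each side of a split. The sketch below uses only NumPy. `labels_missing` is a hypothetical helper, and the test indices are taken to be the complement of the printed train indices, which is an assumption because the `test` output is missing from the post.

```python
import numpy as np


def labels_missing(y, indices):
    """Return the label columns with no positive sample among y[indices]."""
    counts = y[indices].sum(axis=0)
    return np.flatnonzero(counts == 0)


y = np.array([[1, 1, 0], [0, 1, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0],
              [0, 1, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1], [1, 0, 1]])

train = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])   # as printed in the post
test = np.setdiff1d(np.arange(len(y)), train)   # assumed: complement of train

print("missing from train:", labels_missing(y, train))
print("missing from test:", labels_missing(y, test))
```

A check like this could be used to flag a split and re-draw it with a different `random_state`, although with only two positive samples for a label there is little room to satisfy every split.
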

    Hello and thank you for this project.

    I am new to machine learning and have a little bit of trouble getting started with this.

    If I understand correctly, this method is used when I have an unevenly distributed multilabel dataset, in order to get an evenly distributed one.

    To test this I used one of the toy examples and changed it a little, so that I have an uneven distribution over 3 classes.

    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit
    import numpy as np
    from matplotlib import pyplot as plt
    X = np.array([[1,2], [3,4], [1,2], [3,4], [1,2], [3,4], [1,2], [3,4]])
    y = np.array([[1,0,1], [1,1,0], [1,0,1], [0,0,1], [1,1,0], [0,0,1], [1,0,0], [1,0,0]])

    If I take a look at the distribution at the beginning it will look like the following:

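    AMOUNT_OF_CLASSES = 3  # number of label columns in y
    dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for i in range(0,AMOUNT_OF_CLASSES):
        dis[i] = y[:,i].sum()
    # Show original distribution
    plt.figure(0)
    # plt.bar assumed; the plotting call is garbled in the original post
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)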


    If I now do the stratification like this:

    # now go for stratification
    msss = MultilabelStratifiedShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
    cnt = 1
    # distribution over all iterations
    all_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
    for train_index, test_index in msss.split(X, y):
        iter_dis = np.zeros(shape=(AMOUNT_OF_CLASSES,))
        print("TRAIN:", train_index, "TEST:", test_index)
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        for i in range(0,AMOUNT_OF_CLASSES):
            iter_dis[i] = y_train[:,i].sum()
        all_dis += iter_dis
        # Show new distribution (for the latest one at first)
        plt.figure(cnt)
        # plt.bar assumed; the plotting call is garbled in the original post
        plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],iter_dis)
        cnt += 1

    and look at the distribution at the end:

    plt.figure(cnt+1)
    # plt.bar assumed; the plotting calls are garbled in the original post
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],all_dis)
    plt.bar([i for i in range(0,AMOUNT_OF_CLASSES)],dis)
    plt.title("Distribution after Stratification")
    plt.legend(['Distribution after stratification','original distribution'])

    I will get the following:


    So it still looks like I do not have an even distribution among the classes.

    Is this not what this is used for? How could I achieve that every class is evenly distributed over the data? Thank you very much.
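
It may help to separate two ideas: balancing the classes against each other, and keeping each class's share the same in every split. Stratification, as in the paper cited in the README, is about the second. The NumPy-only sketch below compares per-label positive rates in the full data with those in a hand-picked split. The split indices and the `label_rates` helper are illustrative assumptions, not output of the library.

```python
import numpy as np

y = np.array([[1, 0, 1], [1, 1, 0], [1, 0, 1], [0, 0, 1],
              [1, 1, 0], [0, 0, 1], [1, 0, 0], [1, 0, 0]])


def label_rates(labels):
    """Fraction of samples that are positive for each label column."""
    return labels.mean(axis=0)


# Hand-picked indices for illustration only (not produced by iterstrat).
train_index = np.array([0, 1, 3, 6])
test_index = np.array([2, 4, 5, 7])

print("all:  ", label_rates(y))
print("train:", label_rates(y[train_index]))
print("test: ", label_rates(y[test_index]))
```

In this example the three labels stay uneven (0.75, 0.25, 0.5) in every subset. A well-stratified split preserves that unevenness rather than removing it, so plots of label counts after splitting should look like scaled copies of the original distribution.
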

    Hi, I have a multi-output, multi-class (multi-target) dataset and would like to stratify the data before applying a learning algorithm. Using iterative_train_test_split from the skmultilearn library:

    from skmultilearn.model_selection import iterative_train_test_split
    x_train, y_train, x_test, y_test = iterative_train_test_split(x, y, test_size = 0.1)

    Thank you.

    Forgive me if this is a dumb question, but if I understand this library correctly, the main aim is to look at correlations between the y's and somehow accommodate that when stratifying. Is there any reason why we need the X variable when splitting? E.g., to get the indices we always have to do: train_index, test_index = next(iter(msss.split(X, y))).

    Thanks in advance.

    Love this repo, it spares me a lot of effort.

    Here is my question (or concern).

    When we don't enforce any constraint when generating KFold, the number of all possible splits is the largest and simple to calculate.

    When we only have one label and enforce the splits to be stratified, i.e. StratifiedKFold, this number drops, but normally will still be large enough to generate a diverse set of splits. Again, this number can be calculated with some simple combinatorics.

    However, when stratification on multiple labels is enforced (the goal of this repo), things become more complicated, and I am worried that if there are too many labels, say hundreds of them, there won't be many possible splits that satisfy the stratification constraint😟.

    So my question is,

    • Does my concern make sense?
    • Can we calculate the total number of possibilities?

    Looking forward to your reply.
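
The first two counts mentioned above can be written down directly under some simplifying assumptions: folds are unordered, all folds have equal size, and every class size is divisible by the number of folds. `multinomial_equal`, `count_kfold_splits`, and `count_stratified_kfold_splits` are hypothetical helper names. The multilabel case is not covered by this sketch.

```python
from math import factorial


def multinomial_equal(n, k):
    """Ways to deal n items into k labelled groups of n // k items each."""
    assert n % k == 0, "n must be divisible by k"
    return factorial(n) // factorial(n // k) ** k


def count_kfold_splits(n_samples, n_splits):
    """Unordered partitions of the samples into equal-sized folds."""
    return multinomial_equal(n_samples, n_splits) // factorial(n_splits)


def count_stratified_kfold_splits(class_sizes, n_splits):
    """Same count when every fold must hold an equal share of each class."""
    labelled = 1
    for size in class_sizes:
        labelled *= multinomial_equal(size, n_splits)
    return labelled // factorial(n_splits)


print(count_kfold_splits(8, 2))                   # 35
print(count_stratified_kfold_splits([4, 4], 2))   # 18
```

For 8 samples and 2 folds, requiring each fold to hold half of each of two equal classes roughly halves the number of possible splits, which illustrates how each added constraint shrinks the space.
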

    Hi trent-b:

    Thanks for this nice repository. I hope you can answer the questions below:

    def multi2single_labels(y):
        d = {}
        for yy in y:
            d[str(yy)] = d.get(str(yy), 0) + 1
        return d
    yy = np.array([[0,0,0,0]]*318+[[1,0,0,0]]*264+[[0,0,1,0]]*58+[[0,1,0,1]]*51+\
    xx = np.zeros((yy.shape[0],))
    kfold = MultilabelStratifiedKFold(n_splits=2, random_state=42, shuffle=True)
    for idx_fold, (idx_train, idx_valid) in enumerate(kfold.split(xx, yy)):
        print(f'Now in {idx_fold}th fold')
        y_valid = yy[idx_valid]
        d_y = multi2single_labels(y_valid)
        print(f'labels of y: {d_y}')

    Using the code above (the simplest 2-fold case) gives this result:

    Now in 0th fold
    labels of y: {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25, '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15, '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
    Now in 1th fold
    labels of y: {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26, '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12, '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3, '[1 0 1 0]': 2}

    Q1: Why are the two '[1 0 1 0]' samples both in the 1th fold instead of one in each fold?
    Q2: Why do the counts of some labels differ so much between folds (e.g. '[0 0 0 0]', '[1 0 0 0]')?
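
One thing worth checking with the output above is the per-label totals rather than the label-combination totals, since the method in the cited paper, as I understand it, works on individual labels. The sketch below re-uses the counts printed in the post and sums them per label column. `per_label_counts` is a hypothetical helper.

```python
# Combination counts copied from the two folds printed in the post.
fold0 = {'[0 0 0 0]': 155, '[1 0 0 0]': 136, '[0 0 1 0]': 28, '[0 1 0 1]': 25,
         '[1 0 0 1]': 37, '[0 1 0 0]': 76, '[0 1 1 0]': 18, '[0 0 1 1]': 15,
         '[0 0 0 1]': 31, '[0 1 1 1]': 9, '[1 1 0 0]': 5, '[1 1 0 1]': 4}
fold1 = {'[0 0 0 0]': 163, '[1 0 0 0]': 128, '[0 0 1 0]': 30, '[0 1 0 1]': 26,
         '[1 0 0 1]': 44, '[0 1 0 0]': 75, '[0 1 1 0]': 15, '[0 0 1 1]': 12,
         '[0 0 0 1]': 23, '[0 1 1 1]': 12, '[1 1 0 0]': 6, '[1 1 0 1]': 3,
         '[1 0 1 0]': 2}


def per_label_counts(combo_counts, n_labels=4):
    """Sum combination counts into one positive count per label column."""
    totals = [0] * n_labels
    for combo, count in combo_counts.items():
        bits = [int(b) for b in combo.strip('[]').split()]
        for j, bit in enumerate(bits):
            totals[j] += bit * count
    return totals


print(per_label_counts(fold0))  # [182, 137, 70, 121]
print(per_label_counts(fold1))  # [183, 137, 71, 120]
```

Per label, the two folds differ by at most one sample, even though individual combinations such as '[1 0 1 0]' or '[0 0 0 0]' are not split evenly.
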


    Hi trent-b:

    Thanks for this repository. I hope you can help with my issue. I have a large JSON data set, and I want to use MultilabelStratifiedShuffleSplit to create a smaller sample set from it.

    import warnings

    import numpy as np
    from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

    def mlb_train_test_split(labels, test_size, train_size, random_state=0):
        with warnings.catch_warnings():
            warnings.simplefilter("ignore", category=FutureWarning)
            msss = MultilabelStratifiedShuffleSplit(
                test_size=test_size, train_size=train_size, random_state=random_state
            )
        # Take the first split; X is a placeholder with the same shape as labels
        train_idx, test_idx = next(msss.split(np.ones_like(labels), labels))
        return train_idx, test_idx

    I then call the function as:

    train_idx, test_idx = mlb_train_test_split(labels, test_size=1000, train_size=200, random_state=0)

    When I look at the numbers, I'm seeing way more than 200 rows. Is there a limitation? The dataset has approximately 500,000 labels.
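
A possible explanation, stated here as an unverified assumption about the splitter: if every row is assigned to either train or test, and `train_size` and `test_size` only set the ratio between the two sides, then the train side would be far larger than 200 rows. The hypothetical helper below just does that arithmetic; it does not call the library.

```python
def sizes_if_only_ratio_is_used(n_samples, train_size, test_size):
    """Split sizes under the ASSUMPTION that all rows are assigned and
    only the train_size:test_size ratio matters (not confirmed behavior)."""
    train_share = train_size / (train_size + test_size)
    n_train = round(n_samples * train_share)
    return n_train, n_samples - n_train


# Values from the post: about 500,000 rows, train_size=200, test_size=1000.
print(sizes_if_only_ratio_is_used(500_000, 200, 1000))
```

Comparing `len(train_idx)` and `len(test_idx)` against these numbers would show whether this assumption matches what is happening.
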

