I'm new to programming and machine learning, and I'm using code from a journal paper for spam detection. When I run it, it fails with an error even though I believe I prepared the data correctly. The error message is:
`ValueError: Found array with 0 sample(s) (shape=(0, 19)) while a minimum of 1 is required.`
Can anyone help me figure out what is going wrong?
[The link for the complete code is here](https://github.com/ijdutse/spd)
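From the `shape=(0, 19)` part of the message, it looks like the feature matrix handed to `predict()` has 19 columns but zero rows, so somewhere along the way every record gets dropped. As a first check I am planning to run something like this (just a sketch; it assumes `phone.json` is in the working directory and reuses the field names the extractor expects):

```python
import json, codecs

kept = 0
skipped = 0
with codecs.open('phone.json', 'r') as f:
    for line in f:
        try:
            record = json.loads(line)
            # touch the same keys the extractor touches; a KeyError here is
            # what the bare "except: continue" in extractor() would swallow
            _ = (record['lang'], record['TweetID'], record['RawTweets'])
            kept += 1
        except (KeyError, ValueError):
            skipped += 1
print('records kept:', kept, 'records skipped:', skipped)
```

Here is the full script as I am running it: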
```python
#!/usr/bin/env python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import defaultdict, Counter
from datetime import datetime
import preprocessor as p
import random, os, utils, smart_open, json, codecs, pickle, time
import gensim
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.fftpack import fft
data_sources = ['phone.json']
def main():
    spd = Spd(data_sources)  # class instantiation
    start = time.process_time()
    relevant_tweets = spd.detector(data_sources)
    stop = time.process_time()
    return relevant_tweets
class Spd:
    """Some functions to accept raw files, extract relevant fields and filter out irrelevant content."""
    def __init__(self, data_sources):
        self.data_sources = data_sources
    # first function in the class:
    def extractor(self, data_sources):  # receives one file of raw tweets (one JSON object per line) from detector()
        data_extracts = {'TweetID': [], 'ScreenName': [], 'RawTweets': [], 'CreatedAt': [], 'RetweetCount': [],
                         'FollowersCount': [], 'FriendsCount': [], 'StatusesCount': [], 'FavouritesCount': [],
                         'UserName': [], 'Location': [], 'AccountCreated': [], 'Language': [], 'Description': [],
                         'UserURL': [], 'VerifiedAccount': [], 'CleanTweets': [], 'UserID': [], 'TimeZone': [], 'TweetFavouriteCount': []}
        non_english_tweets = 0  # keep track of the non-English tweets
        with codecs.open('phone.json', 'r') as f:  # note: the filename is hard-coded here instead of using the data_sources argument
            for line in f.readlines():
                non_English = 0  # (unused)
                try:
                    line = json.loads(line)
                    if line['lang'] in ['en', 'en-gb', 'en-GB', 'en-AU', 'en-IN', 'en_US']:
                        # every key below must exist in the record; a missing key raises KeyError
                        data_extracts['Language'].append(line['Language'])
                        data_extracts['TweetID'].append(line['TweetID'])
                        data_extracts['RawTweets'].append(line['RawTweets'])
                        data_extracts['CleanTweets'].append(p.clean(line['RawTweets']))
                        data_extracts['CreatedAt'].append(line['CreatedAt'])
                        data_extracts['AccountCreated'].append(line['AccountCreated'])
                        data_extracts['ScreenName'].append(line['ScreenName'])
                        data_extracts['RetweetCount'].append(line['RetweetCount'])
                        data_extracts['FollowersCount'].append(line['FollowersCount'])
                        data_extracts['FriendsCount'].append(line['FriendsCount'])
                        data_extracts['StatusesCount'].append(line['StatusesCount'])
                        data_extracts['FavouritesCount'].append(line['FavouritesCount'])
                        data_extracts['UserName'].append(line['UserName'])
                        data_extracts['Location'].append(line['Location'])
                        data_extracts['Description'].append(line['Description'])
                        data_extracts['UserURL'].append(line['UserURL'])
                        data_extracts['VerifiedAccount'].append(line['VerifiedAccount'])
                        data_extracts['UserID'].append(line['UserID'])
                        data_extracts['TimeZone'].append(line['TimeZone'])
                        data_extracts['TweetFavouriteCount'].append(line['TweetFavouriteCount'])
                    else:
                        non_english_tweets += 1
                except:
                    continue  # note: this bare except silently skips any record that raises (e.g. a KeyError for a missing field above)
        df0 = pd.DataFrame(data_extracts)  # convert data extracts to pandas DataFrame
        df0['CreatedAt'] = pd.to_datetime(data_extracts['CreatedAt'], errors='coerce')  # convert to datetime
        df0['AccountCreated'] = pd.to_datetime(data_extracts['AccountCreated'], errors='coerce')
        df0 = df0.dropna(subset=['AccountCreated', 'CreatedAt'])  # drop rows with unparseable datetimes
        AccountAge = []  # compute the account age of accounts
        date_format = "%Y-%m-%d %H:%M:%S"
        for dr, dc in zip(df0.CreatedAt, df0.AccountCreated):
            #try:
            dr = str(dr)
            dc = str(dc)
            d1 = datetime.strptime(dr, date_format)
            d2 = datetime.strptime(dc, date_format)
            dif = d1 - d2
            AccountAge.append(dif.days)
            #except:
                #continue
        df0['AccountAge'] = AccountAge
        # add/define additional features ...
        df0['Retweets'] = df0.RawTweets.apply(lambda x: str(x).split()[0] == 'RT')
        df0['RawTweetsLen'] = df0.RawTweets.apply(lambda x: len(str(x)))  # modified
        df0['DescriptionLen'] = df0.Description.apply(lambda x: len(str(x)))
        df0['UserNameLen'] = df0.UserName.apply(lambda x: len(str(x)))
        df0['ScreenNameLen'] = df0.ScreenName.apply(lambda x: len(str(x)))
        df0['LocationLen'] = df0.Location.apply(lambda x: len(str(x)))
        df0['Activeness'] = df0.StatusesCount.truediv(df0.AccountAge)
        df0['Friendship'] = df0.FriendsCount.truediv(df0.FollowersCount)
        df0['Followership'] = df0.FollowersCount.truediv(df0.FriendsCount)
        df0['Interestingness'] = df0.FavouritesCount.truediv(df0.StatusesCount)
        df0['BidirFriendship'] = (df0.FriendsCount + df0.FollowersCount).truediv(df0.FriendsCount)
        df0['BidirFollowership'] = (df0.FriendsCount + df0.FollowersCount).truediv(df0.FollowersCount)
        df0['NamesRatio'] = df0.ScreenNameLen.truediv(df0.UserNameLen)
        df0['CleanTweetsLen'] = df0.CleanTweets.apply(lambda x: len(str(x)))
        df0['LexRichness'] = df0.CleanTweetsLen.truediv(df0.RawTweetsLen)
        # Remove all RTs, set UserID as index and save relevant files:
        df0 = df0[df0.Retweets.values == False]  # remove retweets
        df0 = df0.set_index('UserID')
        df0 = df0[~df0.index.duplicated()]  # remove duplicate users
        #df0.to_csv(data_source[:15]+'all_extracts.csv') #save all extracts as csv
        df0.to_csv(data_sources[:5] + 'all_extracts.csv')  # save all extracts as csv; data_sources is a single filename here, so [:5] gives e.g. 'phone'
        with open(data_sources[:5] + 'non_English.txt', 'w') as d:  # save count of non-English tweets
            d.write('{}'.format(non_english_tweets))
            d.close()
        return df0
    def detector(self, data_sources):  # accept a list of raw tweet files (JSON objects)
        self.data_sources = data_sources
        for data_source in data_sources:  # process one input file at a time
            df0 = self.extractor(data_source)
            # drop fields not required for prediction
            X = df0.drop(['Language', 'TweetID', 'RawTweets', 'CleanTweets', 'CreatedAt', 'AccountCreated', 'ScreenName',
                          'Retweets', 'UserName', 'Location', 'Description', 'UserURL', 'VerifiedAccount', 'RetweetCount',
                          'TimeZone', 'TweetFavouriteCount'], axis=1)
            X = X.replace([np.inf, -np.inf], np.nan)  # replace infinities from zero divisions with NaN
            X = X.dropna()  # note: drops every row containing at least one NaN, so X can end up empty
            # reload the trained model for use:
            spd_filter = pickle.load(open('trained_rf.pkl', 'rb'))
            PredictedClass = spd_filter.predict(X)  # predict spam or automated accounts/tweets; this is the line that raises the ValueError when X has no rows
            X['PredictedClass'] = PredictedClass  # include the predicted class in the dataframe
            nonspam = df0.loc[X.PredictedClass.values == 1]  # sort out the non-spam accounts (the mask has one entry per row of X, not of df0)
            spam = df0.loc[X.PredictedClass.values == 0]  # sort out spam/automated accounts
            #relevant_tweets = nonspam[['CreatedAt', 'CleanTweets']]
            relevant_tweets = nonspam[['CreatedAt', 'AccountCreated', 'ScreenName', 'Location', 'TimeZone', 'Description', 'VerifiedAccount',
                                       'RawTweets', 'CleanTweets', 'TweetFavouriteCount', 'Retweets']]
            relevant_tweets = relevant_tweets.reset_index()  # reset index and remove it from the dataframe
            #relevant_tweets = relevant_tweets.drop('UserID', axis=1)
            # save files:
            X.to_csv(data_source[:5] + '_all_predicted_classes.csv')  # save all predicted classes as csv, used to be 15
            nonspam.to_csv(data_source[:5] + '_nonspam_accounts.csv')
            spam.to_csv(data_source[:5] + '_spam_accounts.csv')
            relevant_tweets.to_csv(data_source[:5] + '_relevant_tweets.csv')  # relevant tweets for subsequent analysis
        return relevant_tweets  # or return relevant_tweets, nonspam, spam
if __name__ == '__main__':
    main()
```
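If it helps the discussion, this is the kind of guard I was thinking of adding just before the `predict()` call in `detector()` (only a sketch; `X` and `df0` are the variables from the script above), so the script at least fails with a clearer message when the feature matrix is empty:

```python
# sketch of a guard placed right before spd_filter.predict(X) in detector()
if X.empty:
    raise RuntimeError(
        'No rows left to classify: df0 had {} rows after extraction, '
        'but the drop()/dropna() steps left X empty'.format(len(df0))
    )
```

But I still do not understand why every row is being dropped in the first place, so any pointers would be appreciated.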