A fast xgboost feature selection algorithm

Overview

BoostARoota

A Fast XGBoost Feature Selection Algorithm (plus other sklearn tree-based classifiers)

Why Create Another Algorithm?

Automated processes like Boruta showed early promise as they were able to provide superior performance with Random Forests, but they have some deficiencies, including slow computation time, especially with high-dimensional data. Regardless of the run time, Boruta does perform well on Random Forests, but performs poorly on other algorithms such as boosting or neural networks. Similar deficiencies occur with regularization in LASSO, elastic net, or ridge regressions: they perform well on linear regressions, but poorly on other modern algorithms.

I am proposing and demonstrating a feature selection algorithm (called BoostARoota) in a similar spirit to Boruta utilizing XGBoost as the base model rather than a Random Forest. The algorithm runs in a fraction of the time it takes Boruta and has superior performance on a variety of datasets. While the spirit is similar to Boruta, BoostARoota takes a slightly different approach for the removal of attributes that executes much faster.

Installation

The easiest way is to use pip:

$ pip install boostaroota

Usage

This module is built for use in a similar manner to sklearn with fit(), transform(), etc. To use the package, X must be one-hot encoded (OHE), so the pandas function pd.get_dummies(X) may be helpful as it determines which variables are categorical and converts them into dummy variables. The package relies on pandas under the hood, so data must be passed in as a pandas DataFrame.

Assuming you have X and Y split, you can run the following:

from boostaroota import BoostARoota
import pandas as pd

#OHE the variables - BoostARoota may break if not done
x = pd.get_dummies(x)
#Specify the evaluation metric: can use whichever you like as long as recognized by XGBoost
  #EXCEPTION: multi-class currently only supports "mlogloss" so it must be passed in as the evaluation metric
br = BoostARoota(metric='logloss')

#Fit the model for the subset of variables
br.fit(x, y)

#Can look at the important variables - will return a pandas series
br.keep_vars_

#Then modify dataframe to only include the important variables
br.transform(x)

It's really that simple! Of course, as we build more functionality there may be a few more steps.

Keep in mind that since you are one-hot encoding, if you have a numeric variable that is imported by Python as a string, pd.get_dummies() will convert each distinct value into its own column. This can cause your DataFrame to explode in size, giving unexpected results and high run times.
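As a rough illustration of that warning, the sketch below uses a small made-up frame (the column names and values are hypothetical) in which a numeric column was read in as strings. It only shows how the column count changes before and after converting the column with pd.to_numeric; it is not part of BoostARoota itself.

```python
import pandas as pd

# Hypothetical toy frame: "age" was read in as strings, "income" as numbers.
df = pd.DataFrame({
    "age": ["23", "35", "41", "58"],
    "income": [40000, 52000, 61000, 75000],
})

# Each distinct string in "age" becomes its own dummy column.
exploded = pd.get_dummies(df)
print(exploded.shape[1])  # income plus one column per distinct age string

# Converting the column to a numeric dtype first keeps it as a single feature.
df["age"] = pd.to_numeric(df["age"])
fixed = pd.get_dummies(df)
print(fixed.shape[1])     # only the two original columns remain
```

Checking df.dtypes before calling pd.get_dummies() is a cheap way to spot columns that would be expanded this way.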

New as of 1/22/2018: can insert any sklearn tree-based learner into BoostARoota

Please be aware that this hasn't been fully tested out for which parameters (cutoff, iterations, etc.) are optimal. Currently, that will require some trial and error on the user's part.

For example, to use another classifier, initialize the classifier and then pass it into the BoostARoota object like so:

from sklearn.ensemble import ExtraTreesClassifier
clf = ExtraTreesClassifier()

br = BoostARoota(clf=clf)
new_train = br.fit_transform(x, y)

You can also view a complete demo here.

Usage - Choosing Parameters

The default parameters were chosen to work well across the widest range of input dataframes. However, there are cases where other values work better.

  • clf [default=None] - optional, recommended to leave empty
    • Will default to xgboost if left empty
    • For use with any tree based learner from sklearn.
      • The default parameters are not optimal and will require user experimentation.
  • cutoff [default=4] - float (cutoff > 0)
    • Adjustment to removal cutoff from the feature importances
      • Larger values will be more conservative - if values are set too high, only a small number of features may end up being removed.
      • Smaller values will be more aggressive. Any value above zero is allowed (can be a float).
  • iters [default=10] - int (iters > 0)
    • The number of iterations to average for the feature importances
      • It will run with a value of 1, but this is not recommended as there is quite a bit of random variation between runs
      • Smaller values will run faster as XGBoost is run fewer times
      • Run time scales linearly: iters=4 takes 2x the time of iters=2 and 4x the time of iters=1
  • max_rounds [default=100] - int (max_rounds > 0)
    • The number of times the core BoostARoota algorithm will run. Each round eliminates more and more features
      • Default is set high enough that it really shouldn't be reached under normal circumstances
      • Set this value lower if you feel the algorithm is removing variables too aggressively.
  • delta [default=0.1] - float (0 < delta <= 1)
    • Stopping criteria for whether another round is started
      • Regardless of this value, will not progress past max_rounds
      • A value of 0.1 means that at least 10% of the features must be removed in order to move onto the next round
      • Setting higher values will make it more difficult to move on to follow-on rounds (e.g. setting it to 1 guarantees only one round)
      • Setting delta too low may result in eliminating too many features, with the number of rounds constrained only by max_rounds
  • silent [default=False] - boolean
    • Set to True if you don't want to see the BoostARoota output printed. Any errors or warnings that occur will still be shown.
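To make the delta rule concrete, here is a minimal sketch of the stopping check as described in the list above. The function name should_stop and its arguments are hypothetical and are not part of the package API; the guard for an empty feature set is an added assumption.

```python
def should_stop(n_kept: int, n_before: int, delta: float = 0.1) -> bool:
    """Return True when a round removed too few features to justify another."""
    if n_before == 0:
        # Nothing left to select from; guard against dividing by zero.
        return True
    kept_fraction = n_kept / n_before
    # Stop when less than a `delta` share of the features was removed.
    return kept_fraction > (1 - delta)


# 100 -> 95 features: only 5% removed, so no further round is started.
print(should_stop(95, 100))
# 100 -> 50 features: half were removed, so another round would run.
print(should_stop(50, 100))
# delta=1 stops after the first round whenever any feature survives.
print(should_stop(50, 100, delta=1.0))
```

Independently of this check, the number of rounds is still capped by max_rounds.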

How it works

Similar in spirit to Boruta, BoostARoota creates shadow features, but modifies the removal step.

  1. One-Hot-Encode the feature set
  2. Double width of the data set, making a copy of all features in original dataset
  3. Randomly shuffle the new features created in (2). These duplicated and shuffled features are referred to as "shadow features"
  4. Run XGBoost classifier on the entire data set ten times. Running it ten times allows for random noise to be smoothed, resulting in more robust estimates of importance. The number of repeats is a parameter that can be changed.
  5. Obtain importance values for each feature. This is a simple importance metric that sums up how many times the particular feature was split on in the XGBoost algorithm.
  6. Compute the "cutoff": take the average feature importance value across all shadow features and divide it by four. The shadow importance values are divided by four (this parameter can be changed) to make it more difficult for variables to be removed. With smaller divisors, features are removed at too high of a rate.
  7. Remove features with average importance across the ten iterations that is less than the cutoff specified in (6)
  8. Go back to (2) until the number of features removed is less than ten percent of the total.
  9. Method returns the features remaining once completed.
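The steps above can be sketched in plain Python. This is only an illustration of the control flow, not the package's implementation: absolute Pearson correlation is swapped in as a stand-in for XGBoost split counts (so each feature is scored on its own rather than by one model trained on all columns), the one-hot encoding step is assumed to be done already, and the names abs_corr, one_round, and select as well as the toy data are hypothetical.

```python
import random
from statistics import mean


def abs_corr(xs, ys):
    """Stand-in importance score: absolute Pearson correlation with the target."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return abs(cov) / (vx * vy) ** 0.5 if vx and vy else 0.0


def one_round(features, y, cutoff, iters, rng):
    """Steps 2-7: build shuffled shadows, average importances, drop weak features."""
    totals = {name: 0.0 for name in features}
    shadow_total = 0.0
    for _ in range(iters):                      # step 4: repeat to smooth noise
        shadow_scores = []
        for name, column in features.items():
            shadow = column[:]                  # step 2: copy the feature
            rng.shuffle(shadow)                 # step 3: shuffle the copy
            totals[name] += abs_corr(column, y)             # step 5
            shadow_scores.append(abs_corr(shadow, y))
        shadow_total += mean(shadow_scores)
    threshold = (shadow_total / iters) / cutoff             # step 6
    # Step 7: keep features whose average importance reaches the threshold.
    return {n: c for n, c in features.items() if totals[n] / iters >= threshold}


def select(features, y, cutoff=4, iters=10, max_rounds=100, delta=0.1, seed=0):
    """Steps 8-9: repeat rounds until too few features are removed."""
    rng = random.Random(seed)
    for _ in range(max_rounds):
        kept = one_round(features, y, cutoff, iters, rng)
        removed_share = 1 - len(kept) / len(features)
        features = kept
        if removed_share < delta or not features:
            break
    return list(features)


# Toy data: one informative column, one constant column, five noise columns.
rng = random.Random(42)
y = [i % 2 for i in range(200)]
features = {
    "signal": [t + rng.gauss(0, 0.1) for t in y],
    "constant": [1.0] * 200,
}
for j in range(5):
    features[f"noise_{j}"] = [rng.gauss(0, 1) for _ in y]

selected = select(features, y)
print(selected)
```

Because the shadow average is divided by the cutoff, the threshold is deliberately low: the informative column is always kept and the constant column is dropped, while some noise columns may survive.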

Algorithm Performance

BoostARoota is shortened to BAR, and the table below includes the LSVT dataset from the UCI datasets. The algorithm has been tested on other datasets. If you are interested in the specifics of the testing, please take a look at the testBAR.py script. The basics are that it is run through 5-fold CV, with the feature selection performed on the training set and then predictions made on the held-out test set. It is done this way to avoid overfitting the feature selection process.
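A minimal sketch of that evaluation pattern follows. It is not taken from testBAR.py: the helper names k_fold_indices and cross_validate, and the two callables passed in, are hypothetical placeholders, and the folds are contiguous rather than shuffled or stratified.

```python
def k_fold_indices(n_rows, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validate(n_rows, select_features, fit_and_score, k=5):
    """Run feature selection inside each fold, using the training rows only."""
    scores = []
    for train_idx, test_idx in k_fold_indices(n_rows, k):
        kept = select_features(train_idx)   # held-out rows never reach the selector
        scores.append(fit_and_score(kept, train_idx, test_idx))
    return sum(scores) / len(scores)


# Placeholder callables that record which rows the selector was given.
seen = []

def toy_select(train_idx):
    seen.append(set(train_idx))
    return ["f1", "f2"]

def toy_score(kept, train_idx, test_idx):
    return 0.5

avg = cross_validate(10, toy_select, toy_score)
print(avg, len(seen))
```

In a real run, select_features would fit the selector on the training rows and fit_and_score would train a model on the kept columns and compute log loss on the test rows.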

All tests are run on a 12 core (hyperthreaded) Intel i7. Future iterations will compare run times on a 28 core Xeon, 120 cores on Spark, and running xgboost on a GPU.

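| Data Set | Target | Boruta Time | BoostARoota Time | BoostARoota LogLoss | Boruta LogLoss | All Features LogLoss | BAR >= All |
|----------|--------|-------------|------------------|---------------------|----------------|----------------------|------------|
| LSVT     | 0/1    | 50.289s     | 0.487s           | 0.5617              | 0.6950         | 0.7311               | Yes        |
| HR       | 0/1    | 33.704s     | 0.485s           | 0.1046              | 0.1003         | 0.1047               | Yes        |
| Fraud    | 0/1    | 38.619s     | 1.790s           | 0.4333              | 0.4353         | 0.4333               | Yes        |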

As can be seen, BoostARoota runs roughly 20x to 100x faster than Boruta on these datasets, with a substantial reduction in log loss on LSVT and comparable log loss on HR and Fraud. Part of this speedup is that Boruta is running single threaded, while BoostARoota (on XGB) is running on all 12 cores. It is not yet clear how the speedup scales to larger datasets.
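For reference, the per-dataset speedups can be computed directly from the run times in the table. This is simple arithmetic on the reported numbers, not a new benchmark.

```python
# Run times in seconds, copied from the table: (Boruta, BoostARoota).
timings = {
    "LSVT": (50.289, 0.487),
    "HR": (33.704, 0.485),
    "Fraud": (38.619, 1.790),
}

speedups = {name: boruta / bar for name, (boruta, bar) in timings.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

This works out to about 103x for LSVT, 69x for HR, and 22x for Fraud.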

This has also been tested on Kaggle's House Prices. With nothing done except running BoostARoota and evaluating on RMSE, all features scored 0.15669, while BoostARoota scored 0.1560.

Future Functionality (i.e. Current Shortcomings)

The text file FS_algo_basics.txt details how I was thinking through the algorithm and what additional functionality was thought about during the creation.

  • Preprocessing Steps - Need some first pass filters for reducing dimensionality right off the bat
    • Check and drop identical features, leaving option to drop highly correlated variables
    • Drop variables with near-zero-variance to target variable (creating threshold will be difficult)
    • LDA, PCA, PLS rankings
      • Challenge with these is they remove based on linear relationships whereas trees are able to pick out the non-linear relationships and a variable with a low linear dependency may be powerful when combined with others.
    • t-SNE - Has shown some promise in high-dimensional data
  • Algorithm could use a better stopping criterion
    • Next step is to test it against Y and the eval_metric to see when it is falling off.
  • Expand compute to handle larger datasets (if user has the hardware)
    • Run on Dask - Issue was opened up and Chase is working on it
    • Run on PySpark: make it easy enough that can just pass in SparkContext - will require some refactoring
    • Run XGBoost on GPU - although may run into memory issues with the shadow features.
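The concern raised in the list above about LDA, PCA, and PLS rankings (they filter on linear relationships, while trees can pick up non-linear ones) can be illustrated with a toy example. The data here is made up: the target is fully determined by the feature, yet their linear correlation is zero, while two threshold splits of the kind a tree could make separate the target well.

```python
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = sum((a - mx) ** 2 for a in xs) ** 0.5
    sy = sum((b - my) ** 2 for b in ys) ** 0.5
    return cov / (sx * sy)


x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]          # y depends on x, but not linearly

r = pearson(x, y)
print(r)                        # a linear filter would see no relationship

# Two threshold splits on x separate small targets from large ones.
inner = [yv for xv, yv in zip(x, y) if -1.5 < xv < 1.5]
outer = [yv for xv, yv in zip(x, y) if not -1.5 < xv < 1.5]
print(mean(inner), mean(outer))
```

A correlation-based first-pass filter would drop this feature even though a tree-based learner could use it.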

Updates

  • 1/22/18 - Added functionality to insert any tree based classifier from sklearn into BoostARoota.
  • 10/26/17 - Modified Structure to resemble sklearn classes and added tuning parameters.
  • 9/22/17 - Uploaded to PyPI and expanded tests
  • 9/8/17 - Added Support for multi-class classification, but only for the logloss eval_metric. Need to pass in eval="mlogloss"
  • 9/6/17 - have implemented in BoostARoota2() a stopping criteria specifying that at least 10% of features need to be dropped to continue.
  • 8/25/17 - The testBAR.py testing framework was just completed and ran through a number of datasets

Want to Contribute?

This project has found some initial successes and there are a number of directions it can head. It would be great to have some additional help if you are willing/able. Whether it is directly contributing to the codebase or just giving some ideas, any help is appreciated. The goal is to make the algorithm as robust as possible. The primary focus right now is on the components under Future Functionality, which are in active development. Please reach out if there is anything you would like to contribute in that area, to make sure we aren't duplicating work.

A special thanks to Progressive Leasing for sponsoring this research.

Comments
  • Configurable estimator similar to boruta_py

    Would it be possible to make the estimator configurable? Currently you're requiring xgboost, but I'd prefer to be able to try to use lightgbm (https://github.com/szilard/GBM-perf) as it benchmarks a bit faster.

    opened by w1nk 5
  • Data must be 1-dimensional

    I would appreciate if you could let me know how to deal with this error:

    X = np.array(pd.read_csv('tot_X_1.csv',header=None).values)
    y = np.array(pd.read_csv('tot_Y_1.csv',header=None).values.ravel())
    
    # Split data set to train and test data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,stratify=y, test_size=0.3, random_state=42)
    
    X_train=pd.get_dummies(X_train)
    
    
    br = BoostARoota(metric='f1')
    
    #Fit the model for the subset of variables
    br.fit(X_train,y_train)
    
    #Can look at the important variables - will return a pandas series
    br.keep_vars_
    
    #Then modify dataframe to only include the important variables
    br.transform(X_train)
    

    Error:

      File "D:/mifs-master_2/MU/learning-from-imbalanced-classes-master/learning-from-imbalanced-classes-master/continuous/Bankrupt_2/Bankrupt/data/chase.py", line 15, in <module>
        X_train=pd.get_dummies(X_train)
      File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 1215, in get_dummies
        sparse=sparse, drop_first=drop_first)
      File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\reshape\reshape.py", line 1222, in _get_dummies_1d
        codes, levels = _factorize_from_iterable(Series(data))
      File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\series.py", line 264, in __init__
        raise_cast_failure=True)
      File "C:\Users\Markazi.co\Anaconda3\lib\site-packages\pandas\core\series.py", line 3234, in _sanitize_array
        raise Exception('Data must be 1-dimensional')
    Exception: Data must be 1-dimensional
    

    Best regards,

    opened by shahlaebrahimi 5
  • ZeroDivisionError: division by zero


    ZeroDivisionError                         Traceback (most recent call last)
    in ()
          1 br = BoostARoota(metric='logloss',delta=0.05)
    ----> 2 br.fit(all_feats,target_data);

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in fit(self, x, y)
         51             max_rounds=self.max_rounds,
         52             delta=self.delta,
    ---> 53             silent=self.silent)
         54         return self
         55

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _BoostARoota(x, y, metric, clf, cutoff, iters, max_rounds, delta, silent)
        224                 n_iterations=iters,
        225                 delta=delta,
    --> 226                 silent=silent)
        227         else:
        228             crit, keep_vars = _reduce_vars_sklearn(new_x,

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _reduce_vars_xgb(x, y, metric, this_round, cutoff, n_iterations, delta, silent)
        139     #Check for the stopping criteria
        140     #Basically looking to make sure we are removing at least 10% of the variables, or we should stop
    --> 141     if (len(real_vars['feature']) / len(x.columns)) > (1-delta):
        142         criteria = True
        143     else:

    ZeroDivisionError: division by zero

    opened by pbebbo 4
  • AttributeError: 'numpy.ndarray' object has no attribute 'columns'

    When running br.fit(train.values,labels.values) I get the following error:


    AttributeError                            Traceback (most recent call last)
    in ()
          1 br = BoostARoota(metric='logloss')
          2
    ----> 3 br.fit(train.values,labels.values)
          4 len(train.columns)
          5 len(br.keep_vars_)

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in fit(self, x, y)
         51             max_rounds=self.max_rounds,
         52             delta=self.delta,
    ---> 53             silent=self.silent)
         54         return self
         55

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _BoostARoota(x, y, metric, clf, cutoff, iters, max_rounds, delta, silent)
        224                 n_iterations=iters,
        225                 delta=delta,
    --> 226                 silent=silent)
        227         else:
        228             crit, keep_vars = _reduce_vars_sklearn(new_x,

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _reduce_vars_xgb(x, y, metric, this_round, cutoff, n_iterations, delta, silent)
        113     for i in range(1, n_iterations+1):
        114         # Create the shadow variables and run the model to obtain importances
    --> 115         new_x, shadow_names = _create_shadow(x)
        116         dtrain = xgb.DMatrix(new_x, label=y)
        117         bst = xgb.train(param, dtrain, verbose_eval=False)

    ~/anaconda2/envs/py3k/lib/python3.6/site-packages/boostaroota/boostaroota.py in _create_shadow(x_train)
         77     """
         78     x_shadow = x_train.copy()
    ---> 79     for c in x_shadow.columns:
         80         np.random.shuffle(x_shadow[c].values)
         81     # rename the shadow

    AttributeError: 'numpy.ndarray' object has no attribute 'columns'

    opened by pbebbo 2
  • merging outside of the loop

    Seems to me that in the "_reduce_vars..." functions you need to move the merging of DataFrames df + df2 out of the loop; otherwise you get more and more duplicate columns in the "df" DataFrame on each iteration, so your df.mean(axis=1) call would return the wrong value.

    opened by Rubyr0id 0
  • Repo structure

    Added gitignore and MIT License.
    Moved code from init.py to boostaroota.py
    Removed dist/ and .egg-info/ folders (I don't think you need them in the repo since you have your setup.py)

    opened by ZRiddle 0
  • Added the timings for speed of algorithm

    There is a testTimings.py and testTimings.R file that have been added to evaluate how fast the model executes in the same manner as in the Boruta Paper.

    opened by chasedehan 0
  • Sklearn implementation has an error

    Hi @chasedehan,

    I think I found an error in the sklearn implementation.

    At the moment you add one column to df2 for every iteration that you run, and then df2 is joined to df again. This creates many duplicate columns that dilute the mean of the feature importance later on. You can see this if you print out df after every iteration:

    try:
        importance = clf.feature_importances_
        df2['fscore' + str(i)] = importance
    except ValueError:
        print("this clf doesn't have the feature_importances_ method.  Only Sklearn tree based methods allowed")
    
    # importance = sorted(importance.items(), key=operator.itemgetter(1))
    
    # df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
    df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
    df = pd.merge(df, df2, on='feature', how='outer')
    if not silent:
        print("Round: ", this_round, " iteration: ", i)
    

    Here is a suggestion how to fix it:

    if len(getattr(clf, 'feature_importances_', [])) == 0:
        raise ValueError(
            "this clf doesn't have the feature_importances_ method. Only Sklearn tree based methods allowed"
        )
    
    if i == 1:
        df = pd.DataFrame({'feature': new_x.columns})
    
    # importance = sorted(importance.items(), key=operator.itemgetter(1))
    
    importance = clf.feature_importances_
    importance = np.column_stack([new_x.columns, importance])
    df2 = pd.DataFrame(importance, columns=['feature', 'fscore'+str(i)])
    df2['fscore'+str(i)] = df2['fscore'+str(i)] / df2['fscore'+str(i)].sum()
    df = pd.merge(df, df2, on='feature', how='outer')
    if not silent:
        print("Round: ", this_round, " iteration: ", i)
    
    
    opened by jjuppe 0
  • Add correlation preprocessing

    Hello

    I've written a library that could be an implementation of your idea of correlation preprocessing: https://github.com/bukson/nancorrmp

    It is designed to work on multiple cores in parallel and can handle nans and infs as feature values.

    I can contribute and add code in some way, but probably you want to establish some way of making preprocessing, so for now I am just creating an issue.

    Please contact me if you think that I can help you

    opened by bukson 0
  • Pyspark integration

    Would it be possible to share your pyspark implementation of these functions? I have seen that the full integration is planned for future updates, however you mention that you've run the tests using Spark already. Thanks! Vykintas

    opened by Vykintasj 1
  • Dask integration

    Much like your idea for pyspark integration, I would like to see similar support for passing in a dask client as is supported by the dask-xgboost library. I have found initial success in reducing high dimensional data using the BoostARoota library but find the bottleneck to be during the initial load of the parquet file repository. I'll offer what assistance I can regarding this work.

    Ben.

    opened by bendruitt 3
Owner
Chase DeHan
Data Scientist, Economist, Skier, Coffee Junkie