stability-selection - A scikit-learn compatible implementation of stability selection

Overview

stability-selection is a Python implementation of the stability selection feature selection algorithm, first proposed by Meinshausen and Bühlmann [1].

The idea behind stability selection is to inject more noise into the original problem by generating bootstrap samples of the data, and to use a base feature selection algorithm (like the LASSO) to find out which features are important in every sampled version of the data. The results on each bootstrap sample are then aggregated to compute a stability score for each feature in the data. Features can then be selected by choosing an appropriate threshold for the stability scores.
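The aggregation step described above can be illustrated with a minimal NumPy-only sketch. It is not the package's implementation: the base selector here is a hypothetical stand-in (top-k absolute correlation) rather than the LASSO, and the names `top_k_correlation_selector`, `stability_scores`, and the 0.6 threshold are assumptions chosen purely for illustration.

```python
import numpy as np


def top_k_correlation_selector(X, y, k):
    """Stand-in base selector (an assumption, not the LASSO):
    mark the k features with the largest absolute correlation with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    corr = np.abs(Xc.T @ yc) / denom
    selected = np.zeros(X.shape[1], dtype=bool)
    selected[np.argsort(corr)[-k:]] = True
    return selected


def stability_scores(X, y, k=5, n_rounds=100, sample_fraction=0.5, seed=0):
    """Fraction of subsamples in which each feature is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    subsample_size = int(sample_fraction * n_samples)
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        # Draw a subsample without replacement and rerun the base selector.
        idx = rng.choice(n_samples, size=subsample_size, replace=False)
        counts += top_k_correlation_selector(X[idx], y[idx], k)
    return counts / n_rounds


# Synthetic data: only features 0 and 1 carry signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

scores = stability_scores(X, y, k=5)
# Keep features whose stability score reaches an (arbitrary) threshold.
stable_features = np.flatnonzero(scores >= 0.6)
print(stable_features)
```

Noise features are picked up in some subsamples but rarely in most of them, so thresholding the scores tends to filter them out while the two informative features are retained.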

Installation

To install the module, clone the repository

git clone https://github.com/scikit-learn-contrib/stability-selection.git

Before installing the module you will need numpy, matplotlib, and sklearn. Install these modules separately, or install using the requirements.txt file:

pip install -r requirements.txt

and execute the following in the project directory to install stability-selection:

python setup.py install

Documentation and algorithmic details

See the documentation for details on the module, and the accompanying blog post for the algorithmic details.

Example usage

stability-selection implements a class, StabilitySelection, that takes any scikit-learn compatible estimator that has either a feature_importances_ or coef_ attribute after fitting. Other important parameters are:

  • lambda_name: the name of the penalization parameter of the base estimator (for example, C in the case of LogisticRegression).
  • lambda_grid: an array of values of the penalization parameter to iterate over.

After instantiation, the algorithm can be run with the familiar fit and transform calls.
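To make the roles of lambda_name and lambda_grid concrete, the sketch below shows how a parameter name such as 'model__C' addresses a step inside a scikit-learn Pipeline through the standard set_params mechanism. That StabilitySelection applies the grid in exactly this way internally is an assumption; the loop only illustrates the naming convention.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

base_estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])

# '<step name>__<parameter>' addresses a parameter of a pipeline step.
lambda_name = 'model__C'
lambda_grid = np.logspace(-5, -1, 5)

for value in lambda_grid:
    # Set the penalization parameter on the nested estimator.
    base_estimator.set_params(**{lambda_name: value})
    print(lambda_name, '=', base_estimator.get_params()[lambda_name])
```

For an estimator that is not wrapped in a pipeline, the plain parameter name (for example 'C') would be used instead. Note that C in LogisticRegression is the inverse of the regularization strength, so smaller grid values mean stronger penalization.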

Basic example

See below for an example:

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state
from stability_selection import StabilitySelection


def _generate_dummy_classification_data(p=1000, n=1000, k=5, random_state=123321):

    rng = check_random_state(random_state)

    X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
    betas = np.zeros(p)
    # Sample without replacement so that exactly k distinct features are important
    important_betas = np.sort(rng.choice(a=np.arange(p), size=k, replace=False))
    betas[important_betas] = rng.uniform(size=k)

    probs = 1 / (1 + np.exp(-1 * np.matmul(X, betas)))
    y = (probs > 0.5).astype(int)

    return X, y, important_betas

# This is all preparation of the dummy data set
n, p, k = 500, 1000, 5

X, y, important_betas = _generate_dummy_classification_data(p=p, n=n, k=k)
base_estimator = Pipeline([
    ('scaler', StandardScaler()),
    # The default 'lbfgs' solver does not support the L1 penalty; 'liblinear' does
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])

# Here stability selection is instantiated and run
selector = StabilitySelection(base_estimator=base_estimator, lambda_name='model__C',
                              lambda_grid=np.logspace(-5, -1, 50)).fit(X, y)

print(selector.get_support(indices=True))

Bootstrapping strategies

stability-selection uses subsampling without replacement by default (as proposed in the original paper [1]), but also supports other bootstrapping strategies. Shah and Samworth [2] proposed complementary pairs bootstrapping, where the data set is subsampled in pairs such that the intersection of each pair is empty but the union equals the original data set. StabilitySelection supports this through the bootstrap_func parameter.

This parameter can be:

  • A string, which must be one of
    • 'subsample': For subsampling without replacement (default).
    • 'complementary_pairs': For complementary pairs subsampling [2].
    • 'stratified': For stratified bootstrapping in imbalanced classification.
  • A function that takes y and a random state as inputs and returns a list of sample indices in the range 0 to len(y)-1.

For example, the StabilitySelection call in the above example can be replaced with

selector = StabilitySelection(base_estimator=base_estimator,
                              lambda_name='model__C',
                              lambda_grid=np.logspace(-5, -1, 50),
                              bootstrap_func='complementary_pairs')
selector.fit(X, y)

to run stability selection with complementary pairs bootstrapping.
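As a rough illustration of what the complementary pairs and stratified strategies mean in terms of sample indices, here is a NumPy-only sketch. The function names and signatures are hypothetical and do not reproduce the package's internal bootstrap functions or the exact signature expected by bootstrap_func.

```python
import numpy as np


def complementary_pairs(n_samples, rng):
    """Split a random permutation of the indices into two disjoint halves.

    The halves do not overlap and together cover every sample. For odd
    n_samples the second half is one element larger (a choice made here
    so that the union is always the full data set)."""
    permutation = rng.permutation(n_samples)
    half = n_samples // 2
    return permutation[:half], permutation[half:]


def stratified_subsample(y, sample_fraction, rng):
    """Subsample without replacement within each class, so the class
    proportions of y are roughly preserved in the subsample."""
    indices = []
    for label in np.unique(y):
        members = np.flatnonzero(y == label)
        size = max(1, int(sample_fraction * len(members)))
        indices.append(rng.choice(members, size=size, replace=False))
    return np.sort(np.concatenate(indices))


rng = np.random.default_rng(0)

first, second = complementary_pairs(10, rng)
print(sorted(first), sorted(second))

# Imbalanced labels: 90 negatives, 10 positives.
y_imbalanced = np.array([0] * 90 + [1] * 10)
idx = stratified_subsample(y_imbalanced, sample_fraction=0.5, rng=rng)
print(len(idx), int(y_imbalanced[idx].sum()))
```

With plain subsampling, a small minority class can be badly under-represented in some subsamples; drawing within each class avoids that, which is the motivation given for the 'stratified' option.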

Feedback and contributing

Feedback and contributions are much appreciated. If you have any feedback, please post it on the issue tracker.

References

[1] Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.

[2] Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.

Comments
  • Typo and path highlight bug

    Two very small fixes:

    • One typo fix in the README.md file of the word 'complementary'
    • One conditional execution in stability selection plot; in the case where all paths were highlighted, this code would error on line 139 (plotting non-highlighted paths). There is an argument for saying then the regularisation parameter search range is not appropriate, but we can avoid the error by adding an if statement.
    opened by jwaton 2
  • Add stratified sampling

    Let me know if this looks useful :-)

    • adds stratified (group-by-group) bootstrap sampling in order to improve performance for imbalanced datasets
    • requires a small change in the function signature for all bootstrap functions
    opened by vrtsig 2
  • [MRG] Refactor bootstrap and allow user to pass in custom subsampling function

    TODO:

    • Improve help docs
    • Write example that shows how to use complementary pairs bootstrapping
    • Allow keyword arguments to be passed to bootstrap_func
    opened by thuijskens 1
  • Fixing import errors with recent scikit-learn: sklearn.externals.joblib (stability_selection.py) and sklearn.linear_model.base (randomized_lasso.py)

    In stability_selection.py, the joblib externals are no longer part of scikit-learn (discontinued). In randomized_lasso.py, the import from sklearn.linear_model.base raises an error after cloning and configuring the project, for example when running from stability_selection import RandomizedLasso. How to solve this, step by step:

    1. Clone the project and run the commands as described on the main project page, but don't run setup.py yet.

    2. On Windows, go to the folder where the project was cloned. In my case this is: C:\Users\andre\stability-selection

    3. You will see two files named randomized_lasso.py and stability_selection.py.
       3.1. Open and edit them with IDLE.
       3.2. In randomized_lasso.py, change from sklearn.linear_model.base import _preprocess_data to from sklearn.linear_model._base import _preprocess_data
       3.3. In stability_selection.py, change from sklearn.externals.joblib import Parallel, delayed to import joblib as jb followed by from joblib import Parallel, delayed
       Don't forget to save the files (Ctrl+S).

    4. In cmd, navigate to the folder where you made the clone (as in step 2 above) and run the command: python setup.py install

    You will now be able to use stability selection in your code with newer versions of scikit-learn and its dependencies.

    If you already have stability-selection on your machine, delete any folder related to the package, including inside the Python libs on your computer, and start the process above again.

    For Linux users the process is the same, except for the cmd paths above.

    Hope this helps people who have the same problem.

    opened by 1993bio 1
  • Issues with importing StabilitySelection

    Hi,

    I'm sure there's a better way to do this, but I had to change these two lines in order to import StabilitySelection from the latest version.

    stability_selection.py, line 26

    from joblib import Parallel, delayed

    randomized_lasso.py, line 21

    from sklearn.linear_model._base import _preprocess_data

    Thanks for building this module!

    opened by trislett 0
  • suggested reading: paper with example usage of stability selection

    Hey, thank you very much for developing this code, so all of us can use it through sk-learn :)

    I have seen your blog post, which gives a very good understanding of the method. In the references section you cite the original paper. I would suggest adding this more practical paper about the method: Bühlmann, Peter, Markus Kalisch, and Lukas Meier. "High-dimensional statistics with a view toward applications in biology." (2014).

    At least this paper helped me to grasp the basics on these feature selection methods, without going to original publication. (of course the original publication of the method is the reading source, but depending on the statistics background a developer/researcher should choose the correct reading)

    Hopefully, this addition helps more people that will use your package to understand the stability selection method/idea. :)

    opened by damianosmel 0
  • Import joblib directly, as sklearn.externals.joblib was removed

    sklearn.externals.joblib was deprecated in v0.21 of scikit-learn, and has been removed in v0.23 (the latest stable release).

    This PR imports directly from joblib and adds it as a dependency.

    Fixes #33

    opened by lcreteig 3
  • joblib no longer included in scikit-learn

    sklearn.externals.joblib was deprecated in v0.21 and has been removed in v0.23 of scikit-learn (the latest stable release).

    Therefore this line in stability_selection.py: from sklearn.externals.joblib import Parallel, delayed will throw an ImportError

    P.S. Thanks for making this available! It was really easy to get started with the code, and I learned a lot from the accompanying blog post.

    opened by lcreteig 0