stability-selection - A scikit-learn compatible implementation of stability selection

Last update: Dec 3, 2022

Overview

stability-selection is a Python implementation of the stability selection feature selection algorithm, first proposed by Meinshausen and Bühlmann [1].

The idea behind stability selection is to inject more noise into the original problem by generating bootstrap samples of the data, and to use a base feature selection algorithm (like the LASSO) to find out which features are important in every sampled version of the data. The results on each bootstrap sample are then aggregated to compute a stability score for each feature in the data. Features can then be selected by choosing an appropriate threshold for the stability scores.
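The aggregation step can be sketched in a few lines. This is a minimal illustration of the idea, not the package's implementation: stability_scores is a hypothetical helper, the base selector is a plain scikit-learn Lasso with a fixed alpha, each round subsamples half of the rows without replacement, and the threshold of 0.6 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso


def stability_scores(X, y, alpha=0.1, n_rounds=50, seed=0):
    """Fraction of subsamples in which each feature gets a non-zero Lasso coefficient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_rounds):
        # Draw half of the rows without replacement.
        idx = rng.choice(n, size=n // 2, replace=False)
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        # Count the features the base selector kept on this subsample.
        counts += model.coef_ != 0
    return counts / n_rounds


# Toy regression problem: only features 0 and 1 influence y (made-up data).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

scores = stability_scores(X, y)
# Keep features whose stability score reaches the chosen threshold.
selected = np.flatnonzero(scores >= 0.6)
print(selected)
```

Features that are picked in most subsamples get a score close to 1, while features that are only picked occasionally score close to 0 and fall below the threshold.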

Installation

To install the module, clone the repository

git clone https://github.com/scikit-learn-contrib/stability-selection.git

Before installing the module you will need numpy, matplotlib, and scikit-learn. Install these packages separately, or install them using the requirements.txt file:

pip install -r requirements.txt

and execute the following in the project directory to install stability-selection:

python setup.py install

Documentation and algorithmic details

See the documentation for details on the module, and the accompanying blog post for the algorithmic details.

Example usage

stability-selection implements a class StabilitySelection that takes any scikit-learn compatible estimator that has either a feature_importances_ or coef_ attribute after fitting. Other important parameters are:

lambda_name: the name of the penalization parameter of the base estimator (for example, C in the case of LogisticRegression).
lambda_grid: an array of values of the penalization parameter to iterate over.

After instantiation, the algorithm can be run with the familiar fit and transform calls.

Basic example

See below for an example:

import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import check_random_state
from stability_selection import StabilitySelection


def _generate_dummy_classification_data(p=1000, n=1000, k=5, random_state=123321):

    rng = check_random_state(random_state)

    X = rng.normal(loc=0.0, scale=1.0, size=(n, p))
    betas = np.zeros(p)
    # Sample k distinct feature indices (replace=False avoids duplicates).
    important_betas = np.sort(rng.choice(a=np.arange(p), size=k, replace=False))
    betas[important_betas] = rng.uniform(size=k)

    probs = 1 / (1 + np.exp(-1 * np.matmul(X, betas)))
    y = (probs > 0.5).astype(int)

    return X, y, important_betas

# Prepare the dummy data set.
n, p, k = 500, 1000, 5

X, y, important_betas = _generate_dummy_classification_data(p=p, n=n, k=k)
# The liblinear solver supports the L1 penalty; the default lbfgs solver does not.
base_estimator = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(penalty='l1', solver='liblinear'))
])

# Instantiate and run stability selection.
selector = StabilitySelection(base_estimator=base_estimator, lambda_name='model__C',
                              lambda_grid=np.logspace(-5, -1, 50)).fit(X, y)

print(selector.get_support(indices=True))
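Because the dummy data generator also returns the true important_betas, the selected indices can be checked against the ground truth. The following is a small sketch of such a check; support_recovery is a hypothetical helper that is not part of stability-selection, and the index lists in the usage lines are made up for illustration.

```python
import numpy as np


def support_recovery(selected, true_support):
    """Return (precision, recall) of the selected feature indices."""
    selected = set(np.asarray(selected).tolist())
    true_support = set(np.asarray(true_support).tolist())
    hits = len(selected & true_support)
    # Guard against empty sets to avoid division by zero.
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(true_support) if true_support else 0.0
    return precision, recall


# Made-up indices: two of the three selected features are truly important.
precision, recall = support_recovery([3, 17, 42], [3, 42, 99, 512])
print(precision, recall)
```

In the example above, the same helper could be called as support_recovery(selector.get_support(indices=True), important_betas).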

Bootstrapping strategies

stability-selection uses subsampling without replacement by default (as proposed in the original paper [1]), but also supports other bootstrapping strategies. Shah and Samworth [2] proposed complementary pairs bootstrapping, where the data set is subsampled in pairs such that the two subsamples have an empty intersection and their union equals the original data set. StabilitySelection supports this through the bootstrap_func parameter.

This parameter can be:

A string, which must be one of
- 'subsample': For subsampling without replacement (default).
- 'complementary_pairs': For complementary pairs subsampling [2].
- 'stratified': For stratified bootstrapping in imbalanced classification.
A function that takes y and a random state as inputs and returns a list of sample indices in the range 0 to len(y) - 1.

For example, the StabilitySelection call in the above example can be replaced with

selector = StabilitySelection(base_estimator=base_estimator,
                              lambda_name='model__C',
                              lambda_grid=np.logspace(-5, -1, 50),
                              bootstrap_func='complementary_pairs')
selector.fit(X, y)

to run stability selection with complementary pairs bootstrapping.
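To make the idea of a complementary pair concrete, here is a minimal NumPy sketch. It is not the package's implementation and is not meant to be passed as bootstrap_func as-is: the function name and its return value (a pair of index arrays) are assumptions made for illustration only.

```python
import numpy as np


def complementary_pair(y, random_state=None):
    """Split the indices 0..len(y)-1 into two disjoint random halves."""
    rng = np.random.default_rng(random_state)
    permuted = rng.permutation(len(y))
    half = len(y) // 2
    # The two halves never overlap; for even len(y) they cover every index.
    return permuted[:half], permuted[half:2 * half]


y = np.arange(10)
first, second = complementary_pair(y, random_state=0)
print(sorted(first), sorted(second))
```

When len(y) is odd, one index is left out of both halves in this sketch, so the union only equals the full data set for an even number of samples.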

Feedback and contributing

Feedback and contributions are much appreciated. If you have any feedback, please post it on the issue tracker.

References

[1] Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4), pp.417-473.

[2] Shah, R.D. and Samworth, R.J., 2013. Variable selection with error control: another look at stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(1), pp.55-80.

Comments

Typo and path highlight bug
Two very small fixes:

One typo fix in the README.md file of the word 'complementary'

One conditional execution in stability selection plot; in the case where all paths were highlighted, this code would error on line 139 (plotting non-highlighted paths). There is an argument for saying then the regularisation parameter search range is not appropriate, but we can avoid the error by adding an if statement.
opened by jwaton 2
Add stratified sampling
Let me know if this looks useful :-)

adds stratified (group-by-group) bootstrap sampling in order to improve performance for imbalanced datasets

requires a small change in the function signature for all bootstrap functions
opened by vrtsig 2
[MRG] Refactor bootstrap and allow user to pass in custom subsampling function
TODO:

Improve help docs

Write example that shows how to use complementary pairs bootstrapping

Allow keyword arguments to be passed to bootstrap_func
opened by thuijskens 1
In stability_selection.py, the joblib externals are no longer part of scikit-learn (discontinued); in randomized_lasso.py, sklearn.linear_model.base raises errors after cloning and configuring when the package is imported (from stability_selection import RandomizedLasso). How to solve it, step by step:

1. Clone the project and run the commands as described on the main project page, but do not run setup.py yet.

2. On Windows, go to the folder where the project was cloned. In my case this is: C:\Users\andre\stability-selection

3. You will see two files named randomized_lasso.py and stability_selection.py.
3.1 Open and edit them with Python IDLE.
3.2 In randomized_lasso.py, change
from sklearn.linear_model.base import _preprocess_data
to
from sklearn.linear_model._base import _preprocess_data
3.3 In stability_selection.py, change
from sklearn.externals.joblib import Parallel, delayed
to
import joblib as jb
from joblib import Parallel, delayed
Do not forget to save the files (Ctrl+S).

4. In cmd, navigate to the folder where you made the clone (as in step 2 above) and run the command: python setup.py install

You will now be able to use stability selection in your code with newer versions of scikit-learn and its dependencies.

Those who already have stability selection on their machine should delete any folder related to the package, including inside the Python libs on the computer, and start the above process again.

For Linux users, the process is the same, with the exception of the cmd paths above.

Hope this helps people who have the same problem.
opened by 1993bio 1
Issues with importing StabilitySelection

Hi,

I'm sure there's a better way to do this, but I had to change these two lines in order to import StabilitySelection from the latest version.

stability_selection.py, line 26:

from joblib import Parallel, delayed

randomized_lasso.py, line 21:

from sklearn.linear_model._base import _preprocess_data

Thanks for building this module!

opened by trislett 0
suggested reading: paper with example usage of stability selection

Hey, thank you very much for developing this code, so all of us can use it through sk-learn :)

I have seen your blog post that gives a very good understanding of the method. In the reference papers section you do cite the original paper. I would suggest to add this more practical paper about the methods: Bühlmann, Peter, Markus Kalisch, and Lukas Meier. "High-dimensional statistics with a view toward applications in biology." (2014). paper link

At least this paper helped me to grasp the basics on these feature selection methods, without going to original publication. (of course the original publication of the method is the reading source, but depending on the statistics background a developer/researcher should choose the correct reading)

Hopefully, this addition helps more people that will use your package to understand the stability selection method/idea. :)

opened by damianosmel 0
Import joblib directly, as sklearn.externals.joblib was removed

sklearn.externals.joblib was deprecated in v0.21 of scikit-learn, and has been removed in v0.23 (the latest stable release).

This PR imports directly from joblib and adds it as a dependency.

Fixes #33

opened by lcreteig 3
joblib no longer included in scikit-learn

sklearn.externals.joblib was deprecated in v0.21 and has been removed in v0.23 of scikit-learn (the latest stable release).

Therefore this line in stability_selection.py: from sklearn.externals.joblib import Parallel, delayed will throw an ImportError

P.S. Thanks for making this available! It was really easy to get started with the code, and I learned a lot from the accompanying blog post.

opened by lcreteig 0
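The joblib import problem described in the issues above can also be handled with a guarded import. This is only a sketch of one possible workaround, not a change that is confirmed to be in the package; it assumes joblib is installed as a standalone package, and the Parallel call at the end is just a smoke test.

```python
try:
    # Standalone joblib: the import path that works once
    # sklearn.externals.joblib has been removed (scikit-learn >= 0.23).
    from joblib import Parallel, delayed
except ImportError:
    # Fallback for old environments that only ship the copy
    # vendored inside scikit-learn.
    from sklearn.externals.joblib import Parallel, delayed

# Smoke test: square a few integers through the imported helpers.
squares = Parallel(n_jobs=1)(delayed(pow)(i, 2) for i in range(4))
print(squares)
```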

Owner

scikit-learn compatible projects

