PyImpetus is a Markov Blanket based feature subset selection algorithm that considers features both separately and together as a group in order to provide not just the best set of features but also the best combination of features

Atif Hassan

Last update: Dec 14, 2022

Related tags

Deep Learning statistics machine-learning-algorithms probability feature-selection t-test markov-blanket minimal-features

Overview

PyImpetus

PyImpetus is a Markov Blanket based feature selection algorithm that selects a subset of features by considering their performance both individually and as a group. This allows the algorithm to select not just the best individual features, but the set of features that work best together. For example, the best performing feature might not combine well with others, while the remaining features, taken together, could outperform it. PyImpetus takes this into account and produces the best possible combination. The algorithm therefore returns a minimal feature subset, so you do not have to decide how many features to keep: PyImpetus selects the optimal set for you.
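To make the "individually versus as a group" point concrete, here is a small self-contained sketch in plain Python. It does not use PyImpetus; the dataset, the column names f1, f2, f3 and the `lookup_accuracy` helper are all made up for illustration. The target is the XOR of f1 and f2, so each of them looks useless alone, while f3 is the best single feature but is beaten by the pair.

```python
from collections import Counter, defaultdict

FEATURES = ("f1", "f2", "f3")

# Made-up rows: (f1, f2, f3, target). The target equals f1 XOR f2,
# and f3 matches the target on 6 of the 8 rows.
ROWS = [
    (0, 0, 0, 0),
    (0, 1, 1, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 0),
    (0, 0, 1, 0),
    (0, 1, 0, 1),
    (1, 0, 1, 1),
    (1, 1, 0, 0),
]


def lookup_accuracy(rows, subset):
    """Accuracy (on the same rows) of a majority-vote lookup table over `subset`."""
    idx = [FEATURES.index(name) for name in subset]
    votes = defaultdict(Counter)
    for row in rows:
        key = tuple(row[i] for i in idx)
        votes[key][row[-1]] += 1
    # For every distinct feature-value combination, predict the majority target.
    correct = sum(max(counter.values()) for counter in votes.values())
    return correct / len(rows)


single_scores = {name: lookup_accuracy(ROWS, (name,)) for name in FEATURES}
pair_score = lookup_accuracy(ROWS, ("f1", "f2"))

print(single_scores)  # f3 is the best single feature
print(pair_score)     # f1 and f2 together beat it
```

A selector that ranks features one at a time would pick f3 here and discard f1 and f2, which is the situation the paragraph above describes.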

PyImpetus has been completely revamped and now supports binary classification, multi-class classification and regression tasks. On all 14 datasets it has been tested on, it outperformed state-of-the-art Markov Blanket learning algorithms as well as traditional feature selection algorithms such as Forward Feature Selection, Backward Feature Elimination and Recursive Feature Elimination.

How to install?

pip install PyImpetus

Functions and parameters

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBC is for classification
model = PPIMBC(model, p_val_thresh, num_simul, simul_size, simul_type, sig_test_type, cv, verbose, random_state, n_jobs)

model - estimator object, default=DecisionTreeClassifier() The model which is used to perform classification in order to find feature importance via significance-test.
p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, the value should be considerably reduced, though do not go below 5.
simul_size - float, default=0.2 The size of the test set in each train-test split
simul_type - boolean, default=0 To apply stratification or not
- 0 means train-test splits are not stratified.
- 1 means the train-test splits will be stratified.
sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
- "parametric" means a parametric significance test will be used (Note: This test selects very few features)
- "non-parametric" means a non-parametric significance test will be used
cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
n_jobs - int, default=-1 The number of CPUs to use to do the computation.
- None means 1 unless in a joblib.parallel_backend context.
- -1 means using all processors.
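The difference between simul_type=0 and simul_type=1 can be illustrated with a short sketch of a single train-test split in plain Python. The `split_indices` helper is hypothetical and only mirrors the idea behind simul_size and simul_type; PyImpetus's own splitting code may differ.

```python
import random
from collections import defaultdict


def split_indices(labels, test_size=0.2, stratify=False, seed=0):
    """Return (train_idx, test_idx) for one random train-test split."""
    rng = random.Random(seed)
    if stratify:
        # One bucket per class, so every class contributes test_size of its rows.
        groups = defaultdict(list)
        for i, label in enumerate(labels):
            groups[label].append(i)
        buckets = list(groups.values())
    else:
        # A single bucket: class proportions in the test set are left to chance.
        buckets = [list(range(len(labels)))]

    test_idx = []
    for bucket in buckets:
        rng.shuffle(bucket)
        n_test = round(len(bucket) * test_size)
        test_idx.extend(bucket[:n_test])

    test_set = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in test_set]
    return train_idx, sorted(test_idx)


# Imbalanced toy labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = split_indices(labels, test_size=0.2, stratify=True, seed=27)
test_positives = sum(labels[i] for i in test_idx)
print(len(test_idx), test_positives)
```

With stratification the test set always holds 2 of the 10 positives; without it, a split can end up with very few or none, which makes the per-split scores noisier on imbalanced data.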

# The initialization of PyImpetus takes in multiple parameters as input
# PPIMBR is for regression
model = PPIMBR(model, p_val_thresh, num_simul, simul_size, sig_test_type, cv, verbose, random_state, n_jobs)

model - estimator object, default=DecisionTreeRegressor() The model which is used to perform regression in order to find feature importance via significance-test.
p_val_thresh - float, default=0.05 The p-value (in this case, feature importance) below which a feature will be considered as a candidate for the final MB.
num_simul - int, default=30 (This parameter has a huge impact on speed) Number of train-test splits to perform to check the usefulness of each feature. For large datasets, the value should be considerably reduced, though do not go below 5.
simul_size - float, default=0.2 The size of the test set in each train-test split
sig_test_type - string, default="non-parametric" This determines the type of significance test to use.
- "parametric" means a parametric significance test will be used (Note: This test selects very few features)
- "non-parametric" means a non-parametric significance test will be used
cv - cv object/int, default=0 Determines the number of splits for cross-validation. Sklearn CV object can also be passed. A value of 0 means CV is disabled.
verbose - int, default=2 Controls the verbosity: the higher the value, the more messages are printed.
random_state - int or RandomState instance, default=None Pass an int for reproducible output across multiple function calls.
n_jobs - int, default=-1 The number of CPUs to use to do the computation.
- None means 1 unless in a joblib.parallel_backend context.
- -1 means using all processors.
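The parameter descriptions above say that each feature is checked over num_simul train-test splits and kept as a candidate when its p-value falls below p_val_thresh. The exact test PyImpetus runs is not spelled out here, so the following is only a sketch under stated assumptions: we assume one loss per split with the feature intact and one with the feature's values shuffled, and we use a paired sign test as a simple non-parametric stand-in. The function name and the loss values are hypothetical.

```python
from math import comb


def sign_test_p_value(baseline_losses, shuffled_losses):
    """One-sided paired sign test: is the loss larger once the feature is shuffled?"""
    # Ties carry no information about direction, so they are dropped.
    pairs = [(b, s) for b, s in zip(baseline_losses, shuffled_losses) if b != s]
    n = len(pairs)
    if n == 0:
        return 1.0
    worse = sum(1 for b, s in pairs if s > b)
    # P(X >= worse) for X ~ Binomial(n, 0.5), i.e. under "shuffling makes no difference".
    return sum(comb(n, k) for k in range(worse, n + 1)) / 2 ** n


# Hypothetical per-split losses for one feature over 8 splits.
baseline = [0.30, 0.28, 0.31, 0.29, 0.27, 0.30, 0.32, 0.28]
shuffled = [0.41, 0.39, 0.40, 0.38, 0.26, 0.42, 0.45, 0.37]

p_value = sign_test_p_value(baseline, shuffled)
is_candidate = p_value < 0.05  # 0.05 mirrors the default p_val_thresh
print(p_value, is_candidate)
```

The sketch also hints at why very small num_simul values are discouraged: with only a handful of splits, a test like this cannot reach a small p-value even when every split agrees.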

# To fit PyImpetus on provided dataset and find recommended features
fit(data, target)

data - A pandas dataframe upon which feature selection is to be applied
target - A numpy array, denoting the target variable

# This function prunes the given data down to the columns that form the MB (These are the recommended features)
transform(data)

data - A pandas dataframe which needs to be pruned

# To fit PyImpetus on provided dataset and return pruned data
fit_transform(data, target)

data - A pandas dataframe upon which feature selection is to be applied
target - A numpy array, denoting the target variable

# To plot XGBoost style feature importance
feature_importance()

How to import?

from PyImpetus import PPIMBC, PPIMBR

Usage

# Import the algorithm. PPIMBC is for classification and PPIMBR is for regression
from PyImpetus import PPIMBC, PPIMBR
# SVC is the estimator passed to PPIMBC below
from sklearn.svm import SVC
# Initialize the PyImpetus object
model = PPIMBC(model=SVC(random_state=27, class_weight="balanced"), p_val_thresh=0.05, num_simul=30, simul_size=0.2, simul_type=0, sig_test_type="non-parametric", cv=5, random_state=27, n_jobs=-1, verbose=2)
# The fit_transform function is a wrapper for the fit and transform functions, individually.
# The fit function finds the MB for given data while transform function provides the pruned form of the dataset
df_train = model.fit_transform(df_train.drop("Response", axis=1), df_train["Response"].values)
df_test = model.transform(df_test)
# Check out the MB
print(model.MB)
# Check out the feature importance scores for the selected feature subset
print(model.feat_imp_scores)
# Get a plot of the feature importance scores
model.feature_importance()
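The usage pattern above, fitting on the training data and then applying the same selection to the test data, can be sketched with a toy selector in plain Python. `ToySelector` and its agreement rule are hypothetical stand-ins and not the PyImpetus implementation; a dictionary of column lists stands in for a pandas dataframe.

```python
def agreement(values, target):
    """Fraction of rows where a binary column equals the binary target."""
    return sum(v == t for v, t in zip(values, target)) / len(target)


class ToySelector:
    """Hypothetical stand-in for a feature selector (not PyImpetus itself)."""

    def __init__(self, min_agreement=0.75):
        self.min_agreement = min_agreement
        self.MB = None  # column names chosen by fit()

    def fit(self, data, target):
        # Toy rule: keep columns that agree with the target often enough.
        self.MB = [
            name for name, values in data.items()
            if agreement(values, target) >= self.min_agreement
        ]
        return self

    def transform(self, data):
        # Reuse the columns chosen during fit; nothing is re-learned here.
        return {name: data[name] for name in self.MB}

    def fit_transform(self, data, target):
        return self.fit(data, target).transform(data)


train = {"a": [0, 1, 1, 0], "b": [1, 1, 0, 0], "c": [0, 1, 1, 1]}
train_target = [0, 1, 1, 0]
test = {"a": [1, 0], "b": [0, 0], "c": [1, 1]}

selector = ToySelector()
train_pruned = selector.fit_transform(train, train_target)
test_pruned = selector.transform(test)
print(selector.MB)
```

The point of the pattern is that the test data never influences which columns are kept: selection happens once, on the training data, and transform only applies it.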

For better accuracy

Note: Play with the values of num_simul, simul_size, simul_type and p_val_thresh, because sometimes a specific combination of these values will give the best results

~~Increase the cv value~~ In all experiments, cv did not help in getting better accuracy. Use it only when you have an extremely small dataset
Increase the num_simul value
Try one of these values for simul_size = {0.1, 0.2, 0.3, 0.4}
Use non-linear models for feature selection. Apply hyper-parameter tuning on models
Increase the value of p_val_thresh in order to increase the number of features included in the Markov Blanket
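One way to "play with the values" systematically is to enumerate the combinations up front. The sketch below uses only the standard library; the simul_size values are the ones suggested above, while the num_simul and p_val_thresh values are illustrative choices and not recommendations from the author.

```python
from itertools import product

grid = {
    "num_simul": [10, 30, 50],            # illustrative values
    "simul_size": [0.1, 0.2, 0.3, 0.4],   # values suggested above
    "simul_type": [0, 1],                 # unstratified / stratified splits
    "p_val_thresh": [0.01, 0.05, 0.1],    # illustrative values
}

# One dict of keyword arguments per combination.
combos = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combos))
print(combos[0])
```

Each dictionary can then be unpacked into the constructor from the Usage section, for example PPIMBC(model=..., **params), and the resulting feature subset scored on a validation set of your choosing. Since every combination triggers a full selection run, it is worth keeping the grid small.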

For better speeds

Decrease the cv value. For large datasets cv might not be required. Therefore, set cv=0 to disable the aggregation step. This will result in less robust feature subset selection but at much faster speeds
Decrease the num_simul value but don't decrease it below 5
Set n_jobs to -1
Use linear models

For selection of fewer features

Try reducing the p_val_thresh value
Try out sig_test_type = "parametric"
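The effect of p_val_thresh on the size of the selected subset can be seen with a few made-up p-values. The feature names and numbers below are purely illustrative and the `candidates` helper is hypothetical; it only applies the "p-value below the threshold" rule from the parameter description.

```python
# Made-up p-values for four hypothetical features.
p_values = {"age": 0.001, "income": 0.03, "zip_code": 0.08, "noise": 0.47}


def candidates(p_values, p_val_thresh):
    """Features whose p-value falls below the threshold."""
    return [name for name, p in p_values.items() if p < p_val_thresh]


for thresh in (0.01, 0.05, 0.1):
    print(thresh, candidates(p_values, thresh))
```

Lowering the threshold from 0.05 to 0.01 drops "income" in this toy case, while raising it to 0.1 lets "zip_code" in.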

Performance in terms of Accuracy (classification) and MSE (regression)

Dataset	# of samples	# of features	Task Type	Score using all features	Score using featurewiz	Score using PyImpetus	# of features selected	% of features selected	Tutorial
Ionosphere	351	34	Classification	88.01%		92.86%	14	42.42%	tutorial here
Arcene	100	10000	Classification	82%		84.72%	304	3.04%
AlonDS2000	62	2000	Classification	80.55%	86.98%	88.49%	75	3.75%
slice_localization_data	53500	384	Regression	6.54		5.69	259	67.45%	tutorial here

Note: Here, for the first, second and third tasks, a higher accuracy score is better while for the fourth task, a lower MSE (Mean Squared Error) is better.

Performance in terms of Time (in seconds)

Dataset	# of samples	# of features	Time (with PyImpetus)
Ionosphere	351	34	35.37
Arcene	100	10000	1570
AlonDS2000	62	2000	125.511
slice_localization_data	53500	384	1296.13

Future Ideas

Let me know

Feature Request

Drop me an email at [email protected] if you want any particular feature

Please cite this work as

Reference to the upcoming paper will be added here

Comments

Facing exception while processing 1 Million data..

Refer the exception below

A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9), SIGKILL(-9)}

@atif-hassan

opened by aswinjose89 6

Usage with time series data

Hello. This library looks very interesting, thanks for sharing it!

Have you tested it with any time series dataset? If so, did it perform well?

I was also wondering if it makes sense to use this library as a second step in feature selection, where the first step would be PCMCI (amongst others, in an ensemble). Any thoughts?

Thank you!

opened by brunofacca 6

sklearn integration, joblib, restructuring

This pull request:

Changes package structure to a simpler one and removes junk (making sure it doesn't come back, with a .gitignore file)

Changes the inter_IAMB class to conform with scikit-learn API better. Changes include:

fit() returns self

fit(), fit_transform() take groups parameter

Renamed data and target params to X and y. X is now required to be the data without target, while y is the target Series or DF.

CV-related parameters rolled into one cv parameter, functioning like the cv parameter of scikit-learn search classes such as GridSearchCV and RandomizedSearchCV

random_seed param can now be set for reproducibility

added basic parallelism with joblib - each CV fold is now done in parallel. Can probably be extended.

class inherits from sklearn objects

private functions are prefixed with an underscore

added extensive docstrings

minor code tweaks

code has been formatted with black

updated tutorial notebook to match

Those changes improve compatibility with the rest of sklearn, allowing the transformer to be used in Pipelines. Furthermore, parallelism speeds up fitting by a good margin (though it still could be improved). Docstrings provide explanations to users.

No changes have been made to the logic itself. The changes have been tested and confirmed to be working.

opened by Yard1 0

'ascii' codec can't encode characters in position 18-20: ordinal not in range(128)

Hi, I ran into a problem when running the Regression Tutorial, using the code exactly as given in the tutorial. At the block "Modelling with Decision Tree using PyImpetus", it fails with the error 'ascii' codec can't encode characters in position 18-20: ordinal not in range(128). I don't know how to fix it.

opened by yutianfanxing 8

Enhancement request: Build a Directed Acyclic Graph

Dear Atif Hassan,

Many thanks for making available this package!

I read your publication with great interest, and I noticed that PyImpetus reports a complete Markov Blanket (parents, children, spouses) in a single run! Other algorithms require at least two runs to get such a Markov Blanket (https://cran.r-project.org/web/packages/MXM/vignettes/MMPC_tutorial.html).

I would like to submit the following enhancements that I believe can help to facilitate the interpretation of the Markov Blanket:

Be able to recognize from selected features which ones are either parents, children, or spouses

Have a functionality to build a Directed Acyclic Graph (as shown in this R package https://github.com/malcolmbarrett/ggdag)

In the Directed Acyclic Graph, the nodes corresponding to parents have a specific color, which is different from the color assigned to children or spouses. The nodes of the remaining features, which do not belong to the Markov Blanket, are plotted without color.

Kind regards,

Ivan

opened by ivan-marroquin 2

Feature importance values are similar while processing 50k samples

I have five features with target feature called 'mmsi' from dataset merged_ais_data_1618929435192.pkl.bz2

Markov Blanket: ['cog', 'sog', 'beam', 'latitude', 'longitude', 'heading', 'length']

Feature importance: [21.69653315944471, 21.69653315944471, 21.69653315944471, 21.69653315944471, 21.69653315944471, 21.69653315944471, 21.69653315944471]

The dataset is available at https://www.kaggle.com/aswinjose/ais-maritime-data (filename: merged_ais_data_1618929435192.pkl.bz2)

opened by aswinjose89 2
