Open-source feature selection repository in Python

Overview

scikit-feature

Feature selection repository scikit-feature in Python.

scikit-feature is an open-source feature selection repository in Python developed by the Data Mining and Machine Learning Lab at Arizona State University. It is built upon the widely used machine learning package scikit-learn and two scientific computing packages, NumPy and SciPy. scikit-feature contains around 40 popular feature selection algorithms, including traditional feature selection algorithms and some structural and streaming feature selection algorithms.

It serves as a platform for facilitating feature selection application, research, and comparative study. It is designed to share widely used feature selection algorithms developed in feature selection research, and to make it convenient for researchers and practitioners to perform empirical evaluation when developing new feature selection algorithms.

Installing scikit-feature

Prerequisites:

Python 2.7 and Python 3

NumPy

SciPy

scikit-learn

Steps:

For Linux users, install the repository with the following command:

python setup.py install

For Windows users, install the repository with the following command:

setup.py install
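
After installation, it can be useful to confirm that the prerequisites listed above and the package itself are importable. The following is a minimal sketch, assuming the package installs under the import name skfeature (the name that appears in the module paths quoted in the comments on this page). The helper name missing_modules is hypothetical.

```python
import importlib.util

# Import names for the prerequisites plus the package itself.
# "skfeature" is an assumption taken from the module paths quoted on this page.
REQUIRED = ["numpy", "scipy", "sklearn", "skfeature"]


def missing_modules(names):
    """Return the names that cannot be found on the current import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites and skfeature are importable.")
```

An empty result means every listed module was found. It does not check version compatibility.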

Project website

Instructions for using this repository can be found on our project webpage at http://featureselection.asu.edu/

Citation

If you find the scikit-feature feature selection repository useful in your research, please consider citing the following paper:

@article{li2018feature,
title={Feature selection: A data perspective},
author={Li, Jundong and Cheng, Kewei and Wang, Suhang and Morstatter, Fred and Trevino, Robert P and Tang, Jiliang and Liu, Huan},
journal={ACM Computing Surveys (CSUR)},
volume={50},
number={6},
pages={94},
year={2018},
publisher={ACM}
}

Contact

Jundong Li E-mail: [email protected]

Comments
  • Please add Python 3 support

    A large portion of the Python user base doesn't use Python 2 any more, and therefore can't use a package that doesn't support Python 3. Please add Python 3 support; it shouldn't take too much extra work.

    opened by rhiever 10
  • AttributeError: module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'

    I have already installed skfeature-chappers (1.0.2), but I got an AttributeError:

    AttributeError: module 'skfeature.function.similarity_based.reliefF' has no attribute 'feature_ranking'

    opened by Andy1314Chen 3
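
If an installed release does not expose feature_ranking, a ranking can still be derived directly from a score vector. The sketch below is not the library's code. It assumes the scoring function returns one score per feature and that a larger score means a more relevant feature. rank_features is a hypothetical helper.

```python
import numpy as np


def rank_features(scores):
    """Return feature indices ordered from highest to lowest score."""
    scores = np.asarray(scores, dtype=float)
    # argsort sorts ascending, so negate the scores to get descending order;
    # the stable sort keeps tied features in their original index order.
    return np.argsort(-scores, kind="stable")


scores = np.array([0.10, 0.75, 0.30])
ranking = rank_features(scores)
print(ranking)
```

If the installed function already returns ranked indices rather than scores, this step is unnecessary.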
  • Example code for calculating fisher score

    Hey! Could you please provide example code for calculating the Fisher score with the function at the path

    skfeature.function.similarity_based.fisher_score

    Could you also help me understand which class labels have to be provided, given that the function should extract the Fisher features and provide the labels?

    opened by z-saj 2
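
The Fisher score is a supervised criterion, so the labels are supplied by the caller: one class label per sample. The following sketch computes the score from its textbook definition (between-class scatter divided by within-class scatter, per feature) using NumPy only. It is not the library's implementation, and fisher_score_sketch is a hypothetical name.

```python
import numpy as np


def fisher_score_sketch(X, y):
    """Per-feature Fisher score for data X (n_samples, n_features) and labels y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for label in np.unique(y):
        Xc = X[y == label]          # samples belonging to this class
        n_c = Xc.shape[0]
        between += n_c * (Xc.mean(axis=0) - overall_mean) ** 2
        within += n_c * Xc.var(axis=0)
    # Guard against division by zero for features that are constant within every class.
    return between / np.maximum(within, 1e-12)


# Feature 0 separates the two classes well; feature 1 does not.
X = np.array([[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_score_sketch(X, y)
print(scores)
```

Here y has length n_samples, and a higher score indicates a feature whose class means are far apart relative to its spread within each class.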
  • Parallelism added to information theory based measures

    Using concurrent futures, which requires Python 3.2 and newer (released February 20th, 2011).

    Tested on Windows Server 2016 DataCenter 64 bit, with Python 3.6 using the Conda Python distribution.

    Note that on Windows the calling code needs the guard if __name__ == '__main__':. This is a general Python-on-Windows issue with spawning new processes; including the guard does not stop the code from working elsewhere.

    In my testing, this parallel version gives results consistent with the previous version.

    opened by Josh-Ring-jisc 2
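
The pattern described above can be sketched as follows. This is an illustration of per-feature scoring with concurrent.futures, not the code from the pull request. The function names, the executor_cls parameter, and the simple discrete mutual information estimate are all assumptions made for the example.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete 1-D arrays."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)      # contingency table of co-occurrences
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0                       # skip empty cells to avoid log(0)
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def _score_column(args):
    # Top-level function so that worker processes can pickle it.
    column, y = args
    return mutual_information(column, y)


def score_features(X, y, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Score every column of X against y, one task per feature."""
    X = np.asarray(X)
    tasks = [(X[:, j], y) for j in range(X.shape[1])]
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(_score_column, tasks))


# The guard keeps worker processes from re-running this block on platforms
# that start workers by importing the main module (for example Windows).
if __name__ == "__main__":
    X = np.array([[0, 1], [0, 0], [1, 1], [1, 0]])
    y = np.array([0, 0, 1, 1])
    print(score_features(X, y))
```

Passing a different executor class, such as ThreadPoolExecutor, runs the same tasks without spawning processes, which can be convenient when comparing results against a serial run.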
  • MemoryError

    Hello:

    I am using unsupervised feature selection with the Laplacian Score, but I am getting the error message below:

    Traceback (most recent call last):
      File "fs.py", line 12, in <module>
        W = construct_W.construct_W(frame, **kwargs_W)
      File "/usr/local/lib/python2.7/dist-packages/skfeature/utility/construct_W.py", line 141, in construct_W
        D = pairwise_distances(X)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1207, in pairwise_distances
        return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 1054, in _parallel_pairwise
        return func(X, Y, **kwds)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/metrics/pairwise.py", line 231, in euclidean_distances
        distances = safe_sparse_dot(X, Y.T, dense_output=True)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/extmath.py", line 184, in safe_sparse_dot
        return fast_dot(a, b)
    MemoryError

    I have updated scikit-learn, but the issue persists. Any input would be helpful.

    Regards Sayantan Guha

    opened by guhasayantan 2
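
The traceback ends inside pairwise_distances(X), which builds a dense n_samples by n_samples matrix. A float64 matrix of that shape needs 8 * n_samples^2 bytes, so memory grows quadratically with the number of samples. The sketch below estimates that cost and shows one possible way to find nearest neighbors in row chunks so that only a chunk_size by n_samples block is held at a time. It is not part of construct_W, and dense_distance_bytes and knn_in_chunks are hypothetical helpers.

```python
import numpy as np


def dense_distance_bytes(n_samples, itemsize=8):
    """Bytes needed for a dense n_samples x n_samples distance matrix."""
    return n_samples * n_samples * itemsize


def knn_in_chunks(X, k=5, chunk_size=1000):
    """Indices of the k nearest neighbors (Euclidean) of each row, computed in chunks.

    Requires k < n_samples.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sq_norms = np.einsum("ij,ij->i", X, X)
    neighbors = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Squared distances from this chunk to all samples: shape (stop - start, n).
        d2 = sq_norms[start:stop, None] - 2.0 * X[start:stop] @ X.T + sq_norms[None, :]
        # Exclude each sample from its own neighbor list.
        d2[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # Pick the k smallest per row (unordered), then order them by distance.
        part = np.argpartition(d2, k - 1, axis=1)[:, :k]
        order = np.argsort(np.take_along_axis(d2, part, axis=1), axis=1)
        neighbors[start:stop] = np.take_along_axis(part, order, axis=1)
    return neighbors


# 100,000 samples would need about 80 GB for the full float64 matrix.
print(dense_distance_bytes(100_000) / 1e9, "GB")
```

Whether a chunked neighbor search can replace the full matrix depends on which options are passed to the affinity construction, so this is a starting point rather than a drop-in fix.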
  • Fisher Score (Possible Bug?)

    In the Fisher score calculation (https://github.com/jundongl/scikit-feature/blob/48cffad4e88ff4b9d2f1c7baffb314d1b3303792/skfeature/function/similarity_based/fisher_score.py), it is mentioned that L = D - W (line 10), but in the code L is simply defined as L = W (line 38).

    I'm not sure whether this is a bug or whether I am missing something. Has anyone come across the same issue?

    opened by crixus5678 1
  • Error occurs when running an example on Python 3.

    Hi, thanks for your great work. I ran into an error when trying to run the test_MRMR.py example. It seems that some of the code is written for Python 2.x and crashes when run on Python 3.x.

    Could you please update these parts of the code so that they work on Python 3? Thanks a lot.

    opened by Williamongh 1
  • In MCFS, mod of matrix should be taken before calculating maximum value

    According to the reference paper, 'Unsupervised Feature Selection for Multi-Cluster Data' by Cai, Deng, equation 4 takes the maximum of the absolute value (mod). I think this needs to be corrected in MCFS.py at line 69, from W.max(1) to np.absolute(W).max(1).

    opened by mayankagrawal93 1
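
A small example shows why the two expressions can rank features differently when coefficients are negative. The matrix below is made up for illustration, and its layout (rows as features, columns as clusters) is an assumption.

```python
import numpy as np

# Toy coefficient matrix: one row per feature, one column per cluster (assumed layout).
W = np.array([[ 0.2, -0.9],
              [ 0.5,  0.1],
              [-0.3, -0.4]])

signed_max = W.max(1)             # largest signed coefficient per feature
abs_max = np.absolute(W).max(1)   # largest coefficient magnitude per feature

# Rank features from highest to lowest score under each definition.
signed_ranking = np.argsort(signed_max)[::-1]
abs_ranking = np.argsort(abs_max)[::-1]

print(signed_ranking)
print(abs_ranking)
```

Feature 0 has the largest magnitude (0.9) but only 0.2 as its signed maximum, so it ranks first with the absolute value and second without it.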
  • Missing LCSI. What is this, and how can I resolve this error?

    D:\prj>python findFeatures.py
    Traceback (most recent call last):
      File "findFeatures.py", line 7, in <module>
        from skfeature.function.information_theoretical_based import MIM # infogain
      File "D:\ProgramData\Anaconda3\lib\site-packages\skfeature\function\information_theoretical_based\MIM.py", line 1, in <module>
        import LCSI
    ModuleNotFoundError: No module named 'LCSI'

    opened by sdsdjr 1
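
A likely cause is that Python 3 no longer resolves a bare import of a sibling module inside a package (Python 2 did). The sketch below reproduces that behavior with a throwaway package. All package and module names in it are made up for the demonstration, and it does not touch skfeature.

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Create a tiny package with one implicit and one explicit sibling import."""
    pkg = os.path.join(root, "demo_pkg_rel")
    os.makedirs(pkg)
    files = {
        "__init__.py": "",
        "lcsi_like_helper.py": "VALUE = 42\n",
        # Python 2 style: works only if the sibling is also a top-level module.
        "implicit.py": "import lcsi_like_helper\n",
        # Python 3 style: explicit relative import of the sibling module.
        "explicit.py": "from . import lcsi_like_helper\n",
    }
    for name, body in files.items():
        with open(os.path.join(pkg, name), "w") as fh:
            fh.write(body)


def try_import(module_name):
    """Return 'ok' or the ModuleNotFoundError message for a module import."""
    try:
        importlib.import_module(module_name)
        return "ok"
    except ModuleNotFoundError as exc:
        return "ModuleNotFoundError: " + str(exc)


root = tempfile.mkdtemp()
build_demo_package(root)
sys.path.insert(0, root)
importlib.invalidate_caches()

results = {name: try_import(name)
           for name in ("demo_pkg_rel.implicit", "demo_pkg_rel.explicit")}
print(results)
```

If the same applies here, changing the import in MIM.py to an explicit relative or absolute package import, or installing a release that already does so, should resolve the error.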
  • The previous version used random sampling with duplicates

    I am using this library for my research and I noticed some instability with the algorithm.

    I then checked the code and it seemed like the algorithm was using random sampling with duplicates (when iterating over n_samples anyway...).

    So I made a change and a pull request. The detailed explanation is below:

    The previous version used random sampling with duplicate examples from the dataset. The algorithm states that it should use random sampling of m examples without duplicates, because duplicates cause unstable results: the effect of some examples is amplified while others are ignored. Since the current code sets m to n_samples (all examples), there is really no point in random sampling, because all examples are used anyway. The order in which samples are processed doesn't matter either, so I removed the random sampling and simply let the code loop through all examples.

    In the future I can modify the algorithm to take another argument m and use a proper random subset of examples (as stated in the article), but for now I did it like this.

    The algorithm is stable now. It might be cool to have this sorted on master as well, since some people are using it and it doesn't perform too well at the moment.

    I can also do the m samples subset modification later on. Let me know.

    Sincerely, Tadej Magajna

    opened by tadejmagajna 1
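
The difference between the two sampling schemes can be sketched with NumPy's random generator. This is an illustration only, not the code from the pull request. The variable names and the subset size m are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10

# Sampling WITH replacement: indices may repeat, and some samples may never be drawn.
with_replacement = rng.integers(0, n_samples, size=n_samples)

# Visiting every sample exactly once, in random order.
without_replacement = rng.permutation(n_samples)

# A random subset of m distinct samples.
m = 4
subset = rng.choice(n_samples, size=m, replace=False)

print(with_replacement)
print(without_replacement)
print(subset)
```

When m equals n_samples, sampling without replacement is just a reordering of all samples, which is why a plain loop over the samples gives the same result.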
  • Get "sparse matrix length is ambiguous; use getnnz() or shape[0]" error!

    corpus, categories = get_detail_content_category()
    
    vectorizer = TfidfVectorizer(max_df=1.0, max_features=6000, min_df=1,
                                 stop_words=get_cn_stopwords(),
                                 encoding='utf-8', decode_error='ignore',
                                 analyzer='word', tokenizer=cn_tokenize)
    
    X = vectorizer.fit_transform(corpus)
    y = categories
    
    idx = ICAP.icap(X, y, n_selected_features=1000)
    selected_X = X[:, idx[0:1000]]
    

    After I run this code, I get the error in the title. I don't know why; any help is appreciated.

    opened by hiber-niu 1
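
TfidfVectorizer.fit_transform returns a SciPy sparse matrix, and SciPy raises this TypeError when len() is called on one. The sketch below reproduces the error and shows two common ways around it. Whether passing a dense array is the right fix for icap is an assumption, and it only works if the dense matrix fits in memory.

```python
import numpy as np
from scipy import sparse

# Small stand-in for the TF-IDF matrix from the snippet above.
X = sparse.csr_matrix(np.array([[0.0, 1.5, 0.0],
                                [2.0, 0.0, 0.0]]))

try:
    len(X)                      # ambiguous for sparse matrices, so SciPy refuses
    len_error = None
except TypeError as exc:
    len_error = str(exc)

n_rows = X.shape[0]             # unambiguous way to get the number of samples
X_dense = X.toarray()           # dense NumPy array; len() works on this

print(len_error)
print(n_rows, len(X_dense))
```

For a large vocabulary, converting to dense can be costly, so reducing max_features before the conversion may be necessary.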
  • FCBF with np.zeros

    Around line 39, the FCBF code uses np.zeros((n_features, 2), dtypes='object'), which throws an error. The NumPy documentation states that the parameter should be dtype (without the 's'), so the line should be np.zeros((n_features, 2), dtype='object') instead. It was an easy enough fix: I saved a separate .py file with the fixed version and it worked.

    opened by seth602 0
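
The behavior described above can be checked in isolation. The value of n_features is arbitrary here.

```python
import numpy as np

n_features = 3

# Correct keyword: dtype
table = np.zeros((n_features, 2), dtype="object")

# Misspelled keyword: NumPy rejects unknown keyword arguments with a TypeError.
try:
    np.zeros((n_features, 2), dtypes="object")
    misspelled_error = None
except TypeError as exc:
    misspelled_error = str(exc)

print(table.dtype, table.shape)
print(misspelled_error)
```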
  • Update CFS.py

    The original function involves a lot of repeated operations, and when the number of features is large, it takes a lot of time. Using a dict R to store the calculated correlation values can improve the speed by 1-2 orders of magnitude.

    opened by poetair 0
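
The caching idea can be sketched as follows. This is not the code from the pull request: the absolute Pearson correlation stands in for whatever correlation measure CFS.py actually uses, and make_cached_correlation is a hypothetical helper. Only the dict name R follows the description above.

```python
import numpy as np


def make_cached_correlation(X):
    """Return a correlation function that memoizes results in a dict R."""
    X = np.asarray(X, dtype=float)
    R = {}                       # (i, j) with i <= j  ->  cached correlation value
    stats = {"computed": 0}      # how many values were actually computed

    def correlation(i, j):
        key = (i, j) if i <= j else (j, i)   # the measure is symmetric
        if key not in R:
            stats["computed"] += 1
            R[key] = abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
        return R[key]

    return correlation, stats


X = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0],
              [4.0, 8.0, 3.0]])
corr, stats = make_cached_correlation(X)
first = corr(0, 1)
second = corr(1, 0)              # served from the cache, nothing recomputed
print(first, second, stats["computed"])
```

A search that evaluates many overlapping feature subsets asks for the same pairs repeatedly, which is where the cache saves time.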
  • Maybe a mistake in reliefF.py

    In https://github.com/jundongl/scikit-feature/blob/master/skfeature/function/similarity_based/reliefF.py

    In the Relief-F algorithm (the figure is from https://link.springer.com/article/10.1023/A:1025667309714), the term P(C)/(1-P(Class(Ri))) is in the numerator.

    The variable corresponding to this term in the code is p_dict, as shown below:

            p_dict = dict()
            p_label_idx = float(len(y[y == y[idx]]))/float(n_samples)
    
            for label in c:
                p_label_c = float(len(y[y == label]))/float(n_samples)
                p_dict[label] = p_label_c/(1-p_label_idx)
                near_miss[label] = []
    

    So in the code that calculates the weight, p_dict should be in the numerator and not the denominator.

    The line

                score += near_miss_term[label]/(k*p_dict[label])
    

    should be

                score += (near_miss_term[label] * p_dict[label]) / k
    
    opened by agnes-yang 0
  • Question with Trace Ratio

    Based on the code that implements the trace ratio calculation, if we truncate s_within and s_between to a length of n_selected_features, then the variables "I" and "idx" will also have a length of n_selected_features. In that case, aren't we just restricting the selected features to the first n_selected_features in fs_idx, so that the loop is essentially updating/optimizing nothing?

    # preprocessing
    fs_idx = np.argsort(np.divide(s_between, s_within), 0)[::-1]
    k = np.sum(s_between[0:n_selected_features])/np.sum(s_within[0:n_selected_features])
    s_within = s_within[fs_idx[0:n_selected_features]]
    s_between = s_between[fs_idx[0:n_selected_features]]
    
    # iterate util converge
    count = 0
    while True:
        score = np.sort(s_between-k*s_within)[::-1]
        I = np.argsort(s_between-k*s_within)[::-1]
        idx = I[0:n_selected_features]
        old_k = k
        k = np.sum(s_between[idx])/np.sum(s_within[idx])
        if verbose:
            print('obj at iter {0}: {1}'.format(count+1, k))
        count += 1
        if abs(k - old_k) < 1e-3:
            break
    
    opened by crixus5678 0
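
One reading of the question is that the loop should re-rank all features on every iteration rather than only the pre-selected ones. The sketch below shows that variant under this assumption. It is not the library's code, and trace_ratio_select and its parameters are hypothetical.

```python
import numpy as np


def trace_ratio_select(s_between, s_within, n_selected_features, tol=1e-3, max_iter=100):
    """Iterative trace ratio selection that keeps ALL features as candidates."""
    s_between = np.asarray(s_between, dtype=float)
    s_within = np.asarray(s_within, dtype=float)
    # Initial guess: the features with the best individual ratios.
    idx = np.argsort(s_between / s_within)[::-1][:n_selected_features]
    k = s_between[idx].sum() / s_within[idx].sum()
    for _ in range(max_iter):
        # Re-rank every feature under the current ratio k.
        idx = np.argsort(s_between - k * s_within)[::-1][:n_selected_features]
        old_k = k
        k = s_between[idx].sum() / s_within[idx].sum()
        if abs(k - old_k) < tol:
            break
    return idx, k


# The individual ratios are 10, 2 and 1.9, so the initial guess is features 0 and 1.
s_between = np.array([10.0, 200.0, 1.9])
s_within = np.array([1.0, 100.0, 1.0])
idx, k = trace_ratio_select(s_between, s_within, n_selected_features=2)
print(sorted(idx.tolist()), k)
```

In this toy case the subset with the best trace ratio is features 0 and 2 (ratio 5.95), not the two features with the best individual ratios (ratio 210/101, about 2.08). The loop can only find that subset if all features stay available for re-ranking.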
  • UDFS Errors

    UDFS.py L97: The additive parameter \lambda should be independent from gamma (introduced in eq 8 in the paper). It should probably default to something small. It's just used to make the covariance invertible.

    Also, construction of S_i seems incorrect to me: UDFS.py L100: indexing on idx_new should be idx_new[:,q]?

    opened by choltz95 1
Owner
Jundong Li