A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction.

Overview

Package information: Python 2.7 Python 3.6 License PyPI version

Join the chat at https://gitter.im/EpistasisLab/scikit-mdr

MDR

A scikit-learn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. This project is still under active development and we encourage you to check back on this repository regularly for updates.

MDR is an effective feature construction algorithm that is capable of modeling higher-order interactions and capturing complex patterns in data sets.
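To make "higher-order interactions" concrete, here is a minimal sketch on a made-up XOR data set (not one of the GAMETES files). Neither feature predicts the class on its own, but the pair of features does. The sketch uses only NumPy and does not call the MDR package; all variable names are illustrative.

```python
import numpy as np

# Toy data (an assumption for illustration): the class is the XOR of two
# binary features, so each feature looks uninformative in isolation.
features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
labels = features[:, 0] ^ features[:, 1]

# Fraction of class-1 rows for each value of each single feature.
single_rates = {}
for col in range(2):
    for value in (0, 1):
        single_rates[(col, value)] = labels[features[:, col] == value].mean()

# Fraction of class-1 rows for each combination of the two features.
pair_rates = {}
for combo in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    mask = (features == combo).all(axis=1)
    pair_rates[combo] = labels[mask].mean()

print(single_rates)  # every entry is 0.5
print(pair_rates)    # entries are 0.0 or 1.0
```

A feature built from the combination (for example, 1 for cells with a high class-1 rate and 0 otherwise) separates the classes here, while no single original feature can.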

MDR currently only works with categorical features and supports both binary classification and regression problems. We are working on expanding the algorithm to cover more problem types and provide more convenience features.

License

Please see the repository license for the licensing and usage information for the MDR package.

Generally, we have licensed the MDR package to make it as widely usable as possible.

Installation

MDR is built on top of the following existing Python packages:

  • NumPy

  • SciPy

  • scikit-learn

  • matplotlib

All of the necessary Python packages can be installed via the Anaconda Python distribution, which we strongly recommend that you use. We also strongly recommend that you use Python 3 over Python 2 if you're given the choice.

NumPy, SciPy, scikit-learn, and matplotlib can be installed in Anaconda via the command:

conda install numpy scipy scikit-learn matplotlib

Once the prerequisites are installed, you should be able to install MDR with a pip command:

pip install scikit-mdr

Please file a new issue if you run into installation problems.

Examples

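MDR has been coded with a scikit-learn-like interface to be easy to use. The typical fit, transform, and fit_transform methods are available for every feature construction algorithm. For example, MDR can be used to construct a single new feature from existing features. The example below passes every feature column in the data set to MDR; to combine only two features, select those two columns before calling fit: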

from mdr import MDR
import pandas as pd

# Load the example data set (tab-separated, gzip-compressed)
genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

# Separate the feature columns from the 'class' column
features = genetic_data.drop('class', axis=1).values
labels = genetic_data['class'].values

# Fit MDR, then construct the new single-column feature
my_mdr = MDR()
my_mdr.fit(features, labels)
my_mdr.transform(features)
>>>array([[1],
>>>       [1],
>>>       [1],
>>>       ...,
>>>       [0],
>>>       [0],
>>>       [0]])

You can also use MDR as a classifier, and evaluate the quality of the constructed feature with the score method:

from mdr import MDRClassifier
import pandas as pd

# Load the same example data set as above
genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

features = genetic_data.drop('class', axis=1).values
labels = genetic_data['class'].values

# Fit the classifier and score it on the data it was fitted on
my_mdr = MDRClassifier()
my_mdr.fit(features, labels)
my_mdr.score(features, labels)
>>>0.998125

If you want to use MDR for regression problems, use ContinuousMDR:

from mdr import ContinuousMDR
import pandas as pd

# Load an example data set with a continuous endpoint
genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_continuous_endpoint_a_20s_1600her_0.4__maf_0.2_EDM-2_01.tsv.gz', sep='\t', compression='gzip')
# Use only two feature columns; the continuous target is in the 'Class' column
features = genetic_data[['M0P0', 'M0P1']].values
targets = genetic_data['Class'].values

# Fit ContinuousMDR, then construct the new single-column feature
my_cmdr = ContinuousMDR()
my_cmdr.fit(features, targets)
my_cmdr.transform(features)
>>>array([[0],
>>>       [1],
>>>       [1],
>>>       ...,
>>>       [0],
>>>       [1],
>>>       [1]])

Contributing to MDR

We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to the MDR package, please file a new issue so we can discuss it.

Having problems or have questions about MDR?

Please check the existing open and closed issues to see if your issue has already been attended to. If it hasn't, file a new issue on this repository so we can review your issue.

Citing MDR

If you use this software in a publication, please consider citing it. You can cite the repository directly with the following DOI:

[blank for now]

Support for MDR

The MDR package was developed in the Computational Genetics Lab with funding from the NIH. We're incredibly grateful for their support during the development of this project.

Comments
  • Confusion with documentation and MDR feature construction output

    Hi, in the first example in the README, it states:

    "For example, MDR can be used to construct a new feature composed from two existing features:"

    but "GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1" used in the example has 21 columns, not 2.

    The resulting output is a single column, which is a single feature -- is it that there's a single feature produced because that's what those 21 columns boiled down to, or is it because only 2 features from the dataframe were selected and used to construct the new feature? Or is there another reason?

    Thanks in advance! I will continue reading the MDR paper I found on pubmed in the meanwhile.

    question 
    opened by jay-reynolds 4
  • Update some tests for custom metrics

    Fixed older tests by adding hard-coded result arrays for testing fit and fit_transform.

    Added custom metrics by importing accuracy_score, zero_one_loss from sklearn.

    opened by TuanNguyen27 1
  • Error when class labels aren't [0,1]

    An error occurs when the class labels are not 0 and 1. When counting the number of cases and controls for each cell in the grid, the MDR code uses the following (lines 76-78):

        for row_i in range(features.shape[0]):
            feature_instance = tuple(features[row_i])
            self.class_count_matrix[feature_instance][classes[row_i]] += 1

    classes is the array of y values passed as a parameter. Think of class_count_matrix as a matrix of shape (# of possible feature_instances) by (# of classes). Because MDR accepts only binary labels, # of classes is always 2, so the only valid indices along that dimension are 0 and 1. If the class labels are 0 and 2 rather than 0 and 1, the program tries to index class_count_matrix[(tuple of a feature_instance)][2], which is out of bounds.

    Error message:

      File "<ipython-input-180-e1715a88facf>", line 10, in <module>
        mdr.fit(X_train, y_train)
    
      File "C:\Users\Hayley Son\Anaconda3\lib\site-packages\mdr\mdr.py", line 78, in fit
        self.class_count_matrix[feature_instance][classes[row_i]] += 1
    
    IndexError: index 2 is out of bounds for axis 0 with size 2
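One possible workaround, sketched here under assumptions rather than taken from the package, is to encode arbitrary labels as 0 and 1 before counting. The arrays and the grid shape below are made up for illustration; only the counting loop mirrors the snippet quoted in the report.

```python
import numpy as np

# Hypothetical data: labels are 0 and 2 instead of 0 and 1, and each of the
# two features takes values in {0, 1, 2}.
classes = np.array([0, 2, 2, 0, 2])
features = np.array([[0, 1], [1, 1], [0, 1], [2, 0], [1, 1]])

# return_inverse=True gives, for every label, its position in the sorted
# array of unique labels, so the encoded values are always 0..k-1.
unique_labels, encoded = np.unique(classes, return_inverse=True)
if len(unique_labels) != 2:
    raise ValueError("expected exactly two class labels")

# Same counting loop as in the report, indexed with the encoded labels.
# One axis per feature (3 values each) plus a final axis for the 2 classes.
class_count_matrix = np.zeros((3, 3, 2), dtype=int)
for row_i in range(features.shape[0]):
    feature_instance = tuple(features[row_i])
    class_count_matrix[feature_instance][encoded[row_i]] += 1

print(unique_labels, encoded)
print(class_count_matrix[1, 1])  # counts for the cell (1, 1)
```

Keeping unique_labels around makes it possible to map predictions back to the original label values afterwards.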
    
    
    bug 
    opened by hayleyson 0
  • Unit tests

    Unit tests written for init, fit, transform, and fit_transform using a hard-coded dataset. Tests for scoring have not been written yet, and the optional scoring metric from sklearn has not been implemented.

    opened by TuanNguyen27 0
  • Starting point code for MDR implementation

    Assumption(s): Labels are only binary

    Implemented:

    fit(self, features, classes): simply build a dictionary that maps each instance of the feature vector to a tuple. The tuple keeps count of how many times a particular label value appears with that instance of feature vector. Key: tuple of feature values - Value: tuple of label frequency/label counts

    transform(self, features): Once the dictionary is complete, map each feature vector to a single label: the label whose frequency ratio for that vector is greater than the standard default ratio.

    score(self, features, classes): Compare the new combined feature vector with its corresponding class labels, and count the times the two match. Output the average accuracy by averaging the match count over the length of the new feature vector / classes vector.

    Implementation is tested in main() by training MDR on the training set and getting accuracy_score on the test set.
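The fit, transform, and score steps described above can be sketched as follows. This is a hypothetical, simplified reading of the description, not the code from this pull request or from the MDR package: the class name DictMDR is invented, labels are assumed to be 0 and 1, and the "default ratio" is assumed to be the overall ratio of label-1 to label-0 rows in the training data.

```python
from collections import defaultdict

import numpy as np


class DictMDR:
    """Hypothetical sketch of a dictionary-based MDR; assumes 0/1 labels."""

    def fit(self, features, classes):
        classes = np.asarray(classes)
        # Key: tuple of feature values. Value: [count of label 0, count of label 1].
        self.counts_ = defaultdict(lambda: [0, 0])
        for row, label in zip(np.asarray(features), classes):
            self.counts_[tuple(row)][int(label)] += 1
        # Overall label counts define the assumed default ratio.
        self.n_zeros_ = int(np.sum(classes == 0))
        self.n_ones_ = int(np.sum(classes == 1))
        return self

    def transform(self, features):
        new_feature = []
        for row in np.asarray(features):
            zeros, ones = self.counts_.get(tuple(row), (0, 0))
            # Same test as ones/zeros > n_ones/n_zeros, cross-multiplied so
            # that cells without label-0 rows do not divide by zero.
            new_feature.append(int(ones * self.n_zeros_ > zeros * self.n_ones_))
        return np.array(new_feature).reshape(-1, 1)

    def score(self, features, classes):
        # Accuracy: fraction of rows where the new feature equals the label.
        predicted = self.transform(features).ravel()
        return float(np.mean(predicted == np.asarray(classes)))


# Usage on a made-up XOR data set.
toy_features = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
toy_labels = toy_features[:, 0] ^ toy_features[:, 1]
model = DictMDR().fit(toy_features, toy_labels)
print(model.transform(toy_features)[:4].ravel())
print(model.score(toy_features, toy_labels))
```

Feature combinations never seen during fit fall back to label 0 in this sketch; a real implementation would need an explicit policy for unseen cells and ties.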

    opened by TuanNguyen27 0
  • Cannot read the genetic data from github scikit-mdr library

    genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/blob/master/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

    The above command is not working; the error says: "BadGzipFile: Not a gzipped file (b'\n\n')"
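A likely cause is the '/blob/' segment in the URL: GitHub serves an HTML page at '/blob/' addresses, while the README example reads from a '/raw/' address. The sketch below rewrites the URL and shows how to check for the gzip magic number. The helper names are made up for illustration, and whether the file exists on the master branch is not verified here.

```python
import gzip

GZIP_MAGIC = b'\x1f\x8b'  # first two bytes of every gzip stream


def to_raw_url(url):
    """Turn a GitHub '/blob/' page URL into the matching '/raw/' file URL."""
    return url.replace('/blob/', '/raw/', 1)


def looks_gzipped(first_bytes):
    """Return True if the data starts with the gzip magic number."""
    return first_bytes[:2] == GZIP_MAGIC


blob_url = ('https://github.com/EpistasisLab/scikit-mdr/blob/master/data/'
            'GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz')
raw_url = to_raw_url(blob_url)
print(raw_url)

# The bytes quoted in the error message are not a gzip header, whereas
# freshly compressed data is.
print(looks_gzipped(b'\n\n'))                   # False
print(looks_gzipped(gzip.compress(b'a\tb\n')))  # True
```

The rewritten URL can then be passed to pd.read_csv with sep='\t' and compression='gzip', as in the README example.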

    opened by prashant-iiitd 0
  • Example for finding features with epistatic effects with scikit-mdr

    It seems that the utilities in mdr.utils are designed for this purpose, but there is no documentation about how to use them. I had a quick look at the code and put together a demo that calculates scores for n-way combinations, which I think may be a way to find feature combinations with epistatic effects. Please let me know if this is the correct way.

    from mdr import MDRClassifier
    import pandas as pd
    from mdr.utils import n_way_models
    import operator
    
    genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')
    
    features = genetic_data.drop('class', axis=1).values
    labels = genetic_data['class'].values
    feature_names = list(genetic_data.columns)
    
    my_mdr = MDRClassifier()
    my_mdr.fit(features, labels)
    print("Score for using all features", my_mdr.score(features, labels))
    
    #n: list (default: [2])
    #The maximum size(s) of the MDR model to generate.
    #e.g., if n == [3], all 3-way models will be generated.
    n = [2]
    mdr_score_list = []
    #  Note that this function performs an exhaustive search through all feature combinations and can be computationally expensive.
    for _, mdr_model_score, model_features in n_way_models(my_mdr, features, labels, n=n, feature_names=feature_names):
        mdr_score_list.append((model_features, mdr_model_score))
    mdr_score_list.sort(key=operator.itemgetter(1), reverse=True)
    print("The combination with highest score:", mdr_score_list[0])
    

    Exported output:

    Score for using all features 0.998125
    The combination with highest score: (['P1', 'P2'], 0.793125)
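To get a feel for why the exhaustive search can be expensive, the sketch below counts how many n-way feature combinations there are for a given number of features. It uses only the standard library (math.comb requires Python 3.8 or later); the feature names are placeholders, not the column names of the data set.

```python
from itertools import combinations
from math import comb

# Placeholder names standing in for 20 feature columns.
feature_names = [f'f{i}' for i in range(20)]

# Every unordered pair of features is one candidate 2-way model.
pairs = list(combinations(feature_names, 2))
print(len(pairs))  # 190

# The number of models grows quickly with the feature count and model size.
model_counts = {}
for n_features in (20, 1000):
    for order in (2, 3):
        model_counts[(n_features, order)] = comb(n_features, order)
        print(n_features, order, model_counts[(n_features, order)])
```

With 20 features there are 190 two-way and 1140 three-way models; with 1000 features the counts are 499500 and 166167000.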
    
    opened by weixuanfu 1
  • reduce 200x200000 into 200x1000

    Hi, I have a ChIP-seq style dataset of RPKM values that I want to reduce from 200x200000 into 200x1000, so that I only end up with 1000 variables at the end of the MDR process, for my 200 records.

    What would be the recommended way to use scikit-mdr for this task?

    question 
    opened by avilella 3
  • cannot import module BaggingClassifier

    I tried to install scikit-mdr on an Ubuntu 14.04 Linux via pip install but got this error below. To make sure it wasn't a versions issue with scikit-learn, I did a sudo pip install -U scikit-learn, which completed successfully, then tried to load MDR on a python console. See below.

    Any ideas?

    Successfully installed scikit-learn
    Cleaning up...
    avilella@ubuntu14:~$ 
    avilella@ubuntu14:~$ python
    Python 2.7.6 (default, Oct 26 2016, 20:30:19) 
    [GCC 4.8.4] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from mdr import MDR
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python2.7/dist-packages/mdr/__init__.py", line 23, in <module>
        from .mdr_ensemble import MDREnsemble
      File "/usr/local/lib/python2.7/dist-packages/mdr/mdr_ensemble.py", line 26, in <module>
        from sklearn.ensemble import BaggingClassifier
    ImportError: cannot import name BaggingClassifier
    
    question 
    opened by avilella 3
  • Divide utility functions into separate modules

    Instead of keeping all of the utility functions in the same utils.py file, break them out into separate submodules. This will help prevent situations where, for example, a user ends up importing matplotlib when they're only using the n_way_models function (which doesn't use matplotlib).
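Besides splitting the file into submodules, another common way to avoid the unwanted import is to load matplotlib lazily inside the function that needs it. The sketch below illustrates that idea with hypothetical function names (top_models, plot_scores); it is not the layout of mdr.utils.

```python
def top_models(scores, k=1):
    """Return the k best (feature_names, score) pairs; needs no plotting library."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]


def plot_scores(scores):
    """Bar chart of model scores; matplotlib is imported only when this runs."""
    import matplotlib.pyplot as plt  # lazy import, local to this function

    fig, ax = plt.subplots()
    ax.bar(range(len(scores)), list(scores.values()))
    ax.set_xticks(range(len(scores)))
    ax.set_xticklabels([' x '.join(names) for names in scores])
    ax.set_ylabel('score')
    return fig


# Made-up scores keyed by feature-name tuples.
example_scores = {('P1', 'P2'): 0.79, ('P0', 'P1'): 0.52, ('P0', 'P2'): 0.50}
best = top_models(example_scores)
print(best)
```

Calling top_models never touches matplotlib, so users who only rank models do not need it installed.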

    enhancement 
    opened by pschmitt52 0
Owner
Epistasis Lab at UPenn
Prof. Jason H. Moore's research lab at the University of Pennsylvania