A pure Python implementation of a mixed effects random forest (MERF) algorithm

Overview

Mixed Effects Random Forest

This repository contains a pure Python implementation of a mixed effects random forest (MERF) algorithm. It can be used, out of the box, to fit a MERF model and predict with it.

MERF Model

The MERF model is:

y_i = f(X_i) + Z_i * b_i + e_i

b_i ~ N(0, D)

e_i ~ N(0, R_i)

for each cluster i out of n total clusters.

In the above:

  • y_i -- the (n_i x 1) vector of responses for cluster i. These are given at training.
  • X_i -- the (n_i x p) fixed effects covariates that are associated with the y_i. These are given at training.
  • Z_i -- the (n_i x q) random effects covariates that are associated with the y_i. These are given at training.
  • b_i -- the (q x 1) vector of random effect coefficients for cluster i. These are unknown.
  • e_i -- the (n_i x 1) vector of errors for cluster i. This is unknown.
  • i is the cluster_id. This is given at training.

The learned parameters in MERF are:

  • f() -- a random forest that models the potentially nonlinear mapping from the fixed effect covariates to the response. It is common across all clusters.
  • D -- the covariance of the normal distribution from which each b_i is drawn. It is common across all clusters.
  • sigma^2 -- the variance of e_i, which is assumed to be white noise (i.e. R_i = sigma^2 * I). It is common across all clusters.

Note that one key assumption of the MERF model is that the random effect is linear. Though this is limiting in some regards, it is still broadly useful for many problems, and it is better than not modelling the random effect at all.
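
To make the notation concrete, here is a minimal simulation sketch (an editorial illustration, not part of the package) that draws data from this generative model with a single random intercept, i.e. q = 1 and Z_i equal to a column of ones; all sizes and parameter values are arbitrary assumptions:

    # Illustrative sketch only: simulate y = f(X) + Z*b + e with a random intercept.
    import numpy as np

    rng = np.random.default_rng(0)
    n_clusters, n_per_cluster, p = 10, 30, 3

    D = np.array([[2.0]])    # covariance of the random effects b_i (here 1x1)
    sigma2 = 0.5             # variance of the white noise e_i

    X = rng.normal(size=(n_clusters * n_per_cluster, p))
    Z = np.ones((n_clusters * n_per_cluster, 1))                 # random intercept
    clusters = np.repeat(np.arange(n_clusters), n_per_cluster)   # cluster ids

    def f(X):
        # some fixed, potentially nonlinear function shared by all clusters
        return np.sin(X[:, 0]) + X[:, 1] ** 2 + X[:, 2]

    b = rng.multivariate_normal(np.zeros(1), D, size=n_clusters)   # one b_i per cluster
    e = rng.normal(scale=np.sqrt(sigma2), size=len(clusters))      # white noise
    y = f(X) + np.sum(Z * b[clusters], axis=1) + e

Fitting MERF to data like this should approximately recover f via the forest, D as the spread of the per-cluster intercepts, and sigma^2 as the residual noise variance.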

The algorithms implemented in this repo were developed by Ahlem Hajjem, Francois Bellavance, and Denis Larocque and published in a paper here. Many thanks to Ahlem and Denis for providing an R reference implementation and aiding in the debugging of this code. Quick note: the published paper has a small typo in the update equation for sigma^2, which is corrected in the source code here.

Using the Code

The MERF code is modelled after scikit-learn estimators. To use it, you instantiate a MERF object. As of 1.0, you can pass any non-linear estimator for the fixed effect. By default this is a scikit-learn random forest, but you can pass any model you wish that conforms to the scikit-learn estimator API, e.g. LightGBM, XGBoost, or a properly wrapped PyTorch neural net.

Then you fit the model using training data. As of 1.0, you can also pass a validation set to see the validation performance during training. This is meant to feel similar to PyTorch, where you can view the validation loss after each epoch of training. After fitting, you can predict responses from data in either known clusters (clusters in the training set) or new clusters (clusters not in the training set).

For example:

> from merf import MERF
> merf = MERF()
> merf.fit(X_train, Z_train, clusters_train, y_train)
> y_hat = merf.predict(X_test, Z_test, clusters_test)

Alternatively:

> from lightgbm import LGBMRegressor
> lgbm = LGBMRegressor()
> mrf_lgbm = MERF(lgbm, max_iterations=15)
> mrf_lgbm.fit(X_train, Z_train, clusters_train, y_train, X_val, Z_val, clusters_val, y_val)
> y_hat = mrf_lgbm.predict(X_test, Z_test, clusters_test)

Note that training is slow because the underlying expectation-maximization (EM) algorithm requires many calls to the non-linear fixed effects model, e.g. the random forest. That being said, this implementation has early stopping, which aborts the EM algorithm if the generalized log-likelihood (GLL) stops significantly improving.
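
To illustrate the flavor of this alternating procedure (an editorial toy, not the package's actual EM implementation or its GLL criterion), the sketch below alternates between fitting a random forest on the response with the current random intercepts removed and re-estimating per-cluster intercepts from the residuals, stopping early when a simple objective plateaus. It reuses X, y, clusters, and n_clusters from the simulation sketch above:

    # Toy alternating fit with early stopping; a simplified stand-in for MERF's EM loop.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    b_hat = np.zeros(n_clusters)            # per-cluster random-intercept estimates
    prev_obj, tol = np.inf, 1e-3
    rf = RandomForestRegressor(n_estimators=100, random_state=0)

    for iteration in range(15):
        # Fit the fixed-effects learner on the response with random effects removed.
        rf.fit(X, y - b_hat[clusters])
        f_hat = rf.predict(X)

        # Re-estimate each cluster's intercept as the mean residual within that cluster.
        resid = y - f_hat
        for i in range(n_clusters):
            b_hat[i] = resid[clusters == i].mean()

        # Early stopping on a simple residual-sum-of-squares objective
        # (the real algorithm monitors the generalized log-likelihood instead).
        obj = np.sum((y - f_hat - b_hat[clusters]) ** 2)
        if prev_obj - obj < tol * prev_obj:
            break
        prev_obj = obj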

Tour of the Source Code

The merf directory contains all the source code:

  • merf.py is the key module that contains the MERF class. It is imported at the package level.
  • merf_test.py contains some simple unit tests.
  • utils.py contains a class for generating synthetic data that can be used to test the accuracy of MERF. The process implemented is the same as that in this paper.
  • viz.py contains a plotting function that takes in a trained MERF object and plots various metrics of interest.

The notebooks directory contains some useful notebooks that show you how to use the code and evaluate MERF performance. Most of the techniques implemented are the same as those in this paper.

Comments
  • Can not run examples given by notebooks

    Greetings,

    I cannot reproduce the notebooks given in ./notebooks. For example, running the code given in the Real World MERF Examples notebook throws two errors.

    When I tried to import the modules, the first error popped up:

    No module named 'merf.evaluator'

    When I tried to call mrf.fit(X_train, Z_train, clusters_train, y_train), this error popped up:

    ValueError: operands could not be broadcast together with shapes (10,) (162,)

    I also attached the full traceback below, FYI. BTW, I installed merf using pip install merf, so it should be the latest version.

    Any suggestion is appreciated.

    ValueError                                Traceback (most recent call last)
    in ()
         13 clusters_train = train['Subject']
         14 y_train = train['Reaction']
    ---> 15 mrf.fit(X_train, Z_train, clusters_train, y_train)
         16
         17 # Mixed Effects Random Forest Test

    ~/anaconda/lib/python3.6/site-packages/merf/merf.py in fit(self, X, Z, clusters, y)
        142
        143                 # Compute y_star for this cluster and put back in right place
    --> 144                 y_star_i = y_i - Z_i.dot(b_hat_i)
        145                 y_star[indices_i] = y_star_i
        146

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(left, right, name, na_op)
        713             lvalues = lvalues.values
        714
    --> 715         result = wrap_results(safe_na_op(lvalues, rvalues))
        716         return construct_result(
        717             left,

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in safe_na_op(lvalues, rvalues)
        674         try:
        675             with np.errstate(all='ignore'):
    --> 676                 return na_op(lvalues, rvalues)
        677         except Exception:
        678             if isinstance(rvalues, ABCSeries):

    ~/anaconda/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
        650         try:
        651             result = expressions.evaluate(op, str_rep, x, y,
    --> 652                                           raise_on_error=True, **eval_kwargs)
        653         except TypeError:
        654             if isinstance(y, (np.ndarray, ABCSeries, pd.Index)):

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in evaluate(op, op_str, a, b, raise_on_error, use_numexpr, **eval_kwargs)
        208     if use_numexpr:
        209         return _evaluate(op, op_str, a, b, raise_on_error=raise_on_error,
    --> 210                          **eval_kwargs)
        211     return _evaluate_standard(op, op_str, a, b, raise_on_error=raise_on_error)
        212

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_numexpr(op, op_str, a, b, raise_on_error, truediv, reversed, **eval_kwargs)
        119
        120     if result is None:
    --> 121         result = _evaluate_standard(op, op_str, a, b, raise_on_error)
        122
        123     return result

    ~/anaconda/lib/python3.6/site-packages/pandas/computation/expressions.py in _evaluate_standard(op, op_str, a, b, raise_on_error, **eval_kwargs)
         61         _store_test_result(False)
         62     with np.errstate(all='ignore'):
    ---> 63         return op(a, b)
         64
         65

    ValueError: operands could not be broadcast together with shapes (10,) (162,)

    documentation 
    opened by Yutong-Dai 5
  • TypeError: __init__() got an unexpected keyword argument 'n_estimators'

    When trying to set arguments like n_estimators on MERF, MERF.fit returns the above initialisation error. It works fine with the default settings. This seems to suggest that the MERF-to-sklearn random forest API is broken. Is this a version issue? I'm using Python 3.9.5 and scikit-learn==0.24.2.
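
    (Editorial note: as described in the Using the Code section above, with MERF 1.0+ the estimator's parameters are set on the estimator itself, which is then passed to MERF, rather than being passed to the MERF constructor. A minimal sketch:)

        from sklearn.ensemble import RandomForestRegressor
        from merf import MERF

        # Configure the fixed-effects learner directly, then hand it to MERF.
        rf = RandomForestRegressor(n_estimators=300, n_jobs=-1)
        merf = MERF(rf, max_iterations=20)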

    opened by jessie831024 3
  • Add partial dependency plotting function

    It would be nice if there was a function for creating partial dependency plot data or plots. This would help with translating information to consumers and clients.

    enhancement 
    opened by OliverFishCode 3
  • Random forest can now take user defined parameters.

    The current implementation of MERF only allows the user to set the number of trees in the random forest. This pull request allows the user to set any parameter they want, but prevents them from changing important params, like keeping the out-of-bag score on.

    I hope you accept this; it has been useful for me.

    opened by arose13 3
  • Why not expose all the parameters of the RandomForestRegressor?

    Also, would the ExtraTreesRegressor also work in this framework? It seems simple to just add a base_estimator parameter, allowing the user to use whatever he/she wants.

    I can submit a pull request if it makes sense to add these features.

    opened by jonathanng 3
  • Typechecks

    Thank you for this wonderful package and your efforts!

    For future users, would it be possible to add type checks and more verbose error statements to the fit function of the MERF class when the input type deviates from the expected one? This led to a bit of reverse engineering to figure out why the underlying linear algebra was failing when, for example, a NumPy array was passed in place of an expected pandas Series.

    (Also happy to contribute...when I have the time).

    Thanks!

    opened by JCSyng 3
  • Regarding Merf Plot by Cluster

    [screenshot attached: Screen Shot 2020-03-17 at 7:39:40 PM]

    I am using Python 3.6.5 and Pandas version 1.0.1. I receive the above when attempting to plot a specific cluster by its respective id.

    I noticed here in the example notebook that Panel is deprecated. I welcome feedback on the best way to address this.

    opened by ahlusar1989 2
  • Add options

    Add ability to specify additional parameters to RandomForestRegressor, with some minor error checking and two additional tests.

    Options are passed in as a dictionary argument, rf_opts. n_estimators is ignored in favor of the argument to the constructor; oob_score is ignored and always set to True; warm_start is ignored (assumed to be False).

    Two reasons this is helpful for me, at least: the random forest fit is often faster (particularly on my laptop) with n_jobs set to 1. I'm also hoping to investigate the shap interpretability package, but the underlying algorithms perform extremely poorly on deep trees, so the ability to specify max_depth would make that a bit easier.

    This is a repeat of another pull request, closed by me, attempting to fix formatting issues that cause continuous integration to fail.

    opened by cunningjames 2
  • Translation from MERF random effects nomenclature to align with other implementations

    First, thank you for creating this package! It is kind of exactly what I am looking for. However, I have some questions that stem from my prior experience with mixed models on other platforms (notably the lme4 package for R), and the documentation and examples aren't helping me translate my knowledge of how random effects are specified/named in MERF.

    For example, in lme4, random effects are designated as having slopes and intercepts (which I think correspond to "clusters" and "covariates" in MERF? With 1s in the covariates matrix indicating the intercepts?).

    So if "subject" (in an experiment) or "county") like in the radon example are grouping variables over which there are multiple observations, one could specify a random intercept for subject (or county). Then, one could additionally specify a random slope for some variable by the grouping variable (such as to specify a random slope for experimental condition by subject, to allow the model to estimate the variance of how much subjects differ in their response to the experimental manipulation, or that counties could have a random slope for floor, allowing counties to differ in how much each floor impacts random levels in the model). I'm not quite sure if I'm translating these to the MERF nomenclature correctly.

    Moreover, lme4 allows the researcher to specify multiple, crossed random effects (e.g., random intercepts for both experimental subjects and experimental items, as well as random slopes for variables by both subject and item).

    Getting to the point:

    1. My looking through the examples leads me to think that the clusters argument is the column containing the IDs for which random intercepts are generated: Is this the case?
    2. I get the impression that the Z matrix includes 1s for the random intercepts, but can include a second column for random slopes (i.e., the covariates): Is this the case? Can there be more columns?
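
    (Editorial sketch of the reading described in questions 1 and 2, using hypothetical column names; this is not taken from the MERF documentation, just an illustration of clusters carrying the grouping IDs and Z holding a column of ones for the intercept plus a covariate column for a slope:)

        import numpy as np
        import pandas as pd

        df = pd.DataFrame({
            "subject": [1, 1, 2, 2],       # grouping variable -> passed as clusters
            "condition": [0, 1, 0, 1],     # variable that would get a random slope
            "x1": [0.2, 0.5, 0.1, 0.9],    # fixed-effects covariate
        })

        clusters = df["subject"]                            # random intercept per subject
        Z = np.column_stack([np.ones(len(df)),              # column of 1s -> intercept
                             df["condition"].to_numpy()])   # extra column -> slope
        X = df[["x1"]]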

    More generally, some comments in the notebooks or documentation about what these variables are (concretely, and what form they must/can/cannot take, and possibly relationships to how random effects are specified in other mixed modeling packages) would be very helpful.

    1. Can MERF handle crossed random effects structures?
    2. Finally, does MERF provide variable importance measures from the fit forest (analogous to those produced in sklearn, and from randomForest and party::cforest in R)? I couldn't find mention of that in the readme or the notebooks.

    Thanks!

    documentation 
    opened by dstanner 2
  • Issue with git cloning and pip installing from github

    Thanks for an excellent package! I am very keen to try out your latest updates from the last few days (including using lightgbm, SHAP values). However, I cannot seem to pip install directly from github or clone because of this error:

    Clone failed
    				Invalid path 'data/Rossmann Store Sales | Kaggle eval.pdf'
    				unable to checkout working tree
    				warning: Clone succeeded, but checkout failed.
    				You can inspect what was checked out with 'git status'
    				and retry with 'git restore --source=HEAD :/'
    

    It seems like this error is because of the file name 'Rossmann Store Sales | Kaggle eval.pdf' in the data directory. Is it possible to rename the two pdf files in that directory?

    opened by ritviksahajpal 1
  • saving merf as a pickle

    Hi, it took me a long time to train the MERF on my dataset. After training, I tried saving it as a pickle object and reloading it to make predictions on test data. However, I got an error saying the pkl is a dictionary that doesn't have a predict method. Any ideas on how to get around this? I don't want to train the model every time I pass new test data. Thanks.

    AttributeError                            Traceback (most recent call last)
    
    <ipython-input-55-0bd2bd5ab064> in <module>()
    ----> 1 y_hat_known_merf = loaded_model.predict(X_known, Z_known, clusters_known)
          2 y_hat_known_merf
    
    AttributeError: 'dict' object has no attribute 'predict'
    
    
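    (Editorial note, not from the thread: a generic sketch of persisting the trained MERF object itself with pickle, so that the loaded object still exposes predict. This assumes the object, not a dict of its attributes, is what gets serialized.)

        import pickle

        # Save the trained model object (assumes `mrf` was already fit).
        with open("merf_model.pkl", "wb") as f:
            pickle.dump(mrf, f)

        # Load it back and predict as usual.
        with open("merf_model.pkl", "rb") as f:
            loaded_model = pickle.load(f)

        y_hat = loaded_model.predict(X_known, Z_known, clusters_known)
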
    opened by jyu-theartofml 1
  • Bump numpy from 1.18.4 to 1.22.0 in /docker

    Bumps numpy from 1.18.4 to 1.22.0.

    Release notes

    Sourced from numpy's releases.

    v1.22.0

    NumPy 1.22.0 Release Notes

    NumPy 1.22.0 is a big release featuring the work of 153 contributors spread over 609 pull requests. There have been many improvements, highlights are:

    • Annotations of the main namespace are essentially complete. Upstream is a moving target, so there will likely be further improvements, but the major work is done. This is probably the most user visible enhancement in this release.
    • A preliminary version of the proposed Array-API is provided. This is a step in creating a standard collection of functions that can be used across application such as CuPy and JAX.
    • NumPy now has a DLPack backend. DLPack provides a common interchange format for array (tensor) data.
    • New methods for quantile, percentile, and related functions. The new methods provide a complete set of the methods commonly found in the literature.
    • A new configurable allocator for use by downstream projects.

    These are in addition to the ongoing work to provide SIMD support for commonly used functions, improvements to F2PY, and better documentation.

    The Python versions supported in this release are 3.8-3.10, Python 3.7 has been dropped. Note that 32 bit wheels are only provided for Python 3.8 and 3.9 on Windows, all other wheels are 64 bits on account of Ubuntu, Fedora, and other Linux distributions dropping 32 bit support. All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays.

    Expired deprecations

    Deprecated numeric style dtype strings have been removed

    Using the strings "Bytes0", "Datetime64", "Str0", "Uint32", and "Uint64" as a dtype will now raise a TypeError.

    (gh-19539)

    Expired deprecations for loads, ndfromtxt, and mafromtxt in npyio

    numpy.loads was deprecated in v1.15, with the recommendation that users use pickle.loads instead. ndfromtxt and mafromtxt were both deprecated in v1.17 - users should use numpy.genfromtxt instead with the appropriate value for the usemask parameter.

    (gh-19615)

    ... (truncated)


    dependencies 
    opened by dependabot[bot] 0
  • Line # 96 in merf.merf.py might be better when modifying "len(indices_i)" to "sum(indices_i)"

    Hello, thanks for your great work on merf! When debugging merf, I found one line that never triggers in any case:

    
    def predict(self, X: np.ndarray, Z: np.ndarray, clusters: pd.Series):
        ...
        for cluster_id in self.cluster_counts.index:
            indices_i = clusters == cluster_id

            # If cluster doesn't exist in test data that's ok. Just move on.
            if len(indices_i) == 0:  # line 96 -- might revise to: if sum(indices_i) == 0
                continue

            # If cluster does exist, apply the correction.
            ...
    

    I marked line 96 and suggested changing "len()" to "sum()"; otherwise, this if will never run in any case, because indices_i is a pd.Series with the same shape as the input clusters.
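
    (A quick editorial illustration of that point: a boolean mask over a pd.Series always has the full length, so len() never signals an absent cluster, while sum() counts the matching rows.)

        import pandas as pd

        clusters = pd.Series([1, 1, 2, 2, 2])
        indices_i = clusters == 99        # cluster 99 does not appear in the data
        print(len(indices_i))             # 5 -> length of the whole Series
        print(indices_i.sum())            # 0 -> number of rows belonging to cluster 99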

    And if possible, I would like to ask another question about the random effect matrix Z.

    In the given example notebook, I noticed that when considering one variable as a random effect, Z is not simply composed of that one column but also has another column of ones. Why are the ones necessary?

    Thanks in advance! meng

    opened by CCChen-Menggg 0
  • Incompatibility with scikit-learn: MERF does not implement get_params

    MERF does not implement get_params. This means that scikit-learn cannot clone a MERF predictor, as needed by, e.g., RandomizedSearchCV to optimize the hyperparameters of the fixed effects model.

    opened by NegatedObjectIdentity 0
  • An issue about the random state

    Hi Dey, thank you so much for providing this useful Python package. I watched your video, which is very enlightening, and I am currently using the merf package to handle my data. I'm just wondering if it is possible to specify the random_state for MERF for reproducibility. I tried to specify random_state for RandomForestRegressor, but it still returns different results every time I run the code. Below is my code:

        mrf = MERF(fixed_effects_model=RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=2), max_iterations=20)

    Do you have any idea? Thanks, Zixiao

    opened by zixiao-yin 0
  • will the MERF select best model based on validation set?

        mrf.fit(X_train, Z_train, clusters_train, y_train, X_val, Z_val, clusters_val, y_val)
        y_hat_known = mrf.predict(X_known, Z_known, clusters_known)

    Is the mrf model the best-performing model based on X_val, Z_val, clusters_val, y_val?

    BW, Yang

    opened by baiyang4 0
Releases (1.0)
  • 1.0(May 19, 2020)

    This is a major release that modifies the MERF API in a backward-incompatible way. A number of things were added in this release. The highlights:

    • Added ability to add any fixed effects learner, e.g. gradient boosting, deep learning
    • Added ability to add validation set and track validation loss during training process
    • Updated code to work with pandas 1.0+
    • Added a plotting utility to look at results of training
    • Added documentation built using Sphinx
    • Added more unit tests for pickling, typing, plotting, etc.
    • Small bug fixes
    • Moved CI over to GitHub Actions
    • Published documentation using GitHub pages
    • Moved over to Orbyter 2.0
    • Moved to Makefile for builds, CI, linting
  • 0.3(Mar 28, 2019)

    This release contains some additions from the community:

    • Ability to pass all random forest parameters
    • Early stopping
    • Addition of CircleCI
    • Removal of some unnecessary files
Owner
Manifold
Manifold is the applied AI company. Global companies partner with Manifold through joint discovery and development programs.