Generalized Random Forests

Overview


A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods for least-squares regression, quantile regression, survival regression, and treatment effect estimation (optionally using instrumental variables), with support for missing values.

In addition, GRF supports 'honest' estimation (where one subset of the data is used for choosing splits, and another for populating the leaves of the tree), and confidence intervals for least-squares regression and treatment effect estimation.
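
As a minimal sketch of honest estimation with confidence intervals, one might fit a regression forest and request variance estimates. The simulated data and the variable names below are assumptions made for illustration, not part of the package.

```r
library(grf)

# Illustrative simulated data (an assumption for this sketch).
set.seed(42)
n <- 500
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

# honesty = TRUE is the default; it is spelled out here for clarity.
rf <- regression_forest(X, Y, honesty = TRUE, num.trees = 2000)

# Out-of-bag predictions together with variance estimates.
rf.hat <- predict(rf, estimate.variance = TRUE)
sigma.hat <- sqrt(rf.hat$variance.estimates)

# Pointwise 95% confidence intervals.
ci.lower <- rf.hat$predictions - 1.96 * sigma.hat
ci.upper <- rf.hat$predictions + 1.96 * sigma.hat
head(cbind(ci.lower, ci.upper))
```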


The repository first started as a fork of the ranger repository -- we owe a great deal of thanks to the ranger authors for their useful and free package.

Installation

The latest release of the package can be installed through CRAN:

install.packages("grf")

conda users can install from the conda-forge channel:

conda install -c conda-forge r-grf

The current development version can be installed from source using devtools.

devtools::install_github("grf-labs/grf", subdir = "r-package/grf")

Note that to install from source, a compiler that implements C++11 is required (clang 3.3 or higher, or g++ 4.8 or higher). If installing on Windows, the RTools toolchain is also required.

Usage Examples

The following script demonstrates how to use GRF for heterogeneous treatment effect estimation. For examples of how to use other types of forest, such as those for quantile regression and causal effect estimation using instrumental variables, please consult the R documentation on the relevant forest methods (quantile_forest, instrumental_forest, etc.).

library(grf)

# Generate data.
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)

# Train a causal forest.
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
tau.forest <- causal_forest(X, Y, W)

# Estimate treatment effects for the training data using out-of-bag prediction.
tau.hat.oob <- predict(tau.forest)
hist(tau.hat.oob$predictions)

# Estimate treatment effects for the test sample.
tau.hat <- predict(tau.forest, X.test)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 2)

# Estimate the conditional average treatment effect on the full sample (CATE).
average_treatment_effect(tau.forest, target.sample = "all")

# Estimate the conditional average treatment effect on the treated sample (CATT).
average_treatment_effect(tau.forest, target.sample = "treated")

# Add confidence intervals for heterogeneous treatment effects; growing more trees is now recommended.
tau.forest <- causal_forest(X, Y, W, num.trees = 4000)
tau.hat <- predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat <- sqrt(tau.hat$variance.estimates)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 1)

# In some examples, pre-fitting models for Y and W separately may
# be helpful (e.g., if different models use different covariates).
# In some applications, one may even want to get Y.hat and W.hat
# using a completely different method (e.g., boosting).

# Generate new data.
n <- 4000
p <- 20
X <- matrix(rnorm(n * p), n, p)
TAU <- 1 / (1 + exp(-X[, 3]))
W <- rbinom(n, 1, 1 / (1 + exp(-X[, 1] - X[, 2])))
Y <- pmax(X[, 2] + X[, 3], 0) + rowMeans(X[, 4:6]) / 2 + W * TAU + rnorm(n)

forest.W <- regression_forest(X, W, tune.parameters = "all")
W.hat <- predict(forest.W)$predictions

forest.Y <- regression_forest(X, Y, tune.parameters = "all")
Y.hat <- predict(forest.Y)$predictions

forest.Y.varimp <- variable_importance(forest.Y)

# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars <- which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)

tau.forest <- causal_forest(X[, selected.vars], Y, W,
                            W.hat = W.hat, Y.hat = Y.hat,
                            tune.parameters = "all")

# Check whether causal forest predictions are well calibrated.
test_calibration(tau.forest)
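
For one of the other forest types mentioned above, a quantile forest can be sketched along the same lines. The data-generating process and the names q.levels and q.mat are assumptions for illustration; the shape of the value returned by predict has differed between grf versions, so the sketch handles both a list and a bare matrix.

```r
library(grf)

set.seed(1)
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
# The noise scale depends on X[, 1], so the conditional quantiles vary with it.
Y <- X[, 1] + (1 + (X[, 1] > 0)) * rnorm(n)

X.test <- matrix(0, 21, p)
X.test[, 1] <- seq(-2, 2, length.out = 21)

# Train a quantile forest and predict three conditional quantiles.
q.levels <- c(0.1, 0.5, 0.9)
q.forest <- quantile_forest(X, Y, quantiles = q.levels)
q.hat <- predict(q.forest, X.test, quantiles = q.levels)

# Recent grf versions return a list with a `predictions` matrix;
# older ones returned the matrix directly, so handle both cases.
q.mat <- if (is.list(q.hat)) q.hat$predictions else q.hat

# Plot the three estimated quantile curves against the first covariate.
matplot(X.test[, 1], q.mat, type = "l", lty = 1, xlab = "x", ylab = "quantile")
```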

Developing

In addition to providing out-of-the-box forests for quantile regression and causal effect estimation, GRF provides a framework for creating forests tailored to new statistical tasks. If you'd like to develop using GRF, please consult the algorithm reference and development guide.

Funding

Development of GRF is supported by the National Science Foundation, the Sloan Foundation, the Office of Naval Research (Grant N00014-17-1-2131) and Schmidt Futures.

References

Susan Athey and Stefan Wager. Estimating Treatment Effects with Causal Forests: An Application. Observational Studies, 5, 2019. [paper, arxiv]

Susan Athey, Julie Tibshirani and Stefan Wager. Generalized Random Forests. Annals of Statistics, 47(2), 2019. [paper, arxiv]

Yifan Cui, Michael R. Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. Estimating Heterogeneous Treatment Effects with Right-Censored Data via Causal Survival Forests. 2020. [arxiv]

Rina Friedberg, Julie Tibshirani, Susan Athey, and Stefan Wager. Local Linear Forests. Journal of Computational and Graphical Statistics, 2020. [paper, arxiv]

Imke Mayer, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager and Julie Josse. Doubly Robust Treatment Effect Estimation with Missing Attributes. Annals of Applied Statistics, 14(3), 2020. [paper, arxiv]

Stefan Wager and Susan Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523), 2018. [paper, arxiv]

Comments
  • Using causal_forest to estimate average treatment effects over subgroups


    After reading a few of the papers on causal forests, it seems like the idea is the trees will decide to split more often on variables that cause the treatment effect to vary (that is, moderators of the experimental effect). However, the object returned by causal_forest seems to work like any other machine learning algorithm focused on prediction—not causal statistical inference—in the way the predict function associated with it works.

    How can I use the causal_forest function to find where the treatment varies?

    For example, I simulate a randomized experiment with a binary outcome where there is a positive effect for women and a negative effect for black men. There are also nuisance variables in here. How would I use causal_forest to tell me something like, "The trees are tending to split most on gender, and then when it leads to male, the trees are also more likely to split on race"? This would help show me where the treatment effects are occurring.

    Here are the data:

    set.seed(1839)
    n <- 5000
    X <- data.frame(
      condition = factor(sample(c("control", "treatment"), n, TRUE)),
      gender = factor(sample(c("male", "female"), n, TRUE)),
      race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
      generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
      has_kids = factor(sample(c("no", "yes"), n, TRUE))
    )
    y <- factor(ifelse(
      X$gender == "female" & X$condition == "treatment",
      sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
      ifelse(X$race == "black" & X$condition == "treatment",
        sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
        sample(c("positive", "negative"), n, TRUE, c(.40, .60))
      )
    ))
    

    And then I fit a model using grf::causal_forest:

    library(grf)
    mod <- causal_forest(
      model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
      (as.numeric(y) - 1), # convert outcome to 0 or 1
      (as.numeric(X[, 1]) - 1), # convert treatment to 0 or 1
      num.trees = 4000
    )
    

    Looking at mod, I see variable importance values that hint at where the splits are occurring most:

    Variable importance: 
        1     2     3     4     5     6     7     8 
    0.233 0.203 0.267 0.061 0.055 0.046 0.081 0.054 
    

    However, this doesn't capture how multiple X variables may depend on one another in their interaction with W on Y (what may be called a three-way interaction in the general linear model world).

    If I use the predict() function, I can only look at individual-level conditional treatment effects and their associated variance estimates. For example, just the first row of my data:

    predict(mod, model.matrix(~ -1 + ., X[1, -1]), estimate.variance = TRUE)
    

    However, how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?

    Additionally, can the causal_forest function tell me where these variations are most likely to occur? It seems like this is possible from the papers I have read on causal forests (as well as earlier papers on causal trees and transformed outcome trees, etc.), but I may very well be mistaken.

    help wanted feature question 
    opened by markhwhiteii 26
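
One possible way to approach the subgroup question in the report above is the subset argument of average_treatment_effect, which returns a doubly robust estimate and standard error for the selected rows. The sketch below uses simplified simulated data with a continuous outcome; the covariates female, black and noise are hypothetical stand-ins for the dummies in the report, not the reporter's data.

```r
library(grf)

set.seed(1839)
n <- 2000
# Hypothetical binary covariates standing in for gender and race dummies.
female <- rbinom(n, 1, 0.5)
black <- rbinom(n, 1, 0.25)
noise <- rnorm(n)
X <- cbind(female = female, black = black, noise = noise)
W <- rbinom(n, 1, 0.5)
# Positive effect for women, negative effect for Black men.
tau <- 0.5 * female - 0.5 * (1 - female) * black
Y <- tau * W + rnorm(n)

cf <- causal_forest(X, Y, W, num.trees = 2000)

# Doubly robust average effect and standard error within each subgroup.
ate.women <- average_treatment_effect(cf, subset = X[, "female"] == 1)
ate.black.men <- average_treatment_effect(
  cf, subset = X[, "female"] == 0 & X[, "black"] == 1
)

# Approximate 95% confidence interval for the subgroup of women.
ate.women["estimate"] + c(-1.96, 1.96) * ate.women["std.err"]
```

Averaging the individual predictions within a group gives a point estimate but no standard error for the group average, which is why the sketch relies on average_treatment_effect instead.
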
  • Continuous Treatment Memory Error


    Description of the bug

    I am estimating a causal forest on a large dataset (around 2-3 GB with 1.6 million observations and 8 independent variables, as well as 1 dependent variable) with 4000 trees. When I use a binary treatment W, the forest runs fine. However, when I switch to a continuous treatment W, R crashes.

    I am running it on an HPC with 32 CPUs and 12GB RAM per CPU.

    The error message is (I masked some of my memory addresses with *):

    *** Error in `/PATH/lib/R/bin/exec/R': break adjusted to free malloc space: 0x000******** ***
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x82257)[0x7f5*********]
    /lib64/libc.so.6(+0x82cea)[0x7f5*********]
    /lib64/libc.so.6(__libc_malloc+0xc7)[0x7f5*********]
    /usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_alloc.so(malloc+0x22)[0x7f5*********]
    /PATH/lib/R/bin/exec/../../../libstdc++.so.6(_Znwm+0x15)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNSt6vectorImSaImEE17_M_default_appendEm+0xc3)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf4Tree15find_leaf_nodesERKNS_4DataERKSt6vectorImSaImEE+0xb7)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer21repopulate_leaf_nodesERKSt10unique_ptrINS_4TreeESt14default_deleteIS2_EERKNS_4DataERKSt6vectorImSaImEEb+0xe0)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer5trainERKNS_4DataERNS_13RandomSamplerERKSt6vectorImSaImEERKNS_11TreeOptionsE+0x408)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer14train_ci_groupERKNS_4DataERNS_13RandomSamplerERKNS_13ForestOptionsE+0x135)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer11train_batchEmmRKNS_4DataERKNS_13ForestOptionsE+0x1a9)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIS0_IN3grf4TreeESt14default_deleteISA_EESaISD_EEEES3_ENSt6thread8_InvokerISt5tupleIJMNS9_13ForestTrainerEKFSF_mmRKNS9_4DataERKNS9_13ForestOptionsEEPKSL_mmSO_SP_EEEESF_EEE9_M_invokeERKSt9_Any_data+0x67)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(+0x526e9)[0x7f5*********]
    /lib64/libpthread.so.0(+0x620b)[0x7f5*********]
    /PATH/lib/R/library/grf/libs/grf.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJMN3grf13ForestTrainerEKFSt6vectorISt10unique_ptrINS5_4TreeESt14default_deleteIS9_EESaISC_EEmmRKNS5_4DataERKNS5_13ForestOptionsEEPKS6_mmSH_SI_EEEESE_EC4EOSQ_EUlvE_EEEEE6_M_runEv+0x116)[0x7f5*********]
    /PATH/lib/R/bin/exec/../../../libstdc++.so.6(+0xc819d)[0x7f5*********]
    /usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35eeb)[0x7f5*********]
    /lib64/libpthread.so.0(+0x7ea5)[0x7f5*********]
    /usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35cad)[0x7f5*********]
    /lib64/libc.so.6(clone+0x6d)[0x7f5*********]
    

    Does continuous treatment consume more memory than binary treatment?

    GRF version development

    bug 
    opened by ginward 20
  • GRF refuses to report the average treatment effect when the dataset is imbalanced


    Description of the bug

    Download the dataset at this link. This is a dataset by LaLonde (1986), Dehejia and Wahba (1999) on the NSW training program. I am not sure if there is copyright - if so it belongs to them.

    The data set is very imbalanced: there are around 10000 untreated examples and fewer than 300 treated examples.

    When I run the code below in the development version, average_treatment_effect(forest, target.sample = "all") returns NaN. However, the same code returns numbers in the released version 0.10.2; it only returns NaN in the development branch.

    Is this supposed to be a feature (i.e. returning NaN when the data is very imbalanced), or is it actually a bug?

    The average treatment effect for the treated and the overlap-weighted treatment effect both look normal, although the numbers also differ slightly from the released version.

    Steps to reproduce

    # Compile the grf development code and install it into the library first.
    library(grf)
    library(dplyr)
    select <- dplyr::select

    TREES_CPS <- 1000
    SEED <- 6000
    set.seed(SEED)

    dat <- read.csv("bug.csv")
    X <- as.matrix(select(dat, re74, re75, age, education, black, hispanic, married, nodegree))
    forest <- causal_forest(X = X, Y = dat$re78, W = dat$treat,
                            num.trees = TREES_CPS, seed = SEED)
    average_treatment_effect(forest, target.sample = "all")
    

    Output in Development Version

    > average_treatment_effect(forest, target.sample = "all")
    estimate  std.err 
         NaN      NaN 
    

    Output in Release Version

    > average_treatment_effect(forest, target.sample = "all")
     estimate   std.err 
    -2058.140  1401.162 
    

    GRF version development

    question 
    opened by ginward 18
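
When treated units are rare, two diagnostics may be worth a look: the estimated propensity scores stored in the forest's W.hat field, and the overlap-weighted estimate requested with target.sample = "overlap". The sketch below runs on simulated data, not the dataset from the report, and makes no claim about the cause of the NaN described there.

```r
library(grf)

set.seed(6000)
n <- 4000
p <- 4
X <- matrix(rnorm(n * p), n, p)
# Strongly imbalanced assignment: few treated units overall.
e <- 1 / (1 + exp(3 - 2 * X[, 1]))
W <- rbinom(n, 1, e)
Y <- X[, 2] + W + rnorm(n)

cf <- causal_forest(X, Y, W, num.trees = 1000)

# Share of treated units, and the range of estimated propensities.
# Estimated propensities very close to 0 or 1 signal poor overlap.
mean(W)
range(cf$W.hat)

# Overlap-weighted estimate, which down-weights units with extreme propensities.
ate.overlap <- average_treatment_effect(cf, target.sample = "overlap")
ate.overlap
```
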
  • Address poor performance of honest forests on small datasets.


    When honesty is enabled, the training subsample is further split in half before performing splitting. With small datasets, this may not leave enough information for the algorithm to determine high-quality splits.

    This issue is still pending a concrete proposal on how it should be addressed.

    feature 
    opened by jtibshirani 18
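
To see how the honesty settings play out on a small sample, one can compare out-of-bag error across a few configurations. This sketch assumes a grf version that provides the honesty.fraction argument (the share of each subsample used for choosing splits); the simulated data and the forest names are illustrative.

```r
library(grf)

set.seed(7)
n <- 150
p <- 4
X <- matrix(rnorm(n * p), n, p)
Y <- 2 * (X[, 1] > 0) + rnorm(n, sd = 0.5)

# Default honesty: half of each subsample chooses splits, half fills leaves.
rf.default <- regression_forest(X, Y)

# Give a larger share of each subsample to split selection.
rf.more.split <- regression_forest(X, Y, honesty.fraction = 0.7)

# Disable honesty altogether (the same data is used for both tasks).
rf.adaptive <- regression_forest(X, Y, honesty = FALSE)

# Compare out-of-bag mean squared error across the three settings.
oob.mse <- sapply(
  list(default = rf.default, more.split = rf.more.split, adaptive = rf.adaptive),
  function(f) mean((predict(f)$predictions - Y)^2)
)
oob.mse
```

Which setting does best depends on the data, so the comparison is meant as a diagnostic rather than a recommendation.
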
  • causal_forest without orthogonalization


    I understand that the default option of the causal_forest function is for it to use orthogonalization; that is, it incorporates both the actual values of the outcome (Y) and treatment (W), as well as the predicted values of the outcome (Y.hat) and treatment (W.hat) to estimate CATEs as well as to make splits throughout the forest. However, I have a question related to this:

    Q: Is there any way to turn "off" the orthogonalization? That is, for demonstration purposes (not practical purposes), I want to simulate how the performance of causal_forest degrades if it is trained on only the actual values of the outcome (Y) and treatment (W), and not the residual values (i.e., actual - predicted).

    opened by njawadekar 17
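
One way to approximate such a demonstration, offered as an assumption rather than an officially documented switch, is to supply constant nuisance estimates through the Y.hat and W.hat arguments so that Y and W enter the forest uncentered. The confounded simulation below is illustrative.

```r
library(grf)

set.seed(11)
n <- 2000
p <- 5
X <- matrix(rnorm(n * p), n, p)
# Confounded assignment: X[, 1] drives both treatment and outcome.
e <- 1 / (1 + exp(-X[, 1]))
W <- rbinom(n, 1, e)
tau <- pmax(X[, 2], 0)
Y <- 2 * X[, 1] + tau * W + rnorm(n)

# Default: Y.hat and W.hat are estimated and used to residualize Y and W.
cf.orth <- causal_forest(X, Y, W, num.trees = 1000)

# Constant (zero) nuisance estimates: Y and W are effectively left as they are.
cf.raw <- causal_forest(X, Y, W,
                        Y.hat = rep(0, n), W.hat = rep(0, n),
                        num.trees = 1000)

# Compare mean squared error of the out-of-bag CATE estimates.
mse.orth <- mean((predict(cf.orth)$predictions - tau)^2)
mse.raw <- mean((predict(cf.raw)$predictions - tau)^2)
c(orthogonalized = mse.orth, raw = mse.raw)
```

Functions that rely on the propensity estimates, such as average_treatment_effect, are not meaningful for the second forest because its W.hat is an artificial constant.
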
  • Questions about the function "causal_survival_forest"

    Hello, I am trying to use the Causal survival forest prediction function. I have some questions about the "causal_survival_forest" function as follows:

    1. I cannot use the arguments "target" and "horizon".
    2. In the details of the function (quoted below), why is the thresholding "'D[Y >= Y.max] <- 1' and 'Y[Y >= Y.max] <- Y.max'" and not "'D[Y >= Y.max] <- 0' and 'Y[Y >= Y.max] <- Y.max'"?

       Details: An important assumption for identifying the conditional average treatment effect tau(X) is that there exists a fixed positive constant M such that the probability of observing an event time past the maximum follow-up time Y.max is at least M. This may be an issue with data where most endpoint observations are censored. The suggested resolution is to re-define the estimand as the treatment effect up to some suitable maximum follow-up time Y.max. One can do this in practice by thresholding Y before running causal_survival_forest: 'D[Y >= Y.max] <- 1' and 'Y[Y >= Y.max] <- Y.max'. For details see Cui et al. (2020). The computational complexity of this estimator scales with the cardinality of the event times Y. If the number of samples is large and the Y grid dense, consider rounding the event times (or supply a coarser grid with the 'failure.times' argument).
    3. When I used the "causal_survival_forest" function to analyze my RCT (randomized controlled trial) data, I expected tau(X) = E[Y(1) - Y(0) | X = x] > 0 (1: treatment arm; 0: control arm), where E[Y] is the expected survival time, i.e. the integral of the survival function. However, the analysis gave tau(X) = E[Y(1) - Y(0) | X = x] < 0, contrary to what I expected, which seems very unreasonable.

    Could you give me an answer?

    Thanks very much! Xin

    opened by ChenXinaha 17
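
The thresholding step quoted from the documentation can be illustrated in base R. Once the estimand is redefined in terms of the survival time capped at Y.max, that capped time is known exactly for anyone still under observation at Y.max, which is why such rows are marked as events (D = 1) rather than censored. The toy values and the choice Y.max = 5 are assumptions for illustration.

```r
set.seed(3)
n <- 10
# Hypothetical follow-up times and event indicators (1 = event, 0 = censored).
Y <- round(rexp(n, rate = 0.2), 1)
D <- rbinom(n, 1, 0.6)

# Hypothetical maximum follow-up time chosen by the analyst.
Y.max <- 5

# Apply the thresholding described in the quoted documentation:
# mark observations at or beyond Y.max as events first, then cap the times.
D[Y >= Y.max] <- 1
Y[Y >= Y.max] <- Y.max

cbind(Y, D)
```

The order of the two assignments matters: the indicator has to be updated while Y still holds the original times.
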
  • add estimate_counterfactual_outcomes for W in {0,1} #403


    A start to providing the functionality requested in https://github.com/grf-labs/grf/issues/403#issuecomment-486915981:

    Finally, it would be nice to provide a short function which performs the above steps, or at least add this information to our algorithm reference.

    This does not include the general calculation if multiple, or non {0,1} outcomes exist, but happy to consider that if helpful.

    opened by ras44 17
  • Difference between predict() and average_treatment_effect() for calculating CATEs in honest causal_forest


    Hi, I am writing this in an effort to better understand the predict() and average_treatment_effect() functions, particularly in regards to when one should be utilized over the other to estimate conditional average treatment effects (CATEs) in an honest causal forest. Additional details related to this query are listed below:

    (1) Research Goal: After building an honest causal_forest on my dataset, I would now like to calculate Conditional Average Treatment Effects (CATEs) within specific strata of covariates on the same dataset.

    (2) Initial Plan: Based on this application paper by Athey & Wager, it seems that I should be using the predict() function in order to estimate these CATEs using an "honest" approach. Based on the documentation on predict(), it appears that by default, this function estimates the treatment effects such that these effects are estimated for every observation using only the trees in the forest which did not use that particular observation when it was modeled--so, out-of-bag estimation.

    (3) Question: However, I understand that there is additionally an average_treatment_effect function, which can also supposedly estimate Conditional Average Treatment Effects in a causal forest in a doubly robust fashion. I would like to better understand the differences between these two functions (predict() and average_treatment_effect()), and the different circumstances in which one function should be used over the other to estimate CATEs on data within an honest causal forest. Evidently, the math behind each function differs, as shown in my attached code that I ran on a mock dataset. This attached R code can be used to reproduce the very different conditional average treatment effects that I calculated for a specific subset of individuals when I used the predict() approach vs. average_treatment_effect().

    In addition to explaining which function is better for calculating conditional average treatment effects in various circumstances, could someone also please explain in layman's terms what each function is doing behind the scenes? For example, is the average_treatment_effect function using all of the trees to estimate the treatment effects (and not out-of-bag?)? Also - how are propensity scores utilized for the average_treatment_effect function?

    Thanks!

    Steps to reproduce

    Please find the attached code: cates_cf.txt

    GRF version 2.0.2

    opened by njawadekar 16
  • Add cobalt-style balance plots to causal forests


    For propensity score matching, there are some ways to test whether the samples are indeed matched well.

    Are there similar methods to examine the quality of matching for the causal forest?

    Thanks a lot for your time!

    help wanted feature 
    opened by ZhangMengxia 16
  • Account for hierarchical structure


    Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

    opened by lminer 15
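
For grouped data of the kind described above, later grf versions accept a clusters argument. The sketch below assumes such a version; the cluster layout, the random effect and the names cluster.id and cf.clustered are illustrative.

```r
library(grf)

set.seed(5)
n.clusters <- 50
cluster.size <- 40
n <- n.clusters * cluster.size
cluster.id <- rep(seq_len(n.clusters), each = cluster.size)

p <- 4
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
# A cluster-level random effect induces within-cluster correlation.
cluster.effect <- rnorm(n.clusters)[cluster.id]
Y <- cluster.effect + X[, 1] + W * (1 + X[, 2]) + rnorm(n)

# Passing clusters makes the subsampling operate at the cluster level.
cf.clustered <- causal_forest(X, Y, W, clusters = cluster.id, num.trees = 1000)

# Average treatment effect with a standard error that reflects the clustering.
ate.clustered <- average_treatment_effect(cf.clustered)
ate.clustered
```
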
  • Question: How can I use the package to build a single causal tree?


    I'm trying to build a single causal tree using the following code:

    model <- causal_forest(X_train, y_train, z_train, num.trees = 1)

    However, I noticed that the causal_forest method has the parameter sample.fraction, which defines the fraction of the data that is used to build each tree (and is 0.5 by default). Because I want to use the entire data set to build the causal tree, I want to set this to 1, but when I run the following code:

    model <- causal_forest(X_train, y_train, z_train, num.trees = 1, sample.fraction=1)

    I get the following error message:

    "Error in causal_train(data$default, data$sparse, outcome.index, treatment.index, : When confidence intervals are enabled, the sampling fraction must be less than 0.5."

    Could you please tell me how to disable confidence intervals in order to build a tree using the entire sample? Thanks in advance!

    question 
    opened by ferlocar 14
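
The error message in the report above refers to the grouped-tree scheme used for confidence intervals, which is controlled by the ci.group.size argument. A sketch of a single tree grown on the full sample might therefore look as follows; the simulated data is illustrative, and the nuisance estimates Y.hat and W.hat are supplied explicitly (an assumption of this sketch) so that they do not depend on the single-tree settings.

```r
library(grf)

set.seed(9)
n <- 1000
p <- 4
X_train <- matrix(rnorm(n * p), n, p)
z_train <- rbinom(n, 1, 0.5)
y_train <- X_train[, 1] + z_train * (X_train[, 2] > 0) + rnorm(n)

# Nuisance estimates computed separately with default settings.
Y.hat <- predict(regression_forest(X_train, y_train))$predictions
W.hat <- rep(0.5, n)  # known here because treatment is randomized

# ci.group.size = 1 turns off the grouping of trees for variance estimates,
# so sample.fraction is no longer restricted to at most 0.5.
model <- causal_forest(X_train, y_train, z_train,
                       Y.hat = Y.hat, W.hat = W.hat,
                       num.trees = 1,
                       sample.fraction = 1,
                       ci.group.size = 1)

# Inspect the single tree.
tree <- get_tree(model, 1)
print(tree)

# With every observation used for training there is no out-of-bag sample,
# so the covariates are passed explicitly.
tau.hat <- predict(model, X_train)$predictions
```

With honesty left at its default, the sample is still divided between choosing splits and populating leaves; setting honesty = FALSE would use all rows for both.
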
  • Selecting balanced splits in instrumental forests


    Hi grf team,

    I am currently trying to follow how instrumental forests select balanced splits. I have read the corresponding section in the algorithm reference for causal forests, but I suppose that this does not fully apply to instrumental forests.

    In particular, I'm interested in:

    1. how min.node.size is determined
    2. what the node size measure is which is used together with alpha and imbalance.penalty
    3. what changes in 1. and 2. if stabilize.splits is set to FALSE

    Thank you for your support & best regards, Jens

    opened by JeGemm 0
  • Regarding model calibration and discrimination performance of individual treatment effect calculated from causal_survival_forest function.


    I was wondering if you might be able to help me with using the causal_survival_forest function to estimate individual treatment effects in R. I am having some difficulty understanding how to assess model calibration and check discrimination performance when using this function. Would you happen to have any recommendations or resources that might be helpful in this regard? I would really appreciate any guidance you might be able to provide.

    question 
    opened by fukuokaya 2
  • Performance of Causal Forests in the tails of the covariate distribution


    I am trying to understand Causal Forest's behavior in the tail of the distribution of the covariates. I run simulations of the type below and often find that CF estimates are constant towards the tails (see plot) which means that CFs are biased there. In the simulation below I compare this to the naïve approach using random forests for each of Y0 and Y1 to predict the outcomes and then obtain tau by the difference in predictions \hat{Y1}-\hat{Y0}. That approach seemingly does not suffer from the bias problem in the tails (but has larger variance throughout). In larger samples (e.g. n=1e4) the problem seems to persist. Is there anything I can do to get a better fit in the tails? Thanks.

    library(grf)
    library(ranger)
    
    ## Simulate non-linear ps-scores and y-models
    e.x = function(x) 1/ (1 + exp(- ( 3 * x) ))
    mu.0 = function(x) sin(-1/2- 4*x)
    mu.1 = function(x) sin(1/2 + 4*x)
    tau  = function(x) (mu.1(x) - mu.0(x))
    m.x = function(x) mu.0(x) + e.x(x) * tau(x)
    
    ## Sample training data
    set.seed(2023)
    n = 1000
    sd.y = 0.6
    Z = rnorm(n, mean=0, sd = 0.3)
    y0 = mu.0(Z) + rnorm(n, sd = sd.y)
    y1 = mu.1(Z) + rnorm(n, sd = sd.y)
    e  = e.x(Z)
    W  = rbinom(n, 1, e)
    y  = W*y1 + (1-W) * y0
    df = data.frame(Z = Z, y, y0, y1, e, W = factor(W), W.num = W)
    
    ## Estimate causal forest with tuning on
    cf = causal_forest(X = cbind(df$Z), Y = df$y, W = df$W.num, num.trees = 2e3, tune.parameters = 'all')
    
    ## Estimate heterogeneity using conditional means random forests
    rf0 = ranger(y ~ Z, data = df[df$W.num==0,], num.trees = 5e3 )
    rf1 = ranger(y ~ Z, data = df[df$W.num==1,], num.trees = 5e3 )
    
    ## Create test data
    Z.test  = seq(-1,1,0.01)
    df.test = data.frame(Z = Z.test)
    tau.test = tau(Z.test)
    tau.test.cf = predict(cf, newdata = df.test)[,1]
    tau.test.cdmrf = predict(rf1, data= df.test)$predictions - predict(rf0, data= df.test)$predictions
    
    ## Compare fits
    par(mfrow=c(1,2))
    plot(Z.test,tau.test, ylim=c(-3,3),ty='l', main='Predictions vs true effect')
    lines(Z.test,tau.test.cf,col=2)
    lines(Z.test,tau.test.cdmrf,col=4)
    plot(Z.test,tau.test.cf-tau.test, ylim=c(-3,3),col=2, ty='l', main= 'Error')
    lines(Z.test,tau.test.cdmrf-tau.test,col=4)
    abline(h=0)
    legend('bottomright', legend = c('Causal Forest', 'Standard Forests'),lty=c(1,1),col=c(2,4))
    
    
    question 
    opened by thomasklausch2 3
  • Assessing treatment heterogeneity in instrumental_forest


    Hi, everyone. Thank you for posting this fancy package. I want to test whether a heterogeneous treatment effect exists in my instrumental forest model. For your information, the sample size is around 2,300 and the number of covariates is around 40 in my data.

    There are 2 major challenges that I am facing:

    1) The test_calibration function does not support instrumental_forest. While the function supports some forest models, it cannot be used to test for treatment heterogeneity in my instrumental forest model. Is there any technical difficulty in supporting instrumental_forest in the test_calibration function? I wonder if there are any plans for updates, as best_linear_projection started to support instrumental_forest recently.

    2) The rank average treatment effect is unstable. Since I cannot use the test_calibration function, I have tried to use the rank_average_treatment_effect function. However, I found that the p-values vary significantly based on the parameters I am using. For example, if I change tune.parameters from 'all' to c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'), the p-value increases from 0.06 to 0.97, or decreases from 0.59 to 0.20, depending on the outcome Y. Moreover, the p-value also varies a lot if I change the seed of the instrumental forest model (e.g., seed=123 to seed=119, and so on). The following is the code that I'm using:

    set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection") 
    cf.priority <- instrumental_forest( X[train, ], Y[train], W[train], Z[train], 
    num.trees = 50000, 
    #tune.parameters = 'all', 
    tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'), 
    tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)
    set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")
    
    # Estimate AUTOC on held-out data.
    cf.eval <- instrumental_forest( X[-train, ], Y[-train], W[-train], Z[-train], 
    num.trees = 50000, 
    #tune.parameters = 'all', 
    tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'),
    tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)
    

    Can we say that if a certain hyperparameter set yields a low p-value, then the tuning is valid? I wonder if there is any rule of thumb in tuning these hyperparameters. Thank you for your time and all the work!

    Best, Minje

    question 
    opened by minnnjecho 1
  • NA handling regression forest vs. local linear forest covariate matrices


    Description of the bug

    Regression forest allows the X matrix to have some incomplete cases, but local linear forest returns an error. Is there a technical reason why ll forests can't handle incomplete cases?

    Steps to reproduce

    #Toy data 
    Y    <- as.vector(rnorm(100))
    X    <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
    #
    #Add NAs 
    X$x1 <- ifelse(X$x1 > 0,X$x1,NA)
    #
    #Let's try an r forest 
    regression_forest(Y = Y,
                      X = X)
    #R forest runs fine 
    #
    # Now let's try ll forest 
    ll_regression_forest(Y = Y,
                         X = X)
    #
    #ll forest returns: Error in validate_X(X) : The feature matrix X contains at least one NA.
    

    GRF version 2.2.0

    question 
    opened by spocksdad 2
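
A simple workaround for the error in the report above, sketched here as one option among several, is to restrict the data to complete cases before calling ll_regression_forest. The names keep, X.complete and Y.complete are illustrative.

```r
library(grf)

set.seed(2)
# Toy data with missing values in x1, mirroring the report above.
Y <- as.vector(rnorm(200))
X <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
X$x1 <- ifelse(X$x1 > 0, X$x1, NA)

# Keep only the rows with no missing covariates.
keep <- complete.cases(X)
X.complete <- as.matrix(X[keep, ])
Y.complete <- Y[keep]

# The local linear forest now trains without the NA error.
ll.forest <- ll_regression_forest(X.complete, Y.complete)
ll.hat <- predict(ll.forest)$predictions
```

Dropping rows changes the population being described, and in this toy example missingness depends on x1 itself, so the result only speaks to the region where x1 is observed.
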
  • Calibration test -- description and source code


    Hi everyone, Thank you for posting this package. I am trying to understand the intuition of the calibration test. I see from the source code that the calibration test does the following for a causal forest:

    preds <- predict(forest)$predictions
    mean.pred <- mean(preds)
    DF <- data.frame(
      target = unname(forest$Y.orig - forest$Y.hat),
      mean.forest.prediction = unname(forest$W.orig - forest$W.hat) * mean.pred,
      differential.forest.prediction = unname(forest$W.orig - forest$W.hat) * (preds - mean.pred)
    )
    summary(lm(target ~ mean.forest.prediction + differential.forest.prediction + 0, data = DF))

    The target are the orthogonalized outcomes. But then, the target is not regressed on the mean forest prediction and the differential forest prediction alone, but on the product of those two and the orthogonalized treatments....

    I understand that those orthogonalized outcomes are the outcome variable of the forest. But I don't understand why the mean forest prediction needs to be multiplied by the orthogonalized treatments for the test to work.

    I am curious why the description of the function says that the test computes the best linear predictor of the target estimand using the forest prediction and the mean forest prediction as the sole two regressors. It seems to me that the test uses the forest prediction and the mean forest prediction, each multiplied by the orthogonalized treatment status, as the sole two regressors.

    Or is this clarification redundant?

    Any guidance on this would be greatly appreciated.

    Lucy

    question 
    opened by lucy-temed 1
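
    A note on the regression quoted in this issue, offered as a sketch rather than an authoritative answer. The multiplication by the orthogonalized treatment follows if one assumes the partially linear model that causal forests target, with m(x) = E[Y | X = x] and e(x) = E[W | X = x] estimated by Y.hat and W.hat. The symbols \alpha, \beta, and \bar{\tau} are notation introduced here for illustration; they do not appear in the source code.

    ```latex
    \begin{align*}
    Y_i - m(X_i) &= \tau(X_i)\,\bigl(W_i - e(X_i)\bigr) + \varepsilon_i, \\
    \tau(X_i) &\approx \alpha\,\bar{\tau} + \beta\,\bigl(\hat{\tau}(X_i) - \bar{\tau}\bigr),
      \qquad \bar{\tau} = \frac{1}{n}\sum_{j=1}^{n} \hat{\tau}(X_j), \\
    Y_i - \hat{m}(X_i) &\approx \alpha\,\bar{\tau}\,\bigl(W_i - \hat{e}(X_i)\bigr)
      + \beta\,\bigl(\hat{\tau}(X_i) - \bar{\tau}\bigr)\bigl(W_i - \hat{e}(X_i)\bigr) + \varepsilon_i.
    \end{align*}
    ```

    Under this reading, the mean and differential forest predictions are the two regressors for the treatment effect \tau(X_i), not for the outcome. Because the target Y.orig - Y.hat is on the outcome scale, each regressor has to be multiplied by W.orig - W.hat before running the least-squares fit, which gives the third line. A coefficient near 1 on mean.forest.prediction then suggests that the average prediction is correct, and a coefficient near 1 on differential.forest.prediction suggests that the forest captures the heterogeneity.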