Generalized Random Forests

GRF Labs

Last update: Dec 25, 2022

Overview

generalized random forests

A pluggable package for forest-based statistical estimation and inference. GRF currently provides non-parametric methods for least-squares regression, quantile regression, survival regression, and treatment effect estimation (optionally using instrumental variables), with support for missing values.

In addition, GRF supports 'honest' estimation (where one subset of the data is used for choosing splits, and another for populating the leaves of the tree), and confidence intervals for least-squares regression and treatment effect estimation.
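
To make the idea of honest estimation concrete, here is a minimal base-R sketch for a regression "stump" with a single split. It is only an illustration under simplifying assumptions, not GRF's implementation: the helper name honest_stump and the simulated data are hypothetical. One half of the sample chooses the split point, and the other half supplies the leaf means.

```r
# Minimal sketch of honest estimation with a one-split regression stump.
# Not GRF's implementation; honest_stump is a hypothetical helper.
honest_stump <- function(x, y) {
  n <- length(y)
  split.idx <- sample(n, floor(n / 2))        # half used to choose the split
  est.idx <- setdiff(seq_len(n), split.idx)   # other half populates the leaves

  # Choose the split point that minimizes squared error on the splitting half.
  xs <- x[split.idx]
  ys <- y[split.idx]
  candidates <- sort(unique(xs))
  candidates <- candidates[-length(candidates)]  # keep both children non-empty
  sse <- sapply(candidates, function(s) {
    left <- ys[xs <= s]
    right <- ys[xs > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  split.point <- candidates[which.min(sse)]

  # Leaf predictions come only from the estimation half.
  xe <- x[est.idx]
  ye <- y[est.idx]
  list(split.point = split.point,
       left.mean = mean(ye[xe <= split.point]),
       right.mean = mean(ye[xe > split.point]))
}

set.seed(1)
x <- runif(400, -1, 1)
y <- as.numeric(x > 0) + rnorm(400, sd = 0.1)  # step function plus noise
fit <- honest_stump(x, y)
fit
```

Because the leaf means are computed on observations that played no role in choosing the split, they are not biased by the split search itself. The cost is that each task sees only half of the data.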

Some helpful links for getting started:

The R package documentation contains usage examples and method reference.
The GRF reference gives a detailed description of the GRF algorithm and includes troubleshooting suggestions.
For community questions and answers around usage, see GitHub issues labelled 'question'.

The repository first started as a fork of the ranger repository -- we owe a great deal of thanks to the ranger authors for their useful and free package.

Installation

The latest release of the package can be installed through CRAN:

install.packages("grf")

conda users can install from the conda-forge channel:

conda install -c conda-forge r-grf

The current development version can be installed from source using devtools.

devtools::install_github("grf-labs/grf", subdir = "r-package/grf")

Note that to install from source, a compiler that implements C++11 is required (clang 3.3 or higher, or g++ 4.8 or higher). If installing on Windows, the RTools toolchain is also required.

Usage Examples

The following script demonstrates how to use GRF for heterogeneous treatment effect estimation. For examples of how to use other types of forests, such as for quantile regression and causal effect estimation using instrumental variables, please consult the R documentation on the relevant forest methods (quantile_forest, instrumental_forest, etc.).

library(grf)

# Generate data.
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
X.test <- matrix(0, 101, p)
X.test[, 1] <- seq(-2, 2, length.out = 101)

# Train a causal forest.
W <- rbinom(n, 1, 0.4 + 0.2 * (X[, 1] > 0))
Y <- pmax(X[, 1], 0) * W + X[, 2] + pmin(X[, 3], 0) + rnorm(n)
tau.forest <- causal_forest(X, Y, W)

# Estimate treatment effects for the training data using out-of-bag prediction.
tau.hat.oob <- predict(tau.forest)
hist(tau.hat.oob$predictions)

# Estimate treatment effects for the test sample.
tau.hat <- predict(tau.forest, X.test)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 2)

# Estimate the conditional average treatment effect on the full sample (CATE).
average_treatment_effect(tau.forest, target.sample = "all")

# Estimate the conditional average treatment effect on the treated sample (CATT).
average_treatment_effect(tau.forest, target.sample = "treated")

# Add confidence intervals for heterogeneous treatment effects; growing more trees is now recommended.
tau.forest <- causal_forest(X, Y, W, num.trees = 4000)
tau.hat <- predict(tau.forest, X.test, estimate.variance = TRUE)
sigma.hat <- sqrt(tau.hat$variance.estimates)
plot(X.test[, 1], tau.hat$predictions, ylim = range(tau.hat$predictions + 1.96 * sigma.hat, tau.hat$predictions - 1.96 * sigma.hat, 0, 2), xlab = "x", ylab = "tau", type = "l")
lines(X.test[, 1], tau.hat$predictions + 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], tau.hat$predictions - 1.96 * sigma.hat, col = 1, lty = 2)
lines(X.test[, 1], pmax(0, X.test[, 1]), col = 2, lty = 1)

# In some examples, pre-fitting models for Y and W separately may
# be helpful (e.g., if different models use different covariates).
# In some applications, one may even want to get Y.hat and W.hat
# using a completely different method (e.g., boosting).

# Generate new data.
n <- 4000
p <- 20
X <- matrix(rnorm(n * p), n, p)
TAU <- 1 / (1 + exp(-X[, 3]))
W <- rbinom(n, 1, 1 / (1 + exp(-X[, 1] - X[, 2])))
Y <- pmax(X[, 2] + X[, 3], 0) + rowMeans(X[, 4:6]) / 2 + W * TAU + rnorm(n)

forest.W <- regression_forest(X, W, tune.parameters = "all")
W.hat <- predict(forest.W)$predictions

forest.Y <- regression_forest(X, Y, tune.parameters = "all")
Y.hat <- predict(forest.Y)$predictions

forest.Y.varimp <- variable_importance(forest.Y)

# Note: Forests may have a hard time when trained on very few variables
# (e.g., ncol(X) = 1, 2, or 3). We recommend not being too aggressive
# in selection.
selected.vars <- which(forest.Y.varimp / mean(forest.Y.varimp) > 0.2)

tau.forest <- causal_forest(X[, selected.vars], Y, W,
                            W.hat = W.hat, Y.hat = Y.hat,
                            tune.parameters = "all")

# Check whether causal forest predictions are well calibrated.
test_calibration(tau.forest)
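
As a complement to the causal forest script above, the following is a short sketch of one of the other forest types mentioned earlier, quantile_forest. The simulated data and variable names are illustrative assumptions, and the sketch assumes a recent grf release in which predict() on a quantile forest returns a list with a predictions matrix.

```r
library(grf)

set.seed(42)
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
# The noise scale depends on X[, 1], so the conditional quantiles fan out.
Y <- X[, 1] + (1 + abs(X[, 1])) * rnorm(n)

# Train a quantile forest targeting the 10th, 50th and 90th percentiles.
q.forest <- quantile_forest(X, Y, quantiles = c(0.1, 0.5, 0.9))

# Predict the three quantiles on a small grid along the first covariate.
X.test <- matrix(0, 5, p)
X.test[, 1] <- seq(-2, 2, length.out = 5)
q.hat <- predict(q.forest, X.test, quantiles = c(0.1, 0.5, 0.9))
q.hat$predictions
```

Each row of the result corresponds to a test point and each column to one requested quantile.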

Developing

In addition to providing out-of-the-box forests for quantile regression and causal effect estimation, GRF provides a framework for creating forests tailored to new statistical tasks. If you'd like to develop using GRF, please consult the algorithm reference and development guide.

Funding

Development of GRF is supported by the National Science Foundation, the Sloan Foundation, the Office of Naval Research (Grant N00014-17-1-2131) and Schmidt Futures.

References

Susan Athey and Stefan Wager. Estimating Treatment Effects with Causal Forests: An Application. Observational Studies, 5, 2019. [paper, arxiv]

Susan Athey, Julie Tibshirani and Stefan Wager. Generalized Random Forests. Annals of Statistics, 47(2), 2019. [paper, arxiv]

Yifan Cui, Michael R. Kosorok, Erik Sverdrup, Stefan Wager, and Ruoqing Zhu. Estimating Heterogeneous Treatment Effects with Right-Censored Data via Causal Survival Forests. 2020. [arxiv]

Rina Friedberg, Julie Tibshirani, Susan Athey, and Stefan Wager. Local Linear Forests. Journal of Computational and Graphical Statistics, 2020. [paper, arxiv]

Imke Mayer, Erik Sverdrup, Tobias Gauss, Jean-Denis Moyer, Stefan Wager and Julie Josse. Doubly Robust Treatment Effect Estimation with Missing Attributes. Annals of Applied Statistics, 14(3), 2020. [paper, arxiv]

Stefan Wager and Susan Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523), 2018. [paper, arxiv]

Comments

Using causal_forest to estimate average treatment effects over subgroups
After reading a few of the papers on causal forests, it seems like the idea is the trees will decide to split more often on variables that cause the treatment effect to vary (that is, moderators of the experimental effect). However, the object returned by causal_forest seems to work like any other machine learning algorithm focused on prediction—not causal statistical inference—in the way the predict function associated with it works.

How can I use the causal_forest function to find where the treatment varies?

For example, I simulate a randomized experiment with a binary outcome where there is a positive effect for women and a negative effect for black men. There are also nuisance variables in here. How would I use causal_forest to tell me that, "The trees are tending to split most on gender, and then when it leads to male, the trees are also more likely to split on race"? This would help show me where the treatment effects are occurring.

Here are the data:

set.seed(1839)
n <- 5000
X <- data.frame(
  condition = factor(sample(c("control", "treatment"), n, TRUE)),
  gender = factor(sample(c("male", "female"), n, TRUE)),
  race = factor(sample(c("black", "white", "hispanic", "asian"), n, TRUE)),
  generation = factor(sample(c("millennial", "x", "babyboomer"), n, TRUE)),
  has_kids = factor(sample(c("no", "yes"), n, TRUE))
)
y <- factor(ifelse(
  X$gender == "female" & X$condition == "treatment",
  sample(c("positive", "negative"), n, TRUE, c(.50, .50)),
  ifelse(X$race == "black" & X$condition == "treatment",
    sample(c("positive", "negative"), n, TRUE, c(.12, .88)),
    sample(c("positive", "negative"), n, TRUE, c(.40, .60))
  )
))

And then I fit a model using grf::causal_forest:

library(grf)
mod <- causal_forest(
  model.matrix(~ -1 + ., X[, -1]), # convert factors to dummies
  (as.numeric(y) - 1),             # convert outcome to 0 or 1
  (as.numeric(X[, 1]) - 1),        # convert treatment to 0 or 1
  num.trees = 4000
)

Looking at mod, I see variable importance values that hint at where the splits are occurring most:

Variable importance:
    1     2     3     4     5     6     7     8
0.233 0.203 0.267 0.061 0.055 0.046 0.081 0.054

However, this doesn't capture how multiple X variables may depend on one another in their interaction with W on Y (what may be called a three-way interaction in the general linear model world).

If I use the predict() function, I can only look at individual-level conditional treatment effects and their associated variance estimates. For example, just the first row of my data:

predict(mod, model.matrix(~ -1 + ., X[1, -1]), estimate.variance = TRUE)

However, how could I estimate the treatment effect and variance for the category women, collapsing across all other variables? Or the combination of male and Black, collapsing across all other variables? Is this as simple as averaging the treatment effects on the training set within the groups of interest? And if so, how are confidence intervals calculated from those?

Additionally, can the causal_forest function tell me where these variations are most likely to occur? It seems like this is possible from the papers I have read on causal forests (as well as earlier papers on causal trees and transformed outcome trees, etc.), but I may very well be mistaken.
help wanted feature question
opened by markhwhiteii 26
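
One possible approach to the subgroup question above, sketched under stated assumptions: average_treatment_effect accepts a subset argument, which restricts the doubly robust average to the selected rows and returns an estimate with a standard error. The simulated data below (a randomized treatment whose effect exists only for women) is a simplified stand-in for the data in the question, not the original.

```r
library(grf)

set.seed(1839)
n <- 2000
female <- rbinom(n, 1, 0.5)
noise <- rnorm(n)
X <- cbind(female = female, noise = noise)
W <- rbinom(n, 1, 0.5)             # randomized treatment
Y <- 0.5 * female * W + rnorm(n)   # assumed effect: 0.5 for women, 0 for men

forest <- causal_forest(X, Y, W, num.trees = 1000)

# Doubly robust average effect within each subgroup, with standard errors.
ate.women <- average_treatment_effect(forest, subset = X[, "female"] == 1)
ate.men <- average_treatment_effect(forest, subset = X[, "female"] == 0)

# Approximate 95% confidence interval for the effect among women.
ate.women["estimate"] + c(-1.96, 1.96) * ate.women["std.err"]
```

This avoids averaging the individual predictions by hand, which would not by itself yield a valid standard error for the subgroup.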

Continuous Treatment Memory Error

Description of the bug

I am estimating a causal forest on a large dataset (around 2-3 GB with 1.6 million observations and 8 independent variables, as well as 1 dependent variable) with 4000 trees. When I use a binary treatment W, the forest runs fine. However, when I switch to a continuous treatment W, R crashes.

I am running it on a HPC with 32 CPUs and 12GB RAM per CPU.

The error message is (I masked some of my memory addresses with *):

*** Error in `/PATH/lib/R/bin/exec/R': break adjusted to free malloc space: 0x000******** ***
======= Backtrace: =========
/lib64/libc.so.6(+0x82257)[0x7f5*********]
/lib64/libc.so.6(+0x82cea)[0x7f5*********]
/lib64/libc.so.6(__libc_malloc+0xc7)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp_alloc.so(malloc+0x22)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(_Znwm+0x15)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6vectorImSaImEE17_M_default_appendEm+0xc3)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf4Tree15find_leaf_nodesERKNS_4DataERKSt6vectorImSaImEE+0xb7)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer21repopulate_leaf_nodesERKSt10unique_ptrINS_4TreeESt14default_deleteIS2_EERKNS_4DataERKSt6vectorImSaImEEb+0xe0)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf11TreeTrainer5trainERKNS_4DataERNS_13RandomSamplerERKSt6vectorImSaImEERKNS_11TreeOptionsE+0x408)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer14train_ci_groupERKNS_4DataERNS_13RandomSamplerERKNS_13ForestOptionsE+0x135)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNK3grf13ForestTrainer11train_batchEmmRKNS_4DataERKNS_13ForestOptionsE+0x1a9)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt17_Function_handlerIFSt10unique_ptrINSt13__future_base12_Result_baseENS2_8_DeleterEEvENS1_12_Task_setterIS0_INS1_7_ResultISt6vectorIS0_IN3grf4TreeESt14default_deleteISA_EESaISD_EEEES3_ENSt6thread8_InvokerISt5tupleIJMNS9_13ForestTrainerEKFSF_mmRKNS9_4DataERKNS9_13ForestOptionsEEPKSL_mmSO_SP_EEEESF_EEE9_M_invokeERKSt9_Any_data+0x67)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(+0x526e9)[0x7f5*********]
/lib64/libpthread.so.0(+0x620b)[0x7f5*********]
/PATH/lib/R/library/grf/libs/grf.so(_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZNSt13__future_base17_Async_state_implINS1_IS2_IJMN3grf13ForestTrainerEKFSt6vectorISt10unique_ptrINS5_4TreeESt14default_deleteIS9_EESaISC_EEmmRKNS5_4DataERKNS5_13ForestOptionsEEPKS6_mmSH_SI_EEEESE_EC4EOSQ_EUlvE_EEEEE6_M_runEv+0x116)[0x7f5*********]
/PATH/lib/R/bin/exec/../../../libstdc++.so.6(+0xc819d)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35eeb)[0x7f5*********]
/lib64/libpthread.so.0(+0x7ea5)[0x7f5*********]
/usr/local/Cluster-Apps/dmtcp/dmtcp-2.6.0-intel-17.0.4/lib/dmtcp/libdmtcp.so(+0x35cad)[0x7f5*********]
/lib64/libc.so.6(clone+0x6d)[0x7f5*********]

Does continuous treatment consume more memory than binary treatment?

GRF version development

bug

opened by ginward 20

GRF refuses to report average treatment effect when the dataset is imbalanced

Description of the bug

Download the dataset at this link. This is a dataset by LaLonde (1986), Dehejia and Wahba (1999) on the NSW training program. I am not sure if there is copyright - if so it belongs to them.

The data set is very imbalanced - there are around 10000 untreated examples and fewer than 300 treated examples.

When I run the code below in the development version, average_treatment_effect(cps.forest, target.sample = "all") returns NaN. However, the code returns numbers in the released version 0.10.2. It only returns NaN in the development branch.

Is this supposed to be a feature (i.e. returns NaN when data is very imbalanced), or is it actually a bug?

The average treatment effect for the treated and the overlap-weighted treatment effect both look normal, although the numbers differ slightly from the released version as well.

Steps to reproduce

# compile the grf development code and include it into the library
library(dplyr)
select <- dplyr::select
TREES_CPS=1000
SEED=6000
set.seed(SEED)
dat=read.csv("bug.csv")
forest=causal_forest(seed=SEED,num.trees = TREES_CPS,
  X=as.matrix(select(dat, re74, re75, age, education, black, hispanic, married, nodegree)),
  Y=dat$re78, W=dat$treat)
average_treatment_effect(forest, target.sample = "all")

Output in Development Version

> average_treatment_effect(forest, target.sample = "all")
estimate  std.err
     NaN      NaN

Output in Release Version

> average_treatment_effect(forest, target.sample = "all")
 estimate   std.err
-2058.140  1401.162

GRF version development
question
opened by ginward 18
Address poor performance of honest forests on small datasets.

When honesty is enabled, the training subsample is further split in half before performing splitting. With small datasets, this may not leave enough information for the algorithm to determine high-quality splits.

This issue is still pending a concrete proposal on how it should be addressed.
feature

opened by jtibshirani 18
causal_forest without orthogonalization

I understand that the default option of the causal_forest function is for it to use orthogonalization; that is, it incorporates both the actual values of the outcome (Y) and treatment (W), as well as the predicted values of the outcome (Y.hat) and treatment (W.hat) to estimate CATEs as well as to make splits throughout the forest. However, I have a question related to this:

Q: Is there any way to turn "off" the orthogonalization? That is, for demonstration purposes (not practical purposes), I wanted to simulate how performance of the causal_forest degrades if it is trained on only the actual values of the outcome (Y) and treatment (W), and not the residual values (i.e., actual - predicted) ?

opened by njawadekar 17
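
One way to experiment with this for demonstration purposes, offered as a sketch rather than an official switch: causal_forest accepts user-supplied Y.hat and W.hat (as in the usage example earlier on this page), so passing constant vectors removes any covariate information from the centering step. The simulation below is an illustrative assumption with confounding through the first covariate and a constant true effect.

```r
library(grf)

set.seed(7)
n <- 2000
p <- 5
X <- matrix(rnorm(n * p), n, p)
e <- 1 / (1 + exp(-2 * X[, 1]))   # propensity depends on X[, 1] (confounding)
W <- rbinom(n, 1, e)
tau <- 1                          # assumed constant true treatment effect
Y <- 2 * X[, 1] + tau * W + rnorm(n)

# Default: Y.hat and W.hat are estimated internally and used for centering.
cf.orth <- causal_forest(X, Y, W, num.trees = 500)

# Constant nuisance estimates: centering no longer adjusts for covariates.
cf.raw <- causal_forest(X, Y, W, num.trees = 500,
                        Y.hat = rep(mean(Y), n), W.hat = rep(mean(W), n))

# Compare out-of-bag prediction error against the known effect.
mse.orth <- mean((predict(cf.orth)$predictions - tau)^2)
mse.raw <- mean((predict(cf.raw)$predictions - tau)^2)
c(orthogonalized = mse.orth, constant.nuisance = mse.raw)
```

Comparing the two errors on simulated data with a known effect gives a direct view of how much the covariate-based centering contributes in a confounded setting.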
Questions about the function "causal_survival_forest"
Hello, I am trying to use the causal survival forest prediction function. I have some questions about the "causal_survival_forest" function, as follows:

I cannot use the arguments "target" and "horizon";

In the Details section of the function documentation (quoted below), why is it "'D[Y >= Y.max] <- 1' and 'Y[Y >= Y.max] <- Y.max'" and not "'D[Y >= Y.max] <- 0' and 'Y[Y >= Y.max] <- Y.max'"?

Details: An important assumption for identifying the conditional average treatment effect tau(X) is that there exists a fixed positive constant M such that the probability of observing an event time past the maximum follow-up time Y.max is at least M. This may be an issue with data where most endpoint observations are censored. The suggested resolution is to re-define the estimand as the treatment effect up to some suitable maximum follow-up time Y.max. One can do this in practice by thresholding Y before running causal_survival_forest: 'D[Y >= Y.max] <- 1' and 'Y[Y >= Y.max] <- Y.max'. For details see Cui et al. (2020). The computational complexity of this estimator scales with the cardinality of the event times Y. If the number of samples is large and the Y grid dense, consider rounding the event times (or supply a coarser grid with the 'failure.times' argument).

When I used the "causal_survival_forest" function to analyze my RCT (randomized controlled trial) data, I expected tau(X) = E[Y(1) - Y(0) | X = x] > 0 (1: treatment arm; 0: control arm), where E[Y] is the expected survival time = integral(survival function). However, the analysis gave tau(X) = E[Y(1) - Y(0) | X = x] < 0, contrary to what I expected, which seems very unreasonable.

Could you give me an answer.

Thanks very much! Xin
opened by ChenXinaha 17
add estimate_counterfactual_outcomes for W in {0,1} #403

A start to providing the functionality requested in https://github.com/grf-labs/grf/issues/403#issuecomment-486915981:

Finally, it would be nice to provide a short function which performs the above steps, or at least add this information to our algorithm reference.

This does not include the general calculation if multiple, or non {0,1} outcomes exist, but happy to consider that if helpful.

opened by ras44 17
Difference between predict() and average_treatment_effect() for calculating CATEs in honest causal_forest

Hi, I am writing this in an effort to better understand the predict() and average_treatment_effect() functions, particularly in regards to when one should be utilized over the other to estimate conditional average treatment effects (CATEs) in an honest causal forest. Additional details related to this query are listed below:

(1) Research Goal: After building an honest causal_forest on my dataset, I would now like to calculate Conditional Average Treatment Effects (CATEs) within specific strata of covariates on the same dataset.

(2) Initial Plan: Based on this application paper by Athey & Wager, it seems that I should be using the predict() function in order to estimate these CATEs using an "honest" approach. Based on the documentation on predict(), it appears that by default, this function estimates the treatment effects such that these effects are estimated for every observation using only the trees in the forest which did not use that particular observation when it was modeled--so, out-of-bag estimation.

(3) Question: However, I understand that there is additionally an average_treatment_effect function, which can also supposedly estimate Conditional Average Treatment Effects in a causal forest in a doubly robust fashion. I would like to better understand the differences between these two functions (predict() and average_treatment_effect()), and the different circumstances in which one function should be used over the other to estimate CATEs on data within an honest causal forest. Evidently, the math behind each function differs, as shown in my attached code that I ran on a mock dataset. This attached R code can be used to reproduce the very different conditional average treatment effects that I calculated for a specific subset of individuals when I used the predict() approach vs. average_treatment_effect().

In addition to explaining which function is better for calculating conditional average treatment effects in various circumstances, could someone also please explain in layman's terms what each function is doing behind the scenes? For example, is the average_treatment_effect function using all of the trees to estimate the treatment effects (and not out-of-bag?)? Also - how are propensity scores utilized for the average_treatment_effect function?

Thanks!

Steps to reproduce

Please find the attached code. cates_cf.txt

GRF version 2.0.2

opened by njawadekar 16
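
To illustrate what "doubly robust" means in the question above, here is a base-R sketch of the textbook augmented inverse-propensity weighted (AIPW) score. It is not a copy of grf's internals, and the helper name aipw_scores is hypothetical. Each score combines a CATE estimate with a propensity-weighted residual correction, and averaging scores over a subgroup gives an estimate with a standard error. The simulation plugs in the true nuisance functions in order to isolate the formula.

```r
# Textbook AIPW scores; aipw_scores is a hypothetical helper.
# Y.hat estimates E[Y | X], W.hat estimates E[W | X], tau.hat estimates the CATE.
aipw_scores <- function(Y, W, tau.hat, Y.hat, W.hat) {
  # Implied conditional means under control and under treatment.
  mu0 <- Y.hat - W.hat * tau.hat
  mu1 <- Y.hat + (1 - W.hat) * tau.hat
  mu.w <- ifelse(W == 1, mu1, mu0)
  # CATE estimate plus an inverse-propensity-weighted residual correction.
  tau.hat + (W - W.hat) / (W.hat * (1 - W.hat)) * (Y - mu.w)
}

set.seed(3)
n <- 5000
x <- rnorm(n)
e <- plogis(x)                 # true propensity score
W <- rbinom(n, 1, e)
tau <- 1 + x                   # true heterogeneous effect
m <- x + e * tau               # true E[Y | X]
Y <- x + W * tau + rnorm(n)

scores <- aipw_scores(Y, W, tau.hat = tau, Y.hat = m, W.hat = e)

# Average effect and standard error within the subgroup x > 0.
subgroup <- x > 0
c(estimate = mean(scores[subgroup]),
  std.err = sd(scores[subgroup]) / sqrt(sum(subgroup)))
```

By contrast, averaging predict() output over a subgroup uses only the tau.hat term. The correction term is what lets the average remain reliable when the pointwise CATE estimates are noisy or biased, and it is also where the propensity scores enter.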
Add cobalt-style balance plots to causal forests

For propensity score matching, there are some ways to test whether the samples are indeed matched well.

Are there similar methods to examine the quality of matching for the causal forest?

Thanks a lot for your time!
help wanted feature

opened by ZhangMengxia 16
Account for hierarchical structure

Right now GRF is based on an IID assumption. It would be nice to be able to use GRF on data with a hierarchical structure. This is especially relevant in RCTs where studies occur across administrative units like provinces, towns, school districts, etc.

opened by lminer 15
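
For grouped data of the kind described above, one relevant option is the clusters argument of causal_forest, which takes a vector of cluster identifiers. The sketch below uses simulated data with a shared random effect per cluster. The data-generating process and variable names are illustrative assumptions, and this does not cover every form of hierarchical structure.

```r
library(grf)

set.seed(11)
n.clusters <- 50
per.cluster <- 20
n <- n.clusters * per.cluster
cluster.id <- rep(seq_len(n.clusters), each = per.cluster)

X <- matrix(rnorm(n * 4), n, 4)
cluster.effect <- rnorm(n.clusters)[cluster.id]  # shared shock within a cluster
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] + 0.5 * W + cluster.effect + rnorm(n)

# Passing cluster identifiers makes sampling and inference cluster-aware.
forest <- causal_forest(X, Y, W, clusters = cluster.id, num.trees = 500)
ate <- average_treatment_effect(forest)
ate
```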
Question: How can I use the package to build a single causal tree?

I'm trying to build a single causal tree using the following code:

model <- causal_forest(X_train, y_train, z_train, num.trees = 1)

However, I noticed that the causal_forest method has the parameter sample.fraction, which defines the fraction of the data that is used to build each tree (and is 0.5 by default). Because I want to use the entire data set to build the causal tree, I want to set this to 1, but when I run the following code:

model <- causal_forest(X_train, y_train, z_train, num.trees = 1, sample.fraction=1)

I get the following error message:

"Error in causal_train(data$default, data$sparse, outcome.index, treatment.index, : When confidence intervals are enabled, the sampling fraction must be less than 0.5."

Could you please tell me how to disable confidence intervals in order to build a tree using the entire sample? Thanks in advance!
question

opened by ferlocar 14
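
A possible way to grow a single tree on the full sample, sketched under stated assumptions: the error quoted above is tied to confidence-interval groups, and setting ci.group.size = 1 turns those off. The sketch also disables honesty and out-of-bag predictions, since with sample.fraction = 1 no observation is left out of the tree. Y.hat and W.hat are supplied explicitly so that the nuisance estimates do not depend on these single-tree settings. The simulated data is an illustrative assumption.

```r
library(grf)

set.seed(5)
n <- 500
p <- 4
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- pmax(X[, 1], 0) * W + rnorm(n)

# Nuisance estimates from a separate default regression forest.
# W.hat = 0.5 is the known randomization probability in this simulation.
Y.hat <- predict(regression_forest(X, Y))$predictions
W.hat <- rep(0.5, n)

tree.forest <- causal_forest(
  X, Y, W,
  Y.hat = Y.hat, W.hat = W.hat,
  num.trees = 1,
  ci.group.size = 1,               # no confidence-interval groups
  sample.fraction = 1,             # grow the tree on every observation
  honesty = FALSE,                 # do not hold out half of the sample
  compute.oob.predictions = FALSE  # no out-of-bag rows exist here
)

# Inspect the single tree and predict on the training covariates.
tree <- get_tree(tree.forest, 1)
print(tree)
tau.hat <- predict(tree.forest, X)$predictions
```

A single non-honest tree gives up the variance reduction and inference guarantees of the forest, so this is mainly useful for illustration.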
Selecting balanced splits in instrumental forests
Hi grf team,

I am currently trying to follow how instrumental forests select balanced splits. I have read the corresponding section in the algorithm reference for causal forests, but I suppose that this does not fully apply to instrumental forests.

In particular, I'm interested in:

1. how min.node.size is determined

2. what the node size measure is which is used together with alpha and imbalance.penalty

3. what changes in 1. and 2. if stabilize.splits is set to FALSE

Thank you for your support & best regards, Jens
opened by JeGemm 0
Regarding model calibration and discrimination performance of individual treatment effect calculated from causal_survival_forest function.

I was wondering if you might be able to help me with using the causal_survival_forest function to estimate individual treatment effect in R environment. I am having some difficulty understanding how to assess model calibration and check discrimination performance when using this function. Would you happen to have any recommendations or resources that might be helpful in this regard? I would really appreciate any guidance you might be able to provide.
question

opened by fukuokaya 2

Performance of Causal Forests in the tails of the covariate distribution

I am trying to understand Causal Forest's behavior in the tail of the distribution of the covariates. I run simulations of the type below and often find that CF estimates are constant towards the tails (see plot) which means that CFs are biased there. In the simulation below I compare this to the naïve approach using random forests for each of Y0 and Y1 to predict the outcomes and then obtain tau by the difference in predictions \hat{Y1}-\hat{Y0}. That approach seemingly does not suffer from the bias problem in the tails (but has larger variance throughout). In larger samples (e.g. n=1e4) the problem seems to persist. Is there anything I can do to get a better fit in the tails? Thanks.

library(grf)
library(ranger)

## Simulate non-linear ps-scores and y-models
e.x = function(x) 1/ (1 + exp(- ( 3 * x) ))
mu.0 = function(x) sin(-1/2- 4*x)
mu.1 = function(x) sin(1/2 + 4*x)
tau  = function(x) (mu.1(x) - mu.0(x))
m.x = function(x) mu.0(x) + e.x(x) * tau(x)

## Sample training data
set.seed(2023)
n = 1000
sd.y = 0.6
Z = rnorm(n, mean=0, sd = 0.3)
y0 = mu.0(Z) + rnorm(n, sd = sd.y)
y1 = mu.1(Z) + rnorm(n, sd = sd.y)
e  = e.x(Z)
W  = rbinom(n, 1, e)
y  = W*y1 + (1-W) * y0
df = data.frame(Z = Z, y, y0, y1, e, W = factor(W), W.num = W)

## Estimate causal forest with tuning on
cf = causal_forest(X = cbind(df$Z), Y = df$y, W = df$W.num, num.trees = 2e3, tune.parameters = 'all')

## Estimate heterogeneity using conditional means random forests
rf0 = ranger(y ~ Z, data = df[df$W.num==0,], num.trees = 5e3 )
rf1 = ranger(y ~ Z, data = df[df$W.num==1,], num.trees = 5e3 )

## Create test data
Z.test  = seq(-1,1,0.01)
df.test = data.frame(Z = Z.test)
tau.test = tau(Z.test)
tau.test.cf = predict(cf, newdata = df.test)[,1]
tau.test.cdmrf = predict(rf1, data= df.test)$predictions - predict(rf0, data= df.test)$predictions

## Compare fits
par(mfrow=c(1,2))
plot(Z.test,tau.test, ylim=c(-3,3),ty='l', main='Predictions vs true effect')
lines(Z.test,tau.test.cf,col=2)
lines(Z.test,tau.test.cdmrf,col=4)
plot(Z.test,tau.test.cf-tau.test, ylim=c(-3,3),col=2, ty='l', main= 'Error')
lines(Z.test,tau.test.cdmrf-tau.test,col=4)
abline(h=0)
legend('bottomright', legend = c('Causal Forest', 'Standard Forests'),lty=c(1,1),col=c(2,4))

question

opened by thomasklausch2 3

Assessing treatment heterogeneity in instrumental_forest
Hi, everyone. Thank you for posting this fancy package. I want to test whether a heterogeneous treatment effect exists in my instrumental forest model. For your information, the sample size is around 2,300 and the number of covariates is around 40 in my data.

There are 2 major challenges that I am facing: 1) test_calibration function does not support instrumental_forest. While the function supports some forest models, it is impossible to find any treatment heterogeneity in my instrumental forest model using the function. Is there any technical difficulty in supporting instrumental_forest in the test_calibration function? I wonder if there would be any plans for updates as best_linear_projection started to support instrumental_forest recently.

2) Rank average treatment effect is unstable. Since I cannot use the test_calibration function, I have tried to use the rank_average_treatment_effect function. However, I found that the p-values vary significantly based on the parameters I am using. For example, if I change tune.parameters from 'all' to c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'), the p-value increases from 0.06 to 0.97, or decreases from 0.59 to 0.20, depending on the outcome Y. Moreover, the p-value also varies a lot if I change the seed of the instrumental forest model (e.g., seed=123 to seed=119, and so on). The following is the code that I'm using:

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")
cf.priority <- instrumental_forest(
  X[train, ], Y[train], W[train], Z[train],
  num.trees = 50000,
  #tune.parameters = 'all',
  tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'),
  tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)

set.seed(123, kind = "Mersenne-Twister", normal.kind = "Inversion", sample.kind = "Rejection")
# Estimate AUTOC on held-out data.
cf.eval <- instrumental_forest(
  X[-train, ], Y[-train], W[-train], Z[-train],
  num.trees = 50000,
  #tune.parameters = 'all',
  tune.parameters = c('sample.fraction', 'mtry', 'min.node.size', 'alpha', 'imbalance.penalty'),
  tune.num.trees = 4000, tune.num.reps = 250, tune.num.draws = 4500)

Can we say that if a certain hyperparameter set yields a low p-value, then the tuning is valid? I wonder if there is any rule of thumb in tuning these hyperparameters. Thank you for your time and all the work!

Best, Minje
question
opened by minnnjecho 1

NA handling regression forest vs. local linear forest covariate matrices

Description of the bug

Regression forest allows the X matrix to have some incomplete cases, but local linear forest returns an error. Not sure if there's some technical reason why ll forests can't handle incomplete cases?

Steps to reproduce

#Toy data 
Y    <- as.vector(rnorm(100))
X    <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
#
#Add NAs 
X$x1 <- ifelse(X$x1 > 0,X$x1,NA)
#
#Let's try an r forest 
regression_forest(Y = Y,
                  X = X)
#R forest runs fine 
#
# Now let's try ll forest 
ll_regression_forest(Y = Y,
                     X = X)
#
#ll forest returns: Error in validate_X(X) : The feature matrix X contains at least one NA.

GRF version 2.2.0

question

opened by spocksdad 2

Calibration test -- description and source code

Hi everyone, Thank you for posting this package. I am trying to understand the intuition of the calibration test. I see from the source code that the calibration test does the following for a causal forest:

preds <- predict(forest)$predictions
mean.pred <- mean(preds)
DF <- data.frame(
  target = unname(forest$Y.orig - forest$Y.hat),
  mean.forest.prediction = unname(forest$W.orig - forest$W.hat) * mean.pred,
  differential.forest.prediction = unname(forest$W.orig - forest$W.hat) * (preds - mean.pred))

summary(lm(target~ mean.forest.prediction + differential.forest.prediction +0, data=DF))

The target is the orthogonalized outcome. But then, the target is not regressed on the mean forest prediction and the differential forest prediction alone, but on the product of each of those with the orthogonalized treatments....

I understand that those orthogonalized outcomes are the outcome variable of the forest. But I don't understand why the mean forest prediction needs to be multiplied by the orthogonalized treatments for the test to work.

I am just so curious why the description of the function says that the test computes the best linear predictor of the target estimand using the forest prediction as well as the mean forest prediction as the sole two regressors. It seems to me that the test uses the forest prediction and the mean forest prediction multiplied by the orthogonalized treatment status as the sole two regressors.

Or is this clarification redundant?

Any guidance on this would be greatly appreciated.

Lucy
question

opened by lucy-temed 1
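
One way to see why the orthogonalized treatment appears in both regressors of the code quoted above is the residual-on-residual representation of the treatment effect. The notation below is introduced here for illustration: $m(x) = E[Y \mid X = x]$, $e(x) = E[W \mid X = x]$, $\hat\tau(x)$ is the forest prediction and $\bar\tau$ its mean. This is a sketch of the reasoning under unconfoundedness, not a statement taken from the package documentation.

```latex
% Residual-on-residual form of the outcome model:
Y_i - m(X_i) = \tau(X_i)\,\bigl(W_i - e(X_i)\bigr) + \varepsilon_i ,
\qquad E[\varepsilon_i \mid X_i, W_i] = 0 .

% Posit a linear calibration of the true effect on the forest prediction:
\tau(X_i) \approx \alpha\,\bar\tau + \beta\,\bigl(\hat\tau(X_i) - \bar\tau\bigr) .

% Substituting the second line into the first gives the fitted regression:
Y_i - m(X_i) \approx
  \alpha\,\underbrace{\bar\tau\,\bigl(W_i - e(X_i)\bigr)}_{\text{mean.forest.prediction}}
  + \beta\,\underbrace{\bigl(\hat\tau(X_i) - \bar\tau\bigr)\bigl(W_i - e(X_i)\bigr)}_{\text{differential.forest.prediction}}
  + \varepsilon_i .
```

The target of the regression is the centered outcome rather than the treatment effect itself, and a treatment effect only moves the outcome through $W_i - e(X_i)$. The coefficients $\alpha$ and $\beta$ still describe a linear fit of $\tau(X_i)$ on the mean and differential forest predictions, which matches the wording of the function description.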

Owner

GRF Labs

GitHub https://grf-labs.github.io/grf/
