Updates:
- Fix bugs (or typos) in causallearn/utils/KCI/KCI.py to conform to the original Matlab implementation in http://people.tuebingen.mpg.de/kzhang/KCI-test.zip:
  - KCI_CInd.kernel_matrix: KernelX's empirical width depends on data_z's shape; the same applies to KernelY (Line 349, 373).
  - KCI_CInd.kernel_matrix: center the kernel matrix kzx (Line 404, 457, 462) before feeding it into KCI_V_statistic. After double checking, the optimization @MarkDana did previously (https://github.com/cmu-phil/causal-learn/pull/49) still applies.
  - KCI_CInd.kernel_matrix: change the data normalization by adjusting the degrees-of-freedom correction used when computing the standard deviation (a minimal sketch of the centering and normalization changes follows this list).
- Add test cases where the datasets are generated from non-Gaussian distributions, e.g., the exponential distribution, the uniform distribution, and mixtures of these distributions.
- Remove print commands and add assertions on the exact p-value. The ground-truth p-value is computed from the current version 26d8e36.
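
For reference, here is a minimal sketch of the two numerical details above (kernel-matrix centering and the degrees-of-freedom correction in the normalization). The helper names are illustrative, not the actual functions in KCI.py:

```python
import numpy as np

def center_kernel(K):
    # Center a kernel matrix as H @ K @ H with H = I - (1/n) * ones((n, n));
    # this is the centering applied to kzx before it enters KCI_V_statistic.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def normalize_data(x, ddof=1):
    # Standardize each column. 'ddof' is the degrees-of-freedom correction:
    # np.std uses ddof=0 (divide by n) by default, while ddof=1 divides by
    # n - 1; the fix adjusts this correction so the normalization matches
    # the Matlab implementation (ddof=1 here is only illustrative).
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=ddof)
```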
Resolving the concerns:
When the kernel width is set manually, there are two cases that may return a weird p-value even in the Matlab version; the details are listed in Weird_test_cases.pdf. After checking with the expert, these results are expected, so extra care should be taken when setting the kernel width manually.
Comparison of the current Python version with the Matlab version on KCI-test results:
Since the Matlab version does not support such flexible arguments, we only test the basic setting: applying a Gaussian kernel to all variables X, Y, and Z, and setting the kernel width with the empirical rule. The results for Python are as follows:
```python
import numpy as np
import causallearn.utils.cit as cit

# np.random.seed(50)
# X = np.random.randn(30, 1)
# X_prime = np.random.randn(30, 1)
# Y = X + 0.5 * np.random.randn(30, 1)
# Z = Y + 0.5 * np.random.randn(30, 1)
# data = np.hstack((X, X_prime, Y, Z))
# np.savetxt("GaussianData.csv", data, delimiter=",")
data = np.loadtxt("GaussianData.csv", delimiter=",")
```
- If we use the gamma approximation (approx=True), the resulting p-value is deterministic (a short note on why follows the output below).
```python
approx = True
for use_gp in [True, False]:
    # The argument 'use_gp' only affects the conditional KCI, so only 'pXZY' will be affected.
    cit_CIT = cit.CIT(data, 'kci', kernelX='Gaussian', kernelY='Gaussian', kernelZ='Gaussian',
                      est_width='empirical', approx=approx, use_gp=use_gp)
    pXX = cit_CIT(0, 1)
    pXZ = cit_CIT(0, 3)
    pXZY = cit_CIT(0, 3, {2})
    print('\napprox =', approx, ', use_gp =', use_gp, ':')
    print('pXX:', round(pXX, 4))
    print('pXZ:', format(pXZ, '.4e'))
    print('pXZY', round(pXZY, 4))
```
Output:
```
approx = True , use_gp = True :
pXX: 0.2662
pXZ: 5.0105e-06
pXZY 0.3585

approx = True , use_gp = False :
pXX: 0.2662
pXZ: 5.0105e-06
pXZY 0.3977
```
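
For context (background, not code changed in this PR): the gamma approximation is deterministic because it moment-matches a Gamma distribution to the null distribution of the test statistic instead of simulating it. A rough sketch, assuming the null mean and variance have already been estimated from the kernel matrices:

```python
from scipy.stats import gamma

def gamma_approx_pvalue(test_stat, mean_appr, var_appr):
    # Moment-match Gamma(shape=k, scale=theta) to the approximate null mean
    # and variance of the statistic, then read off the upper-tail p-value.
    k = mean_appr ** 2 / var_appr
    theta = var_appr / mean_appr
    return 1.0 - gamma.cdf(test_stat, k, scale=theta)
```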
- Otherwise (approx=False), the resulting p-value is not deterministic, due to the simulation of the null distribution in the function null_sample_spectral(xxx) in the classes KCI_UInd and KCI_CInd. (I have checked that the other, deterministic parts produce the same output as the Matlab version on the same input.) Therefore, we run the program 500 times and report the mean of the p-values (a rough sketch of this spectral sampling follows the output below).
```python
approx = False
for use_gp in [True, False]:
    np.random.seed(50)
    pXX = []
    pXZ = []
    pXZY = []
    for i in range(500):
        cit_CIT = cit.CIT(data, 'kci', kernelX='Gaussian', kernelY='Gaussian', kernelZ='Gaussian',
                          est_width='empirical', approx=approx, use_gp=use_gp)
        p01 = cit_CIT(0, 1)
        pXX.append(p01)
        p03 = cit_CIT(0, 3)
        pXZ.append(p03)
        p032 = cit_CIT(0, 3, {2})
        pXZY.append(p032)
    print('\napprox =', approx, ', use_gp =', use_gp, ':')
    print('pXX: mean = {}, var = {}'.format(round(np.mean(pXX), 4), format(np.var(pXX), '.4e')))
    print('pXZ: mean = {}, var = {}'.format(round(np.mean(pXZ), 4), format(np.var(pXZ), '.4e')))
    print('pXZY: mean = {}, var = {}'.format(round(np.mean(pXZY), 4), format(np.var(pXZY), '.4e')))
```
Output:
```
approx = False , use_gp = True :
pXX: mean = 0.246, var = 1.8853e-04
pXZ: mean = 0.0001, var = 6.6156e-08
pXZY: mean = 0.3371, var = 4.9257e-05

approx = False , use_gp = False :
pXX: mean = 0.2468, var = 1.8931e-04
pXZ: mean = 0.0001, var = 7.1376e-08
pXZY: mean = 0.3736, var = 4.5726e-05
```
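
For context, the run-to-run variation comes from sampling the null distribution as a weighted sum of chi-squared(1) variables, with the weights given by eigenvalues of the centered kernel matrices. A rough sketch of the idea (the actual null_sample_spectral in KCI.py differs in details such as eigenvalue truncation and scaling):

```python
import numpy as np

def null_sample_spectral_sketch(eig_weights, n_samples=1000, rng=None):
    # Draw n_samples realizations of sum_i w_i * chi2_1, where the weights
    # w_i are (the leading) eigenvalues of the centered kernel matrices.
    # Because these draws are random, approx=False p-values vary across runs.
    rng = np.random.default_rng() if rng is None else rng
    chi2_draws = rng.chisquare(df=1, size=(n_samples, len(eig_weights)))
    return chi2_draws @ np.asarray(eig_weights)
```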
Using the same dataset GaussianData.csv generated by Python, the Matlab version results are provided in Matlab_results.pdf.
We can see that the two versions give the same p-values in the deterministic cases and almost the same p-values in the non-deterministic cases. The same conclusion holds for the other datasets I tried.
Test plan:
```
python -m unittest TestCIT_KCI  # should pass
```
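
The assertion-based tests mentioned above compare against p-values recorded from the reference version; an illustrative (not verbatim) shape of such a test:

```python
import unittest
import numpy as np
import causallearn.utils.cit as cit

class TestCIT_KCI(unittest.TestCase):
    def test_gaussian_data_gamma_approx(self):
        # Compare the deterministic (approx=True) p-value against the value
        # recorded from the reference version (26d8e36); 0.2662 is the pXX
        # value reported above.
        data = np.loadtxt("GaussianData.csv", delimiter=",")
        cit_CIT = cit.CIT(data, 'kci', kernelX='Gaussian', kernelY='Gaussian',
                          kernelZ='Gaussian', est_width='empirical',
                          approx=True, use_gp=False)
        self.assertAlmostEqual(cit_CIT(0, 1), 0.2662, places=4)
```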
TODO:
- Design more comprehensive test cases, including datasets for corner cases (e.g., data generated from more complicated distributions, discrete data types, larger sample sizes).