Biased and unbiased estimators of distance covariance and distance correlation [SRB07].
Estimators of the partial distance covariance and partial distance correlation [SR14].
It also provides tests based on these E-statistics:
Test of homogeneity based on the energy distance.
Test of independence based on distance covariance.
Installation
dcor is on PyPI and can be installed using pip:
pip install dcor
It is also available for conda using the conda-forge channel:
conda install -c conda-forge dcor
Previous versions of the package were in the vnmabus channel. That channel will no longer receive new releases, so users should switch to the conda-forge channel.
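Once installed, the estimators can be used directly. A minimal sketch using only the public dcor.distance_correlation function (the data here is illustrative, not from the docs):

```python
import numpy as np
import dcor

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x ** 2 + 0.1 * rng.normal(size=1000)  # nonlinear dependence on x

print(dcor.distance_correlation(x, y))  # clearly greater than 0
print(np.corrcoef(x, y)[0, 1])          # Pearson correlation is near 0 here
```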
Requirements
dcor is available for Python 3.5 or above, as well as Python 2.7, on all operating systems.
[SR13] Gábor J. Székely and Maria L. Rizzo. Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143(8):1249–1272, 2013. URL: http://www.sciencedirect.com/science/article/pii/S0378375813000633, doi:10.1016/j.jspi.2013.03.018.
[SR14] Gábor J. Székely and Maria L. Rizzo. Partial distance correlation with methods for dissimilarities. The Annals of Statistics, 42(6):2382–2412, December 2014. doi:10.1214/14-AOS1255.
[SRB07] Gábor J. Székely, Maria L. Rizzo, and Nail K. Bakirov. Measuring and testing dependence by correlation of distances. The Annals of Statistics, 35(6):2769–2794, December 2007. doi:10.1214/009053607000000505.
I am trying to compute a pairwise distance correlation for every column in a pandas DataFrame of shape (1000, 10000); that is, a pairwise correlation of all columns, each column against every other column.
This takes far too long, many hours, and in some cases doesn't finish. Is there an implementation that is more optimised? Any advice would be much appreciated.
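A straightforward, if still quadratic in the number of columns, sketch of the computation described above; the DataFrame df and the method="mergesort" argument are assumptions on my part, not taken from the question:

```python
import numpy as np
import dcor

def pairwise_distance_correlation(df):
    """Distance correlation between every pair of numeric columns of df (sketch)."""
    n = df.shape[1]
    out = np.eye(n)
    cols = [df.iloc[:, i].to_numpy() for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # For univariate real columns, 'mergesort' runs in O(n log n)
            # per pair, much cheaper than the naive O(n^2) method.
            out[i, j] = out[j, i] = dcor.distance_correlation(
                cols[i], cols[j], method="mergesort")
    return out
```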
Note: this builds on #27, and changes from that branch will appear here until that is merged.
This re-implements most of the energy distance functions and permutation tests using numba, which provides significant performance improvements. I ran some benchmarks comparing numba to pure Python (note: this isn't comparing numba to the original code that used numpy tricks; it's comparing my changes with and without the JIT). The results suggest that numba improves performance for any number of permutations above 250. I expect this will also hold when running multiple different permutation tests in the same program.
However, the costs are:
Some ugly numba workarounds, like re-implementing the permutation function using nested loops.
We lose the ability to pass in arbitrary average functions; the average parameter is now a string, either "mean" or "median" (see the sketch after this list).
We lose the use of the RandomState object and have to rely solely on np.random.seed().
Startup costs associated with the JIT compiler: for fewer than 250 permutations, the JIT compilation slows down the task.
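For illustration, here is how a call would look under these constraints. This is a sketch assuming the energy_test signature in dcor.homogeneity; the parameter values are arbitrary.

```python
import numpy as np
import dcor

# Per the list above, seeding must go through np.random.seed();
# passing a RandomState object is no longer supported.
np.random.seed(0)

x = np.random.normal(size=(100, 2))
y = np.random.normal(loc=1.0, size=(100, 2))

# average is now a plain string, "mean" or "median", instead of an
# arbitrary callable.
result = dcor.homogeneity.energy_test(x, y, num_resamples=300, average="median")
print(result)
```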
Hi, I've hit a bit of a problem. I was trying to work out why I was getting different results from the ecp R package versus dcor. After some intense investigation, the cause seems to be at the point of taking the mean of the within-sample distances. Note, this is before we apply the coefficient or consider the between-sample distances. Precisely, I'm referring to the mean taken here: https://github.com/vnmabus/dcor/blob/161a6f5928ec0f30ce89fcfd5e90e6ed9315e383/dcor/_energy.py#L41-L42
In all the Székely and Rizzo papers (e.g. Székely & Rizzo, 2004), this mean is defined as the arithmetic mean over all $n^2$ entries of the within-sample distance matrix, the same as you have used in dcor:

$$\frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \lVert x_i - x_j \rVert$$
However, in the Matteson and James papers I have been looking at (e.g. Matteson & James, 2014; James et al., 2016), they seem to define it as follows:

$$\binom{n}{2}^{-1} \sum_{1 \le i < j \le n} \lVert x_i - x_j \rVert$$
What they seem to be doing here is summing the lower triangle of the matrix, excluding the diagonal, and then dividing by the combination n choose 2. So if we had a sample with 5 items, the full distance matrix would have 5 × 5 = 25 entries, but the lower triangle would only have 10. They would sum these distances and divide by 5 choose 2, which is 10. So this is also taking a mean, but it's the mean excluding the diagonal, which is of course always 0 in a within-sample distance matrix. The ultimate outcome is that their "mean" is actually $\frac{n}{n-1}$ times the true mean, which is larger than it should be, as it isn't counting the 0s on the diagonal.
Note that this is also visible in the implementation of their work, in the ecp package. Here, they sum the full matrix but then divide by $n(n-1)$, which is equivalent to the above, but not equivalent to the true mean:
https://github.com/zwenyu/ecp/blob/65a9bb56308d25ce3c6be4d6388137f428118248/src/energyChangePoint.cpp#L112
My question is this: are they simply wrong? If not, is there any theory supporting this alternative formula? If there is, should it be something supported in dcor? Fortunately, it kind of already is, thanks to my customizable average feature, but it could be called out specifically. I'd appreciate your input here, as you likely understand this domain better than I do.
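To make the discrepancy concrete, here is a small numeric check, independent of either package, showing that the ecp-style mean is n/(n-1) times the arithmetic mean (synthetic data, for illustration only):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 1))
d = squareform(pdist(x))  # 5 x 5 within-sample distance matrix, zero diagonal

n = d.shape[0]
true_mean = d.mean()  # sum over all n**2 entries, divided by n**2
ecp_mean = d[np.tril_indices(n, k=-1)].sum() / (n * (n - 1) / 2)  # n choose 2

print(ecp_mean / true_mean)  # n / (n - 1) = 1.25 for n = 5
```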
Hello. Thank you for this very useful package. I need to query the version installed and check that it is >=0.5.3.
In dcor/__init__.py:
```python
try:
    with open(_os.path.join(_os.path.dirname(__file__),
                            '..', 'VERSION'), 'r') as version_file:
        __version__ = version_file.read().strip()
except IOError as e:
    if e.errno != _errno.ENOENT:
        raise

__version__ = "0.0"
```
You are reading the version from the VERSION file, but at the end you unconditionally force the version number to "0.0". This always returns "0.0" when I do dcor.__version__.
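For reference, one possible fix is to move the fallback inside the except block, so that "0.0" is only used when the VERSION file is genuinely missing. This is a sketch of the idea, not the project's actual fix:

```python
try:
    with open(_os.path.join(_os.path.dirname(__file__),
                            '..', 'VERSION'), 'r') as version_file:
        __version__ = version_file.read().strip()
except IOError as e:
    if e.errno != _errno.ENOENT:
        raise
    # Fall back only when the VERSION file does not exist.
    __version__ = "0.0"
```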
I installed the Python dcor package, and I got the following error whenever I tried to import dcor.
Traceback (most recent call last):
File "", line 1, in
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/init.py", line 14, in
from . import independence # noqa
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/independence.py", line 13, in
from ._dcor import u_distance_correlation_sqr
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/_dcor.py", line 27, in
from ._fast_dcov_avl import _distance_covariance_sqr_avl_generic
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/dcor/_fast_dcov_avl.py", line 89, in
_generate_partial_sum_2d(compiled=True))
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/decorators.py", line 200, in wrapper
disp.compile(sig)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 768, in compile
cres = self._compiler.compile(args, return_type)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 81, in compile
raise retval
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 91, in _compile_cached
retval = self._compile_core(args, return_type)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/dispatcher.py", line 109, in _compile_core
pipeline_class=self.pipeline_class)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 551, in compile_extra
return pipeline.compile_extra(func)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 331, in compile_extra
return self._compile_bytecode()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 393, in _compile_bytecode
return self._compile_core()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 373, in _compile_core
raise e
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler.py", line 364, in _compile_core
pm.run(self.state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 347, in run
raise patched_exception
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 338, in run
self._runPass(idx, pass_inst, state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_lock.py", line 32, in _acquire_compile_lock
return func(*args, **kwargs)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 302, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/compiler_machinery.py", line 275, in check
mangled = func(compiler_state)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typed_passes.py", line 95, in run_pass
raise_errors=self._raise_errors)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typed_passes.py", line 66, in type_inference_stage
infer.build_constraint()
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 938, in build_constraint
self.constrain_statement(inst)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1274, in constrain_statement
self.typeof_assign(inst)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1345, in typeof_assign
self.typeof_global(inst, inst.target, value)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1444, in typeof_global
typ = self.resolve_value_type(inst, gvar.value)
File "/Users/sanghoonkim/anaconda3/lib/python3.7/site-packages/numba/typeinfer.py", line 1366, in resolve_value_type
raise TypingError(msg, loc=inst.loc)
numba.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name '_dyad_update': cannot determine Numba type of <class 'function'>
File "anaconda3/lib/python3.7/site-packages/dcor/_fast_dcov_avl.py", line 70:
def _partial_sum_2d(x, y, c, ix, iy, sx_c, sy_c, c_sum, l_max,
dyad_update = _dyad_update_compiled if compiled else _dyad_update
^
>>> import dcor
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/quentin/.local/lib/python3.8/site-packages/dcor/__init__.py", line 14, in <module>
from . import independence # noqa
File "/home/quentin/.local/lib/python3.8/site-packages/dcor/independence.py", line 11, in <module>
from ._dcor import u_distance_correlation_sqr
File "/home/quentin/.local/lib/python3.8/site-packages/dcor/_dcor.py", line 26, in <module>
from ._fast_dcov_mergesort import _distance_covariance_sqr_mergesort_generic
File "/home/quentin/.local/lib/python3.8/site-packages/dcor/_fast_dcov_mergesort.py", line 208, in <module>
_distance_covariance_sqr_mergesort_generic_impl_compiled = numba.njit(
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/decorators.py", line 221, in wrapper
disp.compile(sig)
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/dispatcher.py", line 891, in compile
cres = self._cache.load_overload(sig, self.targetctx)
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 644, in load_overload
return self._load_overload(sig, target_context)
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 651, in _load_overload
data = self._cache_file.load(key)
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 495, in load
overloads = self._load_index()
File "/home/quentin/.local/lib/python3.8/site-packages/numba/core/caching.py", line 511, in _load_index
with open(self._index_path, "rb") as f:
OSError: [Errno 36] File name too long: '/home/quentin/.local/lib/python3.8/site-packages/dcor/__pycache__/_fast_dcov_mergesort._generate_distance_covariance_sqr_mergesort_generic_impl.locals._distance_covariance_sqr_mergesort_generic_impl-163.py38.nbi'
Have I covered all public APIs, ensuring they can all be configured?
The test statistic ends up being negative, and therefore has a p-value of 1, when used to compare a standard normal and a t distribution in test_different_distributions. Does this make sense, or is it revealing a flaw in the code somewhere?
I have started using dcor as I need to find pairwise correlations between two variables/vectors for every pairwise comparison in a dataframe. I am using the distance correlation as I need to find not just linear pairwise correlations but also non-linear ones.
Having read the documentation, I know this is the correct implementation for this purpose. However, as I understand it, SciPy also provides a distance correlation function. I am getting different results when using dcor and SciPy, and was wondering if you could explain why? I am unsure if SciPy is actually using the same distance correlation, or if their implementation contains something obvious I have missed that leads to the different results.
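One possible explanation, offered as an assumption rather than a confirmed diagnosis: scipy.spatial.distance.correlation computes the correlation distance, that is, one minus the Pearson correlation, which is a different quantity from Székely's distance correlation. A quick sketch showing the two disagree by design:

```python
import numpy as np
from scipy.spatial import distance
import dcor

rng = np.random.default_rng(0)
u = rng.normal(size=100)
v = u ** 2 + 0.1 * rng.normal(size=100)  # nonlinear, almost no linear relation

print(distance.correlation(u, v))       # 1 - Pearson r: close to 1 here
print(dcor.distance_correlation(u, v))  # clearly positive: detects the dependence
```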
dcor returns a scalar for the distance correlation of a matrix and a vector. I cannot yet understand why this is the case: isn't the distance correlation defined between two vectors? I would therefore expect a vector of correlations as the output.
Hello,
I am trying to get the distance correlation between two very large vectors (25k elements each), and the dcor function gets killed due to an out-of-memory error. How can we fix that?
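One workaround to consider, a sketch assuming both inputs are one-dimensional and that the method parameter of dcor.distance_correlation accepts "mergesort": the non-naive methods compute distance covariance without materialising the full pairwise distance matrices.

```python
import numpy as np
import dcor

rng = np.random.default_rng(0)
x = rng.normal(size=25_000)
y = x + rng.normal(size=25_000)

# For univariate data, 'mergesort' (and 'avl') run in O(n log n) time and
# O(n) memory, instead of building two 25k x 25k distance matrices
# (about 5 GB each in float64) as the naive method does.
print(dcor.distance_correlation(x, y, method="mergesort"))
```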
We can implement energy distance in terms of distance covariance, as shown in https://arxiv.org/pdf/1910.08883.pdf.
We need to study:
How this affects the current parameters of energy distance.
How to allow users to optionally access the different implementations of distance covariance, as well as the old energy distance implementation (if needed).
As mentioned in https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf), the energy distance can be used to implement a linkage method for hierarchical clustering.
In https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf), a measure of asymmetry, distance skewness, is described, as well as a test of symmetry based on it. We should attempt to implement both in this package.
Energy distance can be used to perform goodness-of-fit tests, as mentioned in https://doi.org/10.1016/j.jspi.2013.03.018 (https://pages.stat.wisc.edu/~wahba/stat860public/pdf4/Energy/JSPI5102.pdf).
It would be useful to create a new submodule, goodness, that could include some of the following (a sketch of the general idea follows the list):
[ ] Two-parameter exponential distribution goodness-of-fit test.
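As an illustration of the general shape such tests take, here is a sketch of the univariate energy goodness-of-fit statistic against the standard normal distribution, using the closed-form expectations from the energy statistics literature; the function name is hypothetical and not part of dcor:

```python
import numpy as np
from scipy.stats import norm

def energy_gof_normal(x):
    """Sketch of the energy goodness-of-fit statistic against N(0, 1)."""
    n = len(x)
    # Closed form for E|x_i - Z| with Z ~ N(0, 1).
    e_xz = 2 * norm.pdf(x) + x * (2 * norm.cdf(x) - 1)
    # E|Z - Z'| for independent standard normals.
    e_zz = 2 / np.sqrt(np.pi)
    # Mean pairwise distance within the sample.
    e_xx = np.abs(x[:, None] - x[None, :]).mean()
    return n * (2 * e_xz.mean() - e_zz - e_xx)

x = np.random.default_rng(0).normal(size=200)
print(energy_gof_normal(x))  # small for data drawn from N(0, 1)
```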
The computation of pairwise distances is the main bottleneck of the naive algorithm for distance covariance. Currently we use scipy's cdist for NumPy arrays, and a broadcasting computation in other cases.
Any performance improvement to this function is thus well received.
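For reference, a minimal sketch of the two strategies mentioned above (synthetic data; the broadcasting form is one common way to write it, not necessarily dcor's exact code):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))

d_cdist = cdist(x, x)  # scipy path, used for plain NumPy arrays
d_broadcast = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)  # generic fallback

assert np.allclose(d_cdist, d_broadcast)
```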
Releases
0.6 (Dec 26, 2022)
What's Changed
Typing
Fixed wrong types in u_distance_stats_sqr.
Added missing types in rowwise.
Documentation
New documentation theme.
Added links in the theory.
Added examples to the documentation.
Warning added to partial distance correlation/covariance docstrings by @jltorrecilla in https://github.com/vnmabus/dcor/pull/47
Performance
Improve the computation time of distances for NumPy arrays, which improves performance for energy distance and the naive case of distance covariance/correlation.
Improve the performance of the AVL algorithm for distance covariance, bringing it closer to mergesort.
Refactor distance covariance to allow computing distance correlation without additional calls to the covariance function.
New Contributors
@jltorrecilla made their first contribution in https://github.com/vnmabus/dcor/pull/47
Full Changelog: https://github.com/vnmabus/dcor/compare/0.5.7...0.6