A Numba-based two-point correlation function calculator using a grid decomposition

Overview

numba-2pcf

tests

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Aside from the 2PCF calculation, the particle_grid module is both simple and fast and may be useful on its own as a way to partition particle sets in 3D.

Installation

$ git clone https://github.com/lgarrison/numba-2pcf.git
$ cd numba-2pcf
$ python -m pip install -e .

Example

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**6
box = 2.
pos = rng.random((N,3), dtype=np.float32)*box

res = numba_2pcf(pos, box, Rmax=0.05, nbin=10)
res.pprint_all()
        rmin                 rmax                 rmid                    xi            npairs 
-------------------- -------------------- -------------------- ----------------------- --------
                 0.0 0.005000000074505806 0.002500000037252903   -0.004519257448573177    65154
0.005000000074505806 0.010000000149011612  0.00750000011175871   0.0020113763064291135   459070
0.010000000149011612  0.01500000022351742 0.012500000186264515    0.000984359247434119  1244770
 0.01500000022351742 0.020000000298023225 0.017500000260770324  -6.616896085054336e-06  2421626
0.020000000298023225  0.02500000037252903 0.022500000335276125  0.00019365366488166558  3993210
 0.02500000037252903  0.03000000044703484 0.027500000409781934   5.769329601057471e-05  5956274
 0.03000000044703484  0.03500000052154064 0.032500000484287736   0.0006815801672250821  8317788
 0.03500000052154064  0.04000000059604645 0.037500000558793545    2.04711840243732e-05 11061240
 0.04000000059604645  0.04500000067055226 0.042500000633299354   9.313641918828885e-05 14203926
 0.04500000067055226  0.05000000074505806  0.04750000070780516 -0.00011690771042793813 17734818

Performance

The goal of this project is not to provide the absolute best performance that given hardware can produce, but it is a goal to provide as good performance as Numba will let us reach (while keeping the code readable). So we pay special attention to things like dtype (use float32 particle inputs when possible!), parallelization, and some early-exit conditions (when we know a pair can't fall in any bin).

As a demonstration that this code provides passably good performance, here's a dummy test of 107 unclustered data points in a 2 Gpc/h box (so number density 1.2e-3), with Rmax=200 Mpc/h and bin width of 1 Mpc/h:

from numba_2pcf.cf import numba_2pcf
import numpy as np

rng = np.random.default_rng(123)
N = 10**6
box = 2000
pos = rng.random((N,3), dtype=np.float32)*box

%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=False, nthread=24)  # 3.5 s
%timeit numba_2pcf(pos, box, Rmax=150, nbin=150, corrfunc=True, nthread=24)  # 1.3 s

So within a factor of 3 of Corrfunc, and we aren't even exploiting the symmetry of the autocorrelation (i.e. we count every pair twice). Not bad!

Testing Against Corrfunc

The code is tested against Corrfunc. And actually, the numba_2pcf() function takes a flag corrfunc=True that calls Corrfunc instead of the Numba implementation to make such testing even easier.

Details

numba_2pcf works a lot like Corrfunc, or any other grid-based 2PCF code: the 3D volume is divided into a grid of cells at least Rmax in size, where Rmax is the maximum radius of the correlation function measurement. Then, we know all valid particle pairs must be in neighboring cells. So the task is simply to loop through each cell in the grid, pairing it with each of its 26 neighbors (plus itself). We parallelize over cell pairs, and add up all the pair counts across threads at the end.

This grid decomposition prunes distant pairwise comparisons, so even though the runtime still formally scales as O(N2), it makes the 2PCF tractable for many realistic problems in cosmology and large-scale structure.

A numba implementation isn't likely to beat Corrfunc on speed, but numba can still be fast enough to be useful (especially when the computation parallelizes well). The idea is that this code provides a "fast enough" parallel implementation while still being highly readable --- the 2PCF implementation is about 150 lines of code, and the gridding scheme 100 lines.

Branches

The particle-jackknife branch contains an implementation of an idea for computing the xi(r) variance based on the variance of the per-particle xi(r) measurements. It doesn't seem to be measuring the right thing, but the code is left for posterity.

Acknowledgments

This repo was generated from @DFM's Cookiecutter Template. Thanks, DFM!

You might also like...
A neural-based binary analysis tool

A neural-based binary analysis tool Introduction This directory contains the demo of a neural-based binary analysis tool. We test the framework using

Supply a wrapper ``StockDataFrame`` based on the ``pandas.DataFrame`` with inline stock statistics/indicators support.

Stock Statistics/Indicators Calculation Helper VERSION: 0.3.2 Introduction Supply a wrapper StockDataFrame based on the pandas.DataFrame with inline s

Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

weightedcalcs weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more. Features Plays we

Statistical package in Python based on Pandas
Statistical package in Python based on Pandas

Pingouin is an open-source statistical package written in Python 3 and based mostly on Pandas and NumPy. Some of its main features are listed below. F

A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow
A probabilistic programming library for Bayesian deep learning, generative models, based on Tensorflow

ZhuSuan is a Python probabilistic programming library for Bayesian deep learning, which conjoins the complimentary advantages of Bayesian methods and

Orchest is a browser based IDE for Data Science.
Orchest is a browser based IDE for Data Science.

Orchest is a browser based IDE for Data Science. It integrates your favorite Data Science tools out of the box, so you don’t have to. The application is easy to use and can run on your laptop as well as on a large scale cloud cluster.

GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors

GWpy is a collaboration-driven Python package providing tools for studying data from ground-based gravitational-wave detectors. GWpy provides a user-f

A powerful data analysis package based on mathematical step functions.  Strongly aligned with pandas.
A powerful data analysis package based on mathematical step functions. Strongly aligned with pandas.

The leading use-case for the staircase package is for the creation and analysis of step functions. Pretty exciting huh. But don't hit the close button

Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

Owner
Lehman Garrison
Flatiron Research Fellow at the Center for Computational Astrophysics
Lehman Garrison
Active Learning demo using two small datasets

ActiveLearningDemo How to run step one put the dataset folder and use command below to split the dataset to the required structure run utils.py For ea

null 3 Nov 10, 2021
Pipetools enables function composition similar to using Unix pipes.

Pipetools Complete documentation pipetools enables function composition similar to using Unix pipes. It allows forward-composition and piping of arbit

null 186 Dec 29, 2022
A tax calculator for stocks and dividends activities.

Revolut Stocks calculator for Bulgarian National Revenue Agency Information Processing and calculating the required information about stock possession

Doino Gretchenliev 200 Oct 25, 2022
Python beta calculator that retrieves stock and market data and provides linear regressions.

Stock and Index Beta Calculator Python script that calculates the beta (β) of a stock against the chosen index. The script retrieves the data and resa

sammuhrai 4 Jul 29, 2022
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
Two phase pipeline + StreamlitTwo phase pipeline + Streamlit

Two phase pipeline + Streamlit This is an example project that demonstrates how to create a pipeline that consists of two phases of execution. In betw

Rick Lamers 1 Nov 17, 2021
Python utility to extract differences between two pandas dataframes.

Python utility to extract differences between two pandas dataframes.

Jaime Valero 8 Jan 7, 2023
An extension to pandas dataframes describe function.

pandas_summary An extension to pandas dataframes describe function. The module contains DataFrameSummary object that extend describe() with: propertie

Mourad 450 Dec 30, 2022
A utility for functional piping in Python that allows you to access any function in any scope as a partial.

WithPartial Introduction WithPartial is a simple utility for functional piping in Python. The package exposes a context manager (used with with) calle

Michael Milton 1 Oct 26, 2021
Randomisation-based inference in Python based on data resampling and permutation.

Randomisation-based inference in Python based on data resampling and permutation.

null 67 Dec 27, 2022