Fast 1D and 2D histogram functions in Python


About

Sometimes you just want to compute simple 1D or 2D histograms with regular bins. Fast. No nonsense. Numpy's histogram functions are versatile and can handle, for example, non-regular binning, but this versatility comes at the expense of performance.

The fast-histogram mini-package aims to provide simple and fast histogram functions for regular bins. It doesn't do anything complicated - it implements a straightforward histogram algorithm in C and keeps it simple. The aim is to have functions that are fast but also robust and reliable. The result is a 1D histogram function that is 7-15x faster than numpy.histogram, and a 2D histogram function that is 20-25x faster than numpy.histogram2d.

To install:

pip install fast-histogram

or if you use conda you can instead do:

conda install -c conda-forge fast-histogram

The fast_histogram module provides two functions, histogram1d and histogram2d:

from fast_histogram import histogram1d, histogram2d
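
For instance, a minimal 1D call looks like this (note that, unlike numpy.histogram, both bins and range must be given explicitly, since the bins are always regular):

import numpy as np
from fast_histogram import histogram1d

x = np.random.random(1_000_000)

# 30 regular bins spanning [0, 1); returns an array of 30 counts
h = histogram1d(x, bins=30, range=(0, 1))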

Example

Here's an example of binning 10 million points into a regular 2D histogram:

In [1]: import numpy as np

In [2]: x = np.random.random(10_000_000)

In [3]: y = np.random.random(10_000_000)

In [4]: %timeit _ = np.histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
935 ms ± 58.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: from fast_histogram import histogram2d

In [6]: %timeit _ = histogram2d(x, y, range=[[-1, 2], [-2, 4]], bins=30)
40.2 ms ± 624 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

(note that the 10_000_000 syntax is only valid in Python 3.6 and later; use 10000000 in earlier versions)

The version here is over 20 times faster! The following plot shows the speedup as a function of array size for the bin parameters shown above:

[Figure: comparison of performance between Numpy and fast-histogram]

as well as results for the 1D case, also with 30 bins. The speedup for the 2D case is consistently 20-25x; for the 1D case it goes from 15x for small arrays down to around 7x for large arrays.

Q&A

Why don't the histogram functions return the edges?

Computing and returning the edges may seem trivial, but it can slow things down by a factor of a few when computing histograms of 10^5 or fewer elements, so not returning the edges is a deliberate performance decision. You can easily compute the edges yourself if needed, using numpy.linspace.
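
For example, the edges matching nx regular bins spanning xmin to xmax are:

import numpy as np

xmin, xmax, nx = -1.0, 2.0, 30
edges = np.linspace(xmin, xmax, nx + 1)  # nx bins have nx + 1 edges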

Doesn't package X already do this, but better?

This may very well be the case! If this duplicates another package, or if it is possible to use Numpy in a smarter way to get the same performance gains, please open an issue and I'll consider deprecating this package :)

One package that does include fast histogram functions (including in n-dimensions) and can compute other statistics is vaex, so take a look there if you need more advanced functionality!

Are the 2D histograms not transposed compared to what they should be?

There is technically no 'right' or 'wrong' orientation - here we adopt the convention that gives results consistent with Numpy, so:

numpy.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])

should give the same result as:

fast_histogram.histogram2d(x, y, range=[[xmin, xmax], [ymin, ymax]], bins=[nx, ny])
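
A quick self-contained way to check the shared convention (values exactly on the upper edge are the one place the implementations can disagree - see the bin-edge discussion in the comments below):

import numpy as np
from fast_histogram import histogram2d

x, y = np.random.random(100_000), np.random.random(100_000)
h_np, _, _ = np.histogram2d(x, y, range=[[0, 1], [0, 1]], bins=30)
h_fh = histogram2d(x, y, range=[[0, 1], [0, 1]], bins=30)
np.testing.assert_allclose(h_np, h_fh)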

Why not contribute this to Numpy directly?

As mentioned above, the Numpy functions are much more versatile, so they could not be replaced by the ones here. One option would be to check in Numpy's functions for cases that are simple and dispatch to functions such as the ones here, or add dedicated functions for regular binning. I hope we can get this in Numpy in some form or another eventually, but for now, the aim is to have this available to packages that need to support a range of Numpy versions.

Why not use Cython?

I originally implemented this in Cython, but found that I could get a 50% performance improvement by going straight to a C extension.

What about using Numba?

I specifically want to keep this package as easy as possible to install, and while Numba is a great package, it is not trivial to install outside of Anaconda.

Could this be parallelized?

This may benefit from parallelization under certain circumstances. The easiest solution might be to use OpenMP, but this won't work on all platforms, so it would need to be made optional.

Couldn't you make it faster by using the GPU?

Almost certainly, though the aim here is to have an easily installable and portable package, and introducing GPUs is going to affect both of these.

Why make a package specifically for this? This is a tiny amount of functionality

Packages that need this could simply bundle their own C extension or Cython code to do this, but the main motivation for releasing this as a mini-package is to avoid making pure-Python packages into packages that require compilation just because of the need to compute fast histograms.

Can I contribute?

Yes please! This is not meant to be a finished package, and I welcome pull requests to improve things.

Comments
  • Add N-dimensional histogramming

    Hi all,

    I happen to have a (scientific) application where I need fast histograms, and this package looked like exactly what I need! If only it supported histograms in arbitrary dimensions, it would be perfect... and I think I did in fact manage to pull that off. :)

    This PR adds the function histogramdd to the package.

    It is a straightforward generalization of the 1D and 2D code, except that the input sample is given as a tuple. The code automatically discovers how many arrays the tuple contains and sets the output dimensions accordingly. If the dimension is D and the number of samples N, the tuple must contain D arrays of N elements each. Bins and ranges are passed as arrays with shapes (D,) and (D, 2) respectively, giving the number of bins and the range for each dimension.

    The output of histogram2d(x, y, range, bins) should be the same as histogramdd((x, y), range, bins) - see the usage sketch after this item. The generalized function is a bit slower, so it is still beneficial to use the dedicated 1D and 2D functions where appropriate.

    I'm also working on the weighted version, but I thought I'd make a PR now to get some feedback. This is still missing tests and documentation, and maybe someone with more C and numpy API experience could check whether it's possible to make things a little faster still, to match the performance of the 1D and 2D code.

    opened by atrettin 8
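
    A usage sketch based on the description in this PR (the exact signature and argument names here are assumptions, since the final API may differ):

    import numpy as np
    from fast_histogram import histogramdd

    x = np.random.random(1_000_000)
    y = np.random.random(1_000_000)
    z = np.random.random(1_000_000)

    # D = 3 arrays of N samples each; bins has shape (D,), range has shape (D, 2)
    h = histogramdd((x, y, z), bins=[10, 20, 30],
                    range=[[0, 1], [0, 1], [0, 1]])
    print(h.shape)  # (10, 20, 30)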
  • GIL release

    In C code that takes some time, one can nominally release the GIL with:

    Py_BEGIN_ALLOW_THREADS
    <Non-Python calling code>
    Py_END_ALLOW_THREADS
    

    It looks like it would be pretty trivial to enclose your for loops with these macros, is this something you're interested in?

    opened by robbmcleod 8
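
    For context, this is the kind of Python-level pattern that releasing the GIL would help (a sketch: with the GIL held for the whole C loop, these threads serialize; with the macros above they can run concurrently):

    import numpy as np
    from threading import Thread
    from fast_histogram import histogram1d

    chunks = [np.random.random(5_000_000) for _ in range(4)]
    results = [None] * 4

    def work(i):
        # each thread histograms its own chunk of the data
        results[i] = histogram1d(chunks[i], bins=30, range=(0, 1))

    threads = [Thread(target=work, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()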
  • Make behavior at the upper bin edge consistent between 1D, 2D and DD

    This PR contains one critical fix and one fix for consistency [issue #10] .

    Critically, when a data point was exactly at the upper edge in all binning dimensions, the bin_idx in histogramdd (L636) could become equal to the array size, causing a write outside the allocated memory and a segfault.

    I added checks for whether a data point is at the uppermost border in each binning dimension; if so, the count is added to the last bin, consistent with numpy's behavior. This causes a small drop in performance. I tried various ways of implementing the check (if statements, ternary operators, subtraction of ix/nx) and found that simple if statements cost the least in the end. The speedup is now ~80% of what it used to be on my machine vs. numpy.

    This is the benchmark result on my machine without the checks: [image: speedup_compared_open_bins]

    And this is what I am getting now with the check: [image: speedup_compared_if_statements]

    To my mind this is still acceptable performance, but we could also choose to break consistency with numpy here. In that case, I would fix the segfault in histogramdd by changing one > to a >= in the check of whether the point is in range.

    opened by atrettin 7
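
    The numpy convention being matched in this PR is easy to check directly:

    import numpy as np

    # a value exactly at the upper edge is counted in the last bin,
    # not treated as out of range
    counts, edges = np.histogram([0.0, 0.5, 1.0], bins=2, range=(0, 1))
    print(counts)  # [1 2]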
  • Weighted histogram differs from numpy unless the data is explicitly copied

    First, thanks for the great library. I suspect I am hitting a memory issue. Unless I explicitly copy the arrays before passing them to histogram2d, the results differ from numpy. So far, I have only seen this happen with the weighted version of the 2D histogram. Code and output follow (see also the note after this item).

    #! /usr/bin/env python
    import numpy as np
    import fast_histogram as ft
    
    print(f"Numpy version {np.__version__}")
    print(f"Fast-histogram version {ft.__version__}")
    
    # Initial data
    #######################################################
    nbins=100
    max_val=17.64
    min_val=0.0
    set_range=[[min_val, max_val],[min_val, max_val]]
    with np.load("data.npz") as d:
        pot = d["pot"]
        displ_new = d["displ_new"]
        bond_ind = d["bond_ind"]
    
    
    # Without copy
    print("NOT EXPLICITLY COPYING THE DATA")
    pot_array = pot[:, bond_ind[0,0], bond_ind[0,1]]
    arr1 = displ_new[:,bond_ind[0,0]]
    arr2 = displ_new[:,bond_ind[0,1]]
    
    #Run the histograms
    h_np = np.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)[0]
    h_ft = ft.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)
    
    print("===FAST-HIST===")
    print(h_ft)
    print("===NUMPY===")
    print(h_np)
    
    
    # With copy
    print("EXPLICITLY COPYING THE DATA")
    pot_array = pot[:, bond_ind[0,0], bond_ind[0,1]].copy()
    arr1 = displ_new[:,bond_ind[0,0]].copy()
    arr2 = displ_new[:,bond_ind[0,1]].copy()
    
    #Run the histograms
    h_np = np.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)[0]
    h_ft = ft.histogram2d(arr1, arr2, nbins, range=set_range, weights=pot_array)
    
    print("===FAST-HIST===")
    print(h_ft)
    print("===NUMPY===")
    print(h_np)
    

    with output

    Numpy version 1.18.1
    Fast-histogram version 0.8
    NOT EXPLICITLY COPYING THE DATA
    ===FAST-HIST===
    [[0.         0.         0.         ... 0.         0.         0.        ]
     [0.0425809  0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.01108671 ... 0.         0.         0.        ]
     ...
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]]
    ===NUMPY===
    [[0.         0.         0.         ... 0.         0.         0.        ]
     [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
     [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
     ...
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]]
    EXPLICITLY COPYING THE DATA
    ===FAST-HIST===
    [[0.         0.         0.         ... 0.         0.         0.        ]
     [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
     [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
     ...
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]]
    ===NUMPY===
    [[0.         0.         0.         ... 0.         0.         0.        ]
     [0.00524697 0.11627543 0.00218829 ... 0.         0.         0.        ]
     [0.         0.0233833  0.08906353 ... 0.         0.         0.        ]
     ...
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]
     [0.         0.         0.         ... 0.         0.         0.        ]]
    

    The data file necessary to run the code is attached. Due to github restrictions, I had to zip it first. data.npz.zip

    bug, help wanted
    opened by LucasCampos 6
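
    If the cause is what the copies suggest - the indexing above produces strided, non-contiguous views - then np.ascontiguousarray is a cheaper-to-read equivalent of .copy() (a sketch of the general pattern, not a confirmed diagnosis):

    import numpy as np

    a = np.random.random((1000, 4, 4))
    view = a[:, 0, 1]                    # strided, non-contiguous view
    print(view.flags["C_CONTIGUOUS"])    # False
    safe = np.ascontiguousarray(view)    # contiguous copy, like .copy()
    print(safe.flags["C_CONTIGUOUS"])    # True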
  • Implement histogram1d with sum of weights squared returned

    Hi,

    I'd be interested in returning the sum of weights squared for histograms. I took a stab at implementing the 1D case. If this is something you're not interested in, no hard feelings; if you think the project has a place for this, I'd be happy to implement unit tests and a 2D version.

    Thanks, Doug

    opened by douglasdavis 4
  • histogram result has strange spikes as compared to numpy histogram

    Data file: https://ufile.io/cnj9l

    Python Code:

    import numpy as np
    import matplotlib.pyplot as plt
    from fast_histogram import histogram1d

    data = np.load('x.npz')
    h_np, _ = np.histogram(data['x'], bins=1100, range=[0, 1100])
    h_fast = histogram1d(data['x'], bins=1100, range=[0, 1100])

    plt.plot(h_np[:-1], 'r--', label='numpy')
    plt.plot(h_fast[:-1], label='fast')
    plt.legend()

    The result is as follows: [image]

    The spikes also appear with histogram2d.

    opened by sahaskn 3
  • Add float-32 support

    • Adds npy_float32 functions. Casting determination is done by a casting dictionary, fast_histogram._cast_to, which mimics numpy.histogram.
    • Releases the GIL during the pure-C operation, although this does not appear to measurably improve performance.

    Sorry for the wholesale change of double -> npy_float64; I can revert that if you want.

    opened by robbmcleod 3
  • upstreaming

    Do you have any intention of upstreaming this? I feel like it would be really useful (together with a bincount2d, which is what I'm usually missing).

    opened by amueller 2
  • setup_requires without install_requires not supported by NumPy

    In setup.py, the build-time dependency on NumPy is declared via setup_requires without an install_requires. This is poorly supported and leads to the following problem on a clean Python install:

    pip3 install fast-histogram
    python3 -c 'import fast_histogram'
    ImportError: No module named 'numpy.core._multiarray_umath'
    

    While this might be an issue with Numpy (see numpy/numpy#6599), SciPy solved the problem via scipy/scipy#3566.

    opened by mattip 2
  • Allow passing various types of arguments (like numpy)

    Hi, I would like to try your package, but my code keeps throwing errors. First, I do not use the range parameter at all (I have no need for it); that is fine, I can add a few numbers. However, when I run it again I get "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()", since I use bins=[bin_x, bin_y].

    I can in general rewrite my code but my sets of bins are quite sophisticated and it would be a bit painful. Is it possible to update your code somehow?

    Sincerely, V.

    opened by vkhodygo 2
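
    For reference, per-axis bin counts do work in the form shown in the Q&A above (assuming a recent enough version of the package); it is only non-regular bin edges that are out of scope:

    import numpy as np
    from fast_histogram import histogram2d

    x, y = np.random.random(1000), np.random.random(1000)
    h = histogram2d(x, y, range=[[0, 1], [0, 1]], bins=[20, 30])
    print(h.shape)  # (20, 30)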
  • setup.py requires numpy

    The current version of setup.py imports numpy to get its include path. I'm finding this to be a problem when you add fast-histogram as a dependency in your requirements.txt. Even if you also add numpy as a requirement, there is no way to enforce that numpy will have been installed by the time the setup.py in fast-histogram gets run.

    opened by albireox 2
  • Suggestion: memory allocation enhancement

    Hi, Thomas

    First of all, thank you for the package, it's really handy. However, I have a particular problem that might require an additional modification of the code. I have to calculate a lot of histograms (from the trivial case of a single bin up to probably 10^6 bins) using the same grid but passing different sets of data. That means constant memory allocations, whereas it would be possible to allocate once and just keep clearing the array.

    Is it potentially possible to implement this?

    V.

    opened by vkhodygo 0
  • Improve how we generate test cases with hypothesis

    At the moment, the test_1d_compare_with_numpy and test_2d_compare_with_numpy tests are slightly hacky in how they generate test cases. Here's the code for the 1-d case:

    https://github.com/astrofrog/fast-histogram/blob/4dbb898d1ad221587262cfb609ef5ae93bda7715/fast_histogram/tests/test_histogram.py#L17-L39

    What I'm trying to do is generate two arrays x and w which have the same dtype (either 32-bit or 64-bit floats, big or little endian), the same 1D size (sampled between 0 and 200), and values in the range -1000 to 1000. At the moment I generate a 64-bit array, cast it inside the test, then split it into two. It would be cleaner to generate the two arrays directly with the correct dtype, but I can't figure out how to do this.

    @Zac-HD - do you have any suggestions how I might be able to achieve this in a cleaner way?

    question 
    opened by astrofrog 1
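
    One possible approach (a sketch, not the repository's actual tests): draw the dtype and length once inside a composite strategy, then build both arrays from them:

    import numpy as np
    import hypothesis.strategies as st
    from hypothesis.extra.numpy import arrays

    dtypes = st.sampled_from(["<f4", ">f4", "<f8", ">f8"])

    @st.composite
    def two_matched_arrays(draw):
        dtype = np.dtype(draw(dtypes))
        size = draw(st.integers(min_value=0, max_value=200))
        values = st.floats(-1000, 1000, allow_nan=False,
                           width=dtype.itemsize * 8)
        x = draw(arrays(dtype=dtype, shape=size, elements=values))
        w = draw(arrays(dtype=dtype, shape=size, elements=values))
        return x, w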
  • Long doubles make histogram2d fail subtly, maybe add type check?

    I'm experimenting with various optimizations in the pulsar search code in HENDRICS (https://github.com/StingraySoftware/HENDRICS/pull/65). One of my bottlenecks was a 2D histogram operation, and I gave yours a shot. Tested with simulated data, everything worked and histogram2d was no longer a bottleneck. Yay! However... when I used it on real data, the Jupyter notebook gave "dead kernel" messages with no information whatsoever. After spending a day banging my head, I tried casting the longdouble arrays to double, and everything worked again.

    It would help to add a type check, raising a ValueError or something similar if the number type is unsupported (a sketch of such a guard follows below).

    bug 
    opened by matteobachetti 3
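
    A guard of the kind suggested could look like this (a sketch with a hypothetical wrapper name, not the library's actual code):

    import numpy as np
    from fast_histogram import histogram2d

    def checked_histogram2d(x, y, **kwargs):
        # reject dtypes (such as np.longdouble) that the C code mishandles
        for a in (x, y):
            if np.asarray(a).dtype not in (np.float32, np.float64):
                raise ValueError(f"unsupported dtype: {np.asarray(a).dtype}")
        return histogram2d(x, y, **kwargs)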
  • messing around with alternatives

    Hey. I was just playing around with this and was trying to see if there's a way to implement this efficiently with standard libs.

    My usual way to do things like this is using the scipy.sparse.coo_matrix construct.

    import numpy as np
    import scipy.sparse as sp

    def bincount2d(x, y, bins):
        # x and y must be (castable to) integer bin indices in [0, bins)
        return sp.coo_matrix((np.ones(x.shape[0]), (x, y)),
                             shape=(bins, bins), dtype=int)
    

    If the data were scaled so that casting to int put each value in the right bin, this would work.

    
    import numpy as np
    x = np.random.random(10_000_000)
    y = np.random.random(10_000_000)
    
    from fast_histogram import histogram2d
    %timeit _ = histogram2d(x, y, range=[[0, 1], [0, 1]], bins=30)
    

    36.8 ms ± 4.14 ms per loop

    xx = (x * 30)
    yy = (y * 30)
    
    %timeit bincount2d(xx, yy, bins=30)
    

    153 ms ± 4.04 ms per loop

    So your code would "only" be roughly 4x faster than this approach, which is in turn a few times faster than numpy. Unfortunately I cheated and didn't include shifting `xx` so that the data aligns with the bins. I don't think it's possible to make this work without copying the data at least once, which is why I'm giving up on this route.

    Thought it might be of interest, though ...

    opened by amueller 0
  • Add OpenMP

    I think it should be possible to rewrite the loops to include OpenMP pragmas. The final update could be protected with #pragma omp atomic, and you could use default(shared) private(tx, ty, ix, iy) firstprivate(dataptr) or something similar. You'll need to redo the loops, however: you'll need for loops instead of the inner while (or maybe the outer do-while), and the stride part might need to be precalculated in some way.

    help wanted, performance
    opened by henryiii 2
Owner
Thomas Robitaille