Randomisation-based inference in Python based on data resampling and permutation.

Overview

resample

Features

  • Bootstrap samples (ordinary or balanced with optional stratification) from N-D arrays
  • Apply parametric bootstrap (Gaussian, Poisson, gamma, etc.) on samples
  • Compute bootstrap confidence intervals (percentile or BCa) for any estimator
  • Jackknife estimates of bias and variance of any estimator
  • Permutation-based variants of traditional statistical tests (t-test, K-S test, etc.)
  • Tools for working with empirical distributions (CDF, quantile, etc.)
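
To make the "balanced" variant in the feature list concrete, here is a minimal sketch in plain Python. It is not the library's implementation; the function name `balanced_bootstrap` and its arguments are assumptions made for illustration. The idea is to shuffle a pool holding `size` copies of every index and cut it into `size` replicas, so each observation appears exactly `size` times overall.

```python
import random
from collections import Counter


def balanced_bootstrap(sample, size, seed=None):
    """Yield `size` balanced bootstrap replicas of `sample` (illustrative helper)."""
    rng = random.Random(seed)
    n = len(sample)
    # Pool with `size` copies of every index, shuffled once.
    indices = list(range(n)) * size
    rng.shuffle(indices)
    # Cut the shuffled pool into `size` replicas of length n.
    for b in range(size):
        chunk = indices[b * n:(b + 1) * n]
        yield [sample[i] for i in chunk]


data = [1.2, 3.4, 2.2, 5.1, 4.0]
replicas = list(balanced_bootstrap(data, size=100, seed=1))
# Count how often each observation occurs across all replicas.
counts = Counter(x for replica in replicas for x in replica)
```

An ordinary bootstrap would only reach equal counts on average; here they are equal by construction.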

Dependencies

Installation requires only numpy and scipy.

Installation

The latest release can be installed from PyPI:

pip install resample

or using conda:

conda install resample -c conda-forge
Comments
  • To-do list for next major release

    • [x] Estimate statistical uncertainty of p-value from permutation tests
    • [x] Allow users to pass a target precision to the permutation test instead of asking for the number of bootstrap samples
    • [x] Generic permutation test for user-defined test statistics
    • [x] Allow sample arguments in resample, bootstrap, etc.
    • [x] Include USP test (see #89)
    • [x] Document changes and write migration guide
    • [x] Fix the iterative algorithm to find the required number of permutation samples for usp and same_population

    As pointed out on Wikipedia, we should give a confidence interval for the p-value computed in permutation tests.

    With this, one could also implement an automatic mode in which the number of permutations is chosen automatically, so that it is just large enough to determine with near certainty whether the p-value is below or above a given threshold α.

    The p-value confidence interval can be added as another field to the PermutationResult class. To use the automatic mode, the user has to provide alpha instead of size when the test is called, so the option for the test should probably be called size_or_alpha.
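
A rough sketch of the idea, in plain Python and not tied to the library's API: the p-value is a binomial proportion over the random permutations, so a standard error follows from the usual normal approximation. The names `permutation_pvalue` and `decided`, the difference-of-means statistic, and the cut of three standard errors are all assumptions made for this illustration.

```python
import math
import random


def permutation_pvalue(x, y, size=2000, seed=None):
    """Two-sided permutation test for a difference in means (illustrative).

    Returns the p-value estimate and its approximate standard error.
    """
    rng = random.Random(seed)

    def stat(a, b):
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = stat(x, y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(size):
        rng.shuffle(pooled)
        if stat(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    # The +1 counts the observed arrangement itself, so p is never zero.
    p = (hits + 1) / (size + 1)
    # Binomial standard error in the normal approximation.
    error = math.sqrt(p * (1 - p) / (size + 1))
    return p, error


def decided(p, error, alpha, z=3.0):
    """True if p differs from alpha by more than z standard errors."""
    return abs(p - alpha) > z * error


x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [11, 12, 13, 14, 15, 16, 17, 18]
p, error = permutation_pvalue(x, y, size=2000, seed=1)
is_decided = decided(p, error, alpha=0.05)
```

An automatic mode could keep adding permutations until `decided` returns True or a maximum number is reached.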

    opened by HDembinski 16
  • Release 1.0.0

    @HDembinski Couple edits to setup.py before releasing the new version (I can also make you co-owner on PyPI). Let me know if there are any other loose ends you can think of that need to be taken care of before releasing (it's been two years since the last time I did this and need to refresh my memory).

    opened by dsaxton 13
  • Requiring numba?

    The speed of the functions jackknife and bootstrap could be greatly enhanced with numba. While it is possible to have two implementations of each function, one requiring numba and one requiring only numpy, I think it would make sense to require numba. It used to be a pain to install, but nowadays one can just pip install it, and of course it is also available from anaconda.

    I know numba well enough to work on this. Another (less preferred) option would be to write the implementations in C++ and wrap the code for Python with pybind11. I can do that as well, but then the package is no longer pure Python and has to be compiled for every target platform. This is a huge added burden, and the C++ code may not run faster than numba (I did some tests a while ago in which numba beat my C++ implementation).

    opened by HDembinski 13
  • new module `empirical`

    I think there are several tools which fit very nicely in a separate module called empirical.

    • utils.ecdf -> empirical.cdf
    • utils.eqf -> empirical.quantile
    • jackknife.empirical_influence -> empirical.influence
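
For readers unfamiliar with these tools, the following plain-Python sketch shows what an empirical CDF and an empirical quantile function (without interpolation) compute. The names `cdf` and `quantile` merely mirror the proposal above; the signatures and the closure-based design are assumptions, not the package's actual interface.

```python
import bisect
import math


def cdf(sample):
    """Return the empirical CDF of `sample` as a callable (illustrative)."""
    data = sorted(sample)
    n = len(data)

    def ecdf(x):
        # Fraction of observations less than or equal to x.
        return bisect.bisect_right(data, x) / n

    return ecdf


def quantile(sample):
    """Return the empirical quantile function, without interpolation."""
    data = sorted(sample)
    n = len(data)

    def eqf(p):
        if not 0 <= p <= 1:
            raise ValueError("p must be in [0, 1]")
        # Smallest observation whose ECDF value is at least p.
        k = max(math.ceil(p * n) - 1, 0)
        return data[k]

    return eqf


F = cdf([3, 1, 2, 4])
Q = quantile([3, 1, 2, 4])
```

Here `F(2)` is the fraction of observations not exceeding 2, and `Q(0.5)` is the smallest observation `x` with `F(x) >= 0.5`.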

    @dsaxton Sound good?

    opened by HDembinski 11
  • Switch to two-branch `master` + `develop` model

    Most professional repos use a two-branch model with master and develop (Boost, Scikit-HEP...). I think we should switch to that.

    develop:

    • default branch, commits and PRs go there
    • force updates allowed but should be kept to an absolute minimum
    • forks/feature branches are based on develop
    • gets rebased to master if master was updated for a release
    • RTD "latest" build points to develop

    master:

    • never force-updated (can and should be turned off in the settings)
    • only updated on making a release
    • on release, merge develop into master, update version number and changelog, make release
    • special CI jobs that check the release are only triggered on updates to master
    • RTD "stable" build points to master and is shown by default

    The two main advantages of this scheme are that

    • one can have special CI jobs that check releases, which are only triggered on updates to master
    • one can have RTD point to the latest release automatically (using the "stable" build)
    opened by HDembinski 10
  • Support for computations from pre-calculated replicates

    As mentioned in #34, we need a way to allow computation from pre-calculated replicates.

    My original idea was to reuse the existing interface to do this, but documentation and argument types then become a bit ugly, which was also @dsaxton 's concern.

    We need to resolve this before publishing release 1.0, in case it has implications for our interface overhaul (not necessarily the case, but it could have).

    To address the points raised, I drafted a solution here where the interface of the bias function in the jackknife module is left as is. In addition, there is a bias_from_precalculated (name to be refined), which accepts the pre-calculated replicates. Internally, bias calls bias_from_precalculated, of course.

    @dsaxton Would that be a way to go for all functions? We need to introduce X_from_precalculated then for

    • bias
    • bias_corrected
    • variance
    • confidence_interval

    in both the jackknife and bootstrap modules.
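
As a minimal sketch of this split, assuming plain Python lists and the standard jackknife formulas: `bias_from_precalculated` follows the working name above, while `variance_from_precalculated` and the simple `jackknife` helper are hypothetical names added for illustration, not the drafted code.

```python
def jackknife(fn, sample):
    """Leave-one-out estimates of fn on a list `sample` (illustrative)."""
    return [fn(sample[:i] + sample[i + 1:]) for i in range(len(sample))]


def bias_from_precalculated(theta, resampled_thetas):
    """Jackknife bias from the full-sample estimate and the replicates."""
    n = len(resampled_thetas)
    mean_theta = sum(resampled_thetas) / n
    return (n - 1) * (mean_theta - theta)


def variance_from_precalculated(resampled_thetas):
    """Jackknife variance from the replicates alone."""
    n = len(resampled_thetas)
    mean_theta = sum(resampled_thetas) / n
    return (n - 1) / n * sum((t - mean_theta) ** 2 for t in resampled_thetas)


def bias(fn, sample):
    """Unchanged convenience interface that delegates internally."""
    return bias_from_precalculated(fn(sample), jackknife(fn, sample))


def mean(x):
    return sum(x) / len(x)


data = [1.0, 2.0, 3.0, 4.0, 5.0]
thetas = jackknife(mean, data)  # computed once, reused twice
b = bias_from_precalculated(mean(data), thetas)
v = variance_from_precalculated(thetas)
```

The replicates are computed once and shared by both estimates, which is the saving the separate functions are meant to provide.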

    This would be an acceptable solution for me, but whenever I see a common prefix/suffix, I am thinking of namespaces. I think it would be more organized to put these in a separate module, so that one can do

    from resample.jackknife.precalculated import bias # the version in which you pass `theta` and `resampled_thetas`
    

    or

    from resample.jackknife import bias # the version in which you pass `fn` and `sample`
    

    I think to make this work, we need to make jackknife and bootstrap into sub-packages, which then can have a sub-module precalculated. The directory structure would look like this.

    resample
      __init__.py
      jackknife
         __init__.py
         precalculated.py
      bootstrap
         __init__.py
         precalculated.py
    
    opened by HDembinski 10
  • Contribution to PyHEP 2020?

    I took the liberty of submitting an abstract to the PyHEP 2020 virtual workshop on Python data analysis in high energy physics:

    https://indico.cern.ch/event/882824/overview

    I think this is a good place to advertise resample and what can be done with it. The workshop is very popular this year, so I went directly ahead in submitting the abstract without asking, but I can withdraw it if you don't think this is a good idea.

    The presentation will mainly consist of a Jupyter notebook that showcases the capabilities of resample. There will be some general information, too, such as who is working on it, status, plans, etc.

    opened by HDembinski 10
  • Interested in cooperation?

    tl;dr: I am a very experienced developer and a bootstrap expert, care to join forces?

    I am the developer of Boost.Histogram and the maintainer of iminuit. I am a core member of the Scikit-HEP community. I am a senior developer in C++ and Python and a statistical expert. I read Efron and Tibshirani's book "An Introduction to the Bootstrap" and a few others on resampling methods, and I think that resampling has huge potential for my field (particle physics).

    I think that we need a high-quality resampling package in Python, which ideally should become part of scipy at some point. I have been using the balanced bootstrap privately for many years. I have an implementation of the balanced bootstrap in my tool library pyik. I would like to move away from my own implementation, however, because that is not sustainable. I would be happy to support one of the projects that focus entirely on resampling, such as yours. From what I have seen on PyPI, your package seems to be the most advanced in terms of quality and completeness. Are you interested in collaboration?

    Best regards!

    opened by HDembinski 10
  • Add Read the Docs YAML

    Closes https://github.com/dsaxton/resample/issues/37

    I'm not 100% positive this will work, but it seems it should prevent it from using Python 2 for the build.

    https://docs.readthedocs.io/en/stable/config-file/v2.html

    opened by dsaxton 8
  • add bias, bias_corrected, and variance to bootstrap module

    I added a warning to the bias computation with the bootstrap. Wording should be further improved for clarity.

    I discovered that the bootstrap cannot detect biases of the kind (unbiased estimate + constant / N), where N is the number of observations, which the jackknife does detect and remove exactly. @dsaxton You can see it for yourself if you apply the bootstrap bias computation on the test function def bad_mean(x): return (np.sum(x) + 2) / len(x) that I used to test the jackknife. The bootstrap bias considers this function unbiased.

    Somehow the meaning of the two bias computations is different. The jackknife computes the bias of the estimator on the finite sample with respect to the asymptotic limit. The bootstrap bias seems to compute something different.
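
The observation can be reproduced with a few lines of plain Python. This is a sketch using textbook formulas, not the package's functions; `bad_mean` is taken from the description above (with the built-in `sum` in place of `np.sum`), and the helper names `jackknife_bias` and `bootstrap_bias` are assumptions.

```python
import random


def bad_mean(x):
    # Mean with an extra bias term of 2 / N.
    return (sum(x) + 2) / len(x)


def jackknife_bias(fn, sample):
    """Jackknife bias estimate: (n - 1) * (mean of replicas - estimate)."""
    n = len(sample)
    replicas = [fn(sample[:i] + sample[i + 1:]) for i in range(n)]
    return (n - 1) * (sum(replicas) / n - fn(sample))


def bootstrap_bias(fn, sample, size=2000, seed=None):
    """Bootstrap bias estimate: mean of replicas - estimate."""
    rng = random.Random(seed)
    n = len(sample)
    replicas = [fn(rng.choices(sample, k=n)) for _ in range(size)]
    return sum(replicas) / size - fn(sample)


rng = random.Random(1)
data = [rng.gauss(0, 1) for _ in range(20)]
jk = jackknife_bias(bad_mean, data)          # equals 2 / 20 up to rounding
bs = bootstrap_bias(bad_mean, data, seed=2)  # fluctuates around zero
```

Algebraically, the leave-one-out averages differ from the full estimate by 2 / (N (N - 1)), so the jackknife recovers 2 / N exactly, whereas the resampled sums have the same expectation as the original sum and the constant cancels in the bootstrap difference.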

    opened by HDembinski 8
  • How to calculate p-values for my coefficients?

    I am not sure whether this question would be better placed on StackOverflow, but it could also be a documentation issue, so it may fit here. Besides, the code below might also help to explain what I meant in issue #112, so I hope it is fine to post it here on GitHub. I fitted a multinomial logistic regression to my data. Now I would like to get 95% confidence intervals for my beta coefficients and p-values for each of them (and for each class, since multinomial LR yields coefficients for every class). How would I do that with your package?

    Here is some example code (as noted above, it might also help to explain what I described in #112):

    import numpy as np
    from sklearn.datasets import load_iris
    
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    
    from resample.bootstrap import bootstrap
    
    # set random number generator
    rng = np.random.RandomState(42)
    
    # load data
    X,y = load_iris(return_X_y=True)
    
    # implement Pipeline
    # center data and fit multinomial logistic regression
    pipe = Pipeline([('scaler',StandardScaler()),
                      ('lr',LogisticRegression())])
    
    
    pipe.fit(X,y)
    
    # prepare data for bootstrapping with resample package
    # NOTE: This is a little tedious because we have to concatenate X and y first
    # to one array only to separate them again a few steps later within the function
    A = np.concatenate((y.reshape(-1,1),X),axis=1)
    
    # NOTE: Currently, the resample package does not offer to provide any further
    # arguments that are passed to the function besides A. So we have to rely on 
    # global scope (which might be a little bit unsafe?)
    def fit_mlr(A):
        
        X = A[:,1:]
        y = A[:,0]
        pipe.fit(X,y)
    
        return {'coef': pipe._final_estimator.coef_,
                'intercept': pipe._final_estimator.intercept_}
    
    boot_coef = bootstrap(fit_mlr,sample=A,size=10)
    
    
    
    opened by JohannesWiesner 7
  • Precision targets for bootstrap functions

    I worked out how to iterate the permutation tests until a precision target is reached.

    It would be great to implement such a functionality also for the functions

    • bootstrap.bias
    • bootstrap.bias_corrected
    • bootstrap.variance
    • bootstrap.confidence_interval

    This means adding the keywords precision and max_size to the functions and deprecating size (which would act like max_size=size, precision=0). We cannot do this for bootstrap.bootstrap, because we don't know what the user is computing.

    With the keyword return_error (default False) we can optionally return the calculated uncertainty in a backward compatible way.
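
One possible shape for such an interface is sketched below in plain Python. It is an illustration under assumptions, not the planned implementation: `precision` is interpreted as a relative target, the uncertainty of the variance estimate is taken from the fourth central moment of the replicas, and the `batch` keyword is hypothetical.

```python
import math
import random


def variance(fn, sample, precision=0.1, max_size=10000, batch=100,
             return_error=False, seed=None):
    """Bootstrap variance of fn, iterated until a precision target is met."""
    rng = random.Random(seed)
    n = len(sample)
    thetas = []
    while True:
        # Add replicas in batches, never exceeding max_size in total.
        extra = min(batch, max_size - len(thetas))
        thetas.extend(fn(rng.choices(sample, k=n)) for _ in range(extra))
        b = len(thetas)
        mean = sum(thetas) / b
        m2 = sum((t - mean) ** 2 for t in thetas) / b
        m4 = sum((t - mean) ** 4 for t in thetas) / b
        # Approximate standard error of the variance estimate.
        error = math.sqrt(max(m4 - m2 ** 2, 0.0) / b)
        if error <= precision * m2 or b >= max_size:
            break
    return (m2, error) if return_error else m2


def mean(x):
    return sum(x) / len(x)


rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(30)]
var, err = variance(mean, data, precision=0.1, return_error=True, seed=1)
```

With `return_error=False` the function returns a single number, as before, which keeps the change backward compatible.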

    opened by HDembinski 0
  • Add option `threads`?

    One of the first questions I got after the presentation on resample at PyHEP was about parallelization.

    In principle, resampling methods are perfectly parallelizable, assuming that fn is pure (has no side effects). That is generally a reasonable assumption. In Python, there are many ways to parallelize: on your own cores, on a cluster of computers, or in the cloud. Therefore, offering direct access to resample is good, because it allows users to choose their own parallelization scheme.

    For the simple common cases, however, we may want to offer a threads option to our methods, which computes fn on the replicas using threads number of threads on the current computer, to better utilize common multi-core processors. This would be an option for the functions bootstrap and jackknife and those that build on them, e.g. bias and variance. @dsaxton What do you think?
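
A minimal sketch of what a `threads` option could look like, using only the standard library. The function below is a stand-in written for this illustration, not the package's `bootstrap`; its signature is an assumption. Replicas are drawn in the calling thread so that the result does not depend on thread scheduling.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def bootstrap(fn, sample, size=100, threads=1, seed=None):
    """Evaluate fn on `size` bootstrap replicas, optionally with threads."""
    rng = random.Random(seed)
    n = len(sample)
    # Draw all replicas up front so the random stream is used sequentially.
    replicas = [rng.choices(sample, k=n) for _ in range(size)]
    if threads <= 1:
        return [fn(r) for r in replicas]
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map preserves the order of the replicas.
        return list(pool.map(fn, replicas))


def mean(x):
    return sum(x) / len(x)


data = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0]
serial = bootstrap(mean, data, size=50, threads=1, seed=3)
threaded = bootstrap(mean, data, size=50, threads=4, seed=3)
```

In CPython, threads only speed things up when fn releases the global interpreter lock, as many numpy routines do; a pure-Python fn like `mean` above would need processes instead.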

    opened by HDembinski 4
  • Implement "jackknife-after-bootstrap"

    The jackknife-after-bootstrap method, as described in Efron and Tibshirani's book, is a clever way to compute an uncertainty for a bootstrap estimate, without computing additional replicas. It needs a bit of additional book-keeping, so it does not come for free, but it is a vast improvement over doing a full jackknife after the bootstrap.

    We could add this as a keyword option in resample.bootstrap.bootstrap, or have a separate resample.bootstrap.jackknife_after_bootstrap function. I am leaning slightly towards the latter.
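
The book-keeping can be sketched in plain Python as follows. This is an illustration under assumptions, not a proposed implementation: the quantity of interest is taken to be the bootstrap standard error of fn, the function name follows the suggestion above but its signature is hypothetical, and `size` is assumed large enough that every observation is missing from at least two replicas.

```python
import math
import random


def jackknife_after_bootstrap(fn, sample, size=1000, seed=None):
    """Bootstrap standard error of fn and its jackknife-after-bootstrap error."""
    rng = random.Random(seed)
    n = len(sample)

    def std(values):
        m = sum(values) / len(values)
        return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

    thetas = []
    used = []  # book-keeping: which observations each replica contains
    for _ in range(size):
        idx = [rng.randrange(n) for _ in range(n)]
        used.append(set(idx))
        thetas.append(fn([sample[i] for i in idx]))

    se = std(thetas)
    # For each observation i, reuse only the replicas that do not contain i.
    se_without = []
    for i in range(n):
        subset = [t for t, u in zip(thetas, used) if i not in u]
        se_without.append(std(subset))
    # Jackknife formula applied to the leave-one-out standard errors.
    mean_se = sum(se_without) / n
    se_error = math.sqrt((n - 1) / n * sum((s - mean_se) ** 2 for s in se_without))
    return se, se_error


def mean(x):
    return sum(x) / len(x)


rng = random.Random(0)
data = [rng.gauss(0, 1) for _ in range(20)]
se, se_error = jackknife_after_bootstrap(mean, data, size=1000, seed=1)
```

No replicas beyond the original `size` are computed; the extra cost is storing one index set per replica.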

    opened by HDembinski 2
  • Need a way to pass precalculated replicas

    I started to use the library in practice and I found a caveat of our current design. Let's say I want to compute a bias correction and the variance of an estimator. If I naively call resample.jackknife.variance and resample.jackknife.bias_corrected, it computes the jackknife estimates twice (which is expensive). The interface should allow me to reuse precomputed jackknife estimates (I am talking about the jackknife but the same is true for the bootstrap).

    I am not sure yet how to best achieve this. Here is my idea so far.

    Currently, we have in resample.jackknife the signature def variance(fn, sample). It expects two mandatory arguments and I think that should not change. However, we could make it so that if one passes None for fn, then sample is interpreted as the precomputed replicas. This is not ambiguous, because fn is never None under normal circumstances.

    This approach works for all jackknife tools, but resample.bootstrap.confidence_interval adds further complications. More precisely, when the "student" and "bca" methods are used, the baseline idea does not work. The "student" method also needs fn(sample) in addition to the replicas, and "bca" also needs fn(sample) and jackknife replicas on top.

    I think the basic idea can still work, if we make the call to confidence_interval like this

    thetas = bootstrap(my_fn, data)
    theta = my_fn(data)
    j_thetas = jackknife(my_fn, data)
    confidence_interval(None, thetas, ci_method="percentile") # ok, works
    confidence_interval(None, (thetas, theta), ci_method="student") # ok, additional information passed as tuple
    confidence_interval(None, (thetas, theta, j_thetas), ci_method="bca") # ok, same
    

    Any thoughts?

    question 
    opened by HDembinski 4
Releases(v1.5.3)
  • v1.5.3(Dec 8, 2022)

  • v1.5.2(Oct 15, 2022)

    What's Changed

    • Update CI scripts and fix types #154
    • bootstrap.resample now works with method="extended" when input is multi-dimensional #153

    Full Changelog: https://github.com/scikit-hep/resample/compare/v1.5.1...v1.5.2

  • v1.5.0-beta(Jan 31, 2022)

  • v1.0.1(Oct 28, 2020)

    1.0.1 (August 23, 2020)

    • Minor fix to allow building from source.

    1.0.0 (August 22, 2020)

    API Changes

    • Bootstrap and jackknife generators resample.bootstrap.resample and resample.jackknife.resample are now exposed to compute replicates lazily.
    • Jackknife functions have been split into their own namespace resample.jackknife.
    • Empirical distribution helper functions moved to a resample.empirical namespace.
    • Random number seeding is now done through using numpy generators rather than a global random state. As a result the minimum numpy version is now 1.17.
    • Parametric bootstrap now estimates both parameters of the t distribution.
    • Default confidence interval method changed from "percentile" to "bca".
    • Empirical quantile function no longer performs interpolation between quantiles.

    Enhancements

    • Added bootstrap estimate of bias.
    • Added bias_corrected function for jackknife and bootstrap, which computes the bias corrected estimates.
    • Performance of jackknife computation was increased.

    Bug fixes

    • Removed incorrect implementation of Studentized bootstrap.

    Deprecations

    • Smoothing of bootstrap samples is no longer supported.
    • Supremum norm and MISE functionals removed.

    Other

    • Benchmarks were added to track and compare performance of bootstrap and jackknife methods.