# GoodPoints
A Python package for generating concise, high-quality summaries of a probability distribution
GoodPoints is a collection of tools for compressing a distribution more effectively than independent sampling:
- Given an initial summary of n input points, kernel thinning returns s << n output points with comparable integration error across a reproducing kernel Hilbert space
- Compress++ reduces the runtime of generic thinning algorithms with minimal loss in accuracy
## Installation

To install the `goodpoints` package, use the following pip command:

```
pip install goodpoints
```
## Getting started

The primary kernel thinning function is `thin` in the `kt` module:
```python
from goodpoints import kt
coreset = kt.thin(X, m, split_kernel, swap_kernel, delta=0.5, seed=123, store_K=False)
"""Returns kernel thinning coreset of size floor(n/2^m) as row indices into X

Args:
  X: Input sequence of sample points with shape (n, d)
  m: Number of halving rounds
  split_kernel: Kernel function used by KT-SPLIT (typically a square-root kernel, krt);
    split_kernel(y, X) returns an array of kernel evaluations between y and each row of X
  swap_kernel: Kernel function used by KT-SWAP (typically the target kernel, k);
    swap_kernel(y, X) returns an array of kernel evaluations between y and each row of X
  delta: Run KT-SPLIT with constant failure probabilities delta_i = delta/n
  seed: Random seed to set prior to generation; if None, no seed will be set
  store_K: If False, runs the O(nd)-space version, which does not store the kernel
    matrix; if True, stores the n x n kernel matrix
"""
```
For example uses, please refer to the notebook `examples/kt/run_kt_experiment.ipynb`.
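As a quick illustration, the sketch below thins a Gaussian sample with a Gaussian target kernel. The data, the bandwidth `sigma`, and the kernel definitions are assumptions made for the sketch, not recommendations; up to scaling, the square root of a Gaussian kernel with squared bandwidth `sigma**2` is a Gaussian kernel with squared bandwidth `sigma**2 / 2`.

```python
import numpy as np
from goodpoints import kt

# Illustrative input: n = 2^14 points drawn i.i.d. from a 2D standard Gaussian
rng = np.random.default_rng(0)
n, d = 2**14, 2
X = rng.standard_normal((n, d))

# Gaussian target kernel k(y, x) = exp(-||y - x||^2 / (2 sigma^2)); its
# square-root kernel is, up to scaling, Gaussian with half the squared bandwidth
sigma = 1.0
def swap_kernel(y, X):
    return np.exp(-np.sum((X - y) ** 2, axis=1) / (2 * sigma**2))
def split_kernel(y, X):
    return np.exp(-np.sum((X - y) ** 2, axis=1) / sigma**2)

# m = 7 halving rounds thin n = 16384 points down to floor(n / 2^7) = 128
coreset = kt.thin(X, 7, split_kernel, swap_kernel, delta=0.5, seed=123)
print(X[coreset].shape)  # (128, 2)
```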
The primary Compress++ function is `compresspp` in the `compress` module:
```python
from goodpoints import compress
coreset = compress.compresspp(X, halve, thin, g)
"""Returns Compress++(g) coreset of size sqrt(n) as row indices into X

Args:
  X: Input sequence of sample points with shape (n, d)
  halve: Function that takes in an (n', d) numpy array Y and returns
    floor(n'/2) distinct row indices into Y, identifying a halved coreset
  thin: Function that takes in an (n', d) numpy array Y and returns
    2^g sqrt(n') row indices into Y, identifying a thinned coreset
  g: Oversampling factor
"""
```
For example uses, please refer to the code in `examples/compress/construct_compresspp_coresets.py`.
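The `halve` and `thin` arguments are arbitrary callables satisfying the contracts above. The sketch below wires in trivial standard-thinning stand-ins purely to illustrate the calling convention; real experiments substitute kernel-thinning-based routines, as in the example script above.

```python
import numpy as np
from goodpoints import compress

g = 0  # oversampling factor; a hypothetical choice for this sketch

def halve(Y):
    # Stand-in halving: keep every other point, floor(n'/2) distinct indices
    return np.arange(len(Y))[::2][: len(Y) // 2]

def thin(Y):
    # Stand-in thinning: 2^g * sqrt(n') evenly spaced indices
    out_size = int(2**g * np.sqrt(len(Y)))
    return np.linspace(0, len(Y) - 1, num=out_size).astype(int)

# n chosen as a power of 4 so each recursive halving is exact
n, d = 2**14, 2
X = np.random.default_rng(1).standard_normal((n, d))
coreset = compress.compresspp(X, halve, thin, g)
print(len(coreset))  # sqrt(n) = 128
```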
## Examples

Code in the `examples` directory uses the `goodpoints` package to recreate the experiments of the following research papers.
### Kernel Thinning

```bibtex
@article{dwivedi2021kernel,
  title={Kernel Thinning},
  author={Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2105.05842},
  year={2021}
}
```
- The script `examples/kt/submit_jobs_run_kt.py` reproduces the vignette experiments of Kernel Thinning on a Slurm cluster by executing `examples/kt/run_kt_experiment.ipynb` with appropriate parameters. For the MCMC examples, it assumes that the necessary data was downloaded and pre-processed following the steps listed in `examples/kt/preprocess_mcmc_data.ipynb`, whose last code block reports the median-heuristic-based bandwidth parameters along with the code to compute them (a generic sketch follows this list).
- After all results have been generated, the notebook `plot_results.ipynb` can be used to reproduce the figures of Kernel Thinning.
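For reference, a common form of the median heuristic sets the squared kernel bandwidth to the median pairwise squared distance among (a subsample of) the sample points. This is a generic sketch of that heuristic, not the notebook's exact code:

```python
import numpy as np
from scipy.spatial.distance import pdist

def median_heuristic(X):
    # Median of the pairwise squared distances among the rows of X;
    # for large n, apply this to a random subsample of the rows
    return np.median(pdist(X, metric="sqeuclidean"))
```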
### Generalized Kernel Thinning

```bibtex
@article{dwivedi2021generalized,
  title={Generalized Kernel Thinning},
  author={Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2110.01593},
  year={2021}
}
```
- The script `examples/gkt/submit_gkt_jobs.py` reproduces the vignette experiments of Generalized Kernel Thinning on a Slurm cluster by executing `examples/gkt/run_generalized_kt_experiment.ipynb` with appropriate parameters. For the MCMC examples, it assumes that the necessary data was downloaded and pre-processed following the steps listed in `examples/kt/preprocess_mcmc_data.ipynb`.
- Once the coresets are generated, `examples/gkt/compute_test_function_errors.ipynb` can be used to compute integration errors for different test functions (see the sketch after this list).
- After all results have been generated, the notebook `examples/gkt/plot_gkt_results.ipynb` can be used to reproduce the figures of Generalized Kernel Thinning.
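As a minimal illustration of the integration-error step (the test function `f` is a hypothetical choice, and `X` and `coreset` are as in the kernel thinning sketch above), a common proxy compares the coreset mean of a test function against the full-sample mean:

```python
import numpy as np

# Hypothetical test function: sum of coordinatewise sines
f = lambda x: np.sin(x).sum(axis=1)

# Integration error proxy: |coreset mean of f - full-sample mean of f|
err = np.abs(f(X[coreset]).mean() - f(X).mean())
```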
### Distribution Compression in Near-linear Time

```bibtex
@article{shetty2021distribution,
  title={Distribution Compression in Near-linear Time},
  author={Abhishek Shetty and Raaz Dwivedi and Lester Mackey},
  journal={arXiv preprint arXiv:2111.07941},
  year={2021}
}
```
- The notebook `examples/compress/script_to_deploy_jobs.ipynb` reproduces the experiments of Distribution Compression in Near-linear Time in the following manner:
  - 1a. It generates various coresets and computes their MMDs by executing `examples/compress/construct_{THIN}_coresets.py` for `THIN` in `{compresspp, kt, st, herding}` with appropriate parameters, where the flag `kt` stands for kernel thinning, `st` stands for standard thinning (choosing every t-th point), and `herding` refers to kernel herding.
  - 1b. It computes the runtimes of the different algorithms by executing `examples/compress/run_time.py`.
  - 1c. For the MCMC examples, it assumes that the necessary data was downloaded and pre-processed following the steps listed in `examples/kt/preprocess_mcmc_data.ipynb`.
  - 1d. The notebook currently deploys these jobs on a Slurm cluster, but setting `deploy_slurm = False` in `examples/compress/script_to_deploy_jobs.ipynb` will submit the jobs as independent Python calls on the terminal.
- After all results have been generated, the notebook `examples/compress/plot_compress_results.ipynb` can be used to reproduce the figures of Distribution Compression in Near-linear Time.
- The script `examples/compress/construct_compresspp_coresets.py` contains the function `recursive_halving` that converts a halving algorithm into a thinning algorithm by recursively halving (see the sketch after this list).
- The script `examples/compress/construct_herding_coresets.py` contains the `herding` function that runs the kernel herding algorithm introduced by Yutian Chen, Max Welling, and Alex Smola (also sketched below).
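For intuition, here is a sketch of the recursive-halving idea under the `halve` contract above; it is an illustration, not the repository's exact implementation:

```python
import numpy as np

def recursive_halving(Y, size, halve):
    # Apply halve repeatedly, re-indexing into the original rows of Y,
    # until at most `size` points remain
    indices = np.arange(len(Y))
    while len(indices) > size:
        indices = indices[halve(Y[indices])]
    return indices
```

And a sketch of sample-based kernel herding in the spirit of Chen, Welling, and Smola, greedily matching the empirical mean embedding given a precomputed kernel matrix `K` (again illustrative, not the script's exact code):

```python
import numpy as np

def herding(K, size):
    # K: (n, n) kernel matrix over the candidate points; greedily select the
    # candidate maximizing the mean embedding minus the running coreset average
    mean_embedding = K.mean(axis=1)
    running_sum = np.zeros(len(K))
    selected = []
    for t in range(size):
        idx = int(np.argmax(mean_embedding - running_sum / (t + 1)))
        selected.append(idx)
        running_sum += K[idx]
    return np.array(selected)
```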
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.