Learned NDV estimator

A learned model that estimates the number of distinct values (NDV) of a population from a small sample. The model approximates the maximum likelihood estimate of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper, "Learning to Be a Statistician: Learned Estimator for Number of Distinct Values", for more details.
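
To make the problem setting concrete, here is a minimal sketch that uses only the Python standard library, not this package. The population, its size, and the 1% sampling rate are made-up values for illustration. It shows that the distinct count observed in a small sample can never exceed the population NDV and is typically far below it, which is why an estimator is needed.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100000 rows drawn from at most 5000 possible values.
population = [rng.randrange(5000) for _ in range(100_000)]
true_ndv = len(set(population))

# Uniform 1% sample without replacement.
sample = rng.sample(population, 1000)
sample_ndv = len(set(sample))

# The sample can only contain values that exist in the population,
# so sample_ndv is a lower bound on true_ndv.
print("distinct values in sample:", sample_ndv)
print("distinct values in population:", true_ndv)
```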

How to use

  1. Install the package

    pip install estndv

  2. Import and create an instance

    from estndv import ndvEstimator
    estimator = ndvEstimator()

  3. Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV by:

    ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)

  4. If you have the sample profile, e.g. f=[2,1,1] (f[i] is the number of distinct values that occur exactly i+1 times in the sample), you can estimate the population NDV by:

    ndv = estimator.profile_predict(f=[2,1,1], N=100000)

  5. If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with estimator.sample_predict_batch() or estimator.profile_predict_batch().
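
The profile f=[2,1,1] above is the frequency-of-frequencies of the sample S=[1,1,1,3,5,5,12]: two values occur once (3 and 12), one value occurs twice (5), and one value occurs three times (1). The following is a minimal sketch of that conversion using only the standard library. The helper name sample_profile is hypothetical and is not part of the estndv API.

```python
from collections import Counter


def sample_profile(sample):
    """Return f where f[j-1] is the number of distinct values seen exactly j times.

    Hypothetical helper for illustration; not part of the estndv package.
    """
    counts = Counter(sample)                  # value -> occurrences in the sample
    freq_of_freq = Counter(counts.values())   # j -> number of values seen j times
    max_j = max(freq_of_freq, default=0)
    return [freq_of_freq.get(j, 0) for j in range(1, max_j + 1)]


profile = sample_profile([1, 1, 1, 3, 5, 5, 12])
print(profile)  # [2, 1, 1]
```

A list built this way should be suitable as the f argument of estimator.profile_predict shown earlier, assuming the package uses the same profile convention as the example.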

How to train the NDV estimator

You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. If you still want to train the model from scratch, do the following:

  1. Go to the model_training folder

    cd model_training

  2. Install requirements

    pip install -r requirements.txt

  3. Generate training data. (This uses a lot of memory.)

    python training_data_generation.py

  4. Train model

    python model_training.py

  5. Save the trained PyTorch model parameters in NumPy format. This generates the file model_paras.npy

    python torch2npy.py

  6. Test with your own model parameters by passing the path to your model_paras.npy as a string

    estimator = ndvEstimator(para_path="<your path to model_paras.npy>")

Citation

If you use our work or find it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272–284},
   numpages = {13}
}