Learned model to estimate number of distinct values (NDV) of a population using a small sample.

Last update: Nov 21, 2022

Overview

Learned NDV estimator

Learned model to estimate number of distinct values (NDV) of a population using a small sample. The model approximates the maximum likelihood estimation of NDV, which is difficult to obtain analytically. See our VLDB 2022 paper Learning to be a Statistician: Learned Estimator for Number of Distinct Values for more details.

How to use

Install the package

pip install estndv
Import and create an instance

   from estndv import ndvEstimator
   estimator = ndvEstimator()

Assume your sample is S=[1,1,1,3,5,5,12] and the population size is N=100000. You can estimate the population NDV by:

ndv = estimator.sample_predict(S=[1,1,1,3,5,5,12], N=100000)
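If you already have the sample profile f, where f[j-1] is the number of distinct values that appear exactly j times in the sample (f=[2,1,1] for the sample above), you can estimate the population NDV by: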

ndv = estimator.profile_predict(f=[2,1,1], N=100000)
If you have multiple samples or profiles from multiple populations, you can estimate the population NDV for all of them in one batch with estimator.sample_predict_batch() or estimator.profile_predict_batch().
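
The relationship between a raw sample and its profile can be sketched as follows. This is a minimal illustration, not part of the estndv package: the helper name sample_profile is hypothetical, and the profile definition (entry j counts the distinct values seen exactly j times) is inferred from the example S=[1,1,1,3,5,5,12] and f=[2,1,1] above.

```python
from collections import Counter


def sample_profile(sample):
    """Hypothetical helper (not an estndv function).

    Returns a list f where f[j-1] is the number of distinct values
    that occur exactly j times in the sample.
    """
    value_counts = Counter(sample)                  # value -> occurrences in the sample
    freq_of_freq = Counter(value_counts.values())   # j -> number of values seen j times
    max_freq = max(freq_of_freq, default=0)         # 0 for an empty sample
    return [freq_of_freq.get(j, 0) for j in range(1, max_freq + 1)]


S = [1, 1, 1, 3, 5, 5, 12]
f = sample_profile(S)
print(f)  # [2, 1, 1]: two values seen once, one seen twice, one seen three times
```

Under that assumption, the resulting list could be passed as the f argument of estimator.profile_predict() in place of the raw sample.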

How to train the ndv estimator

You can use our package from PyPI directly on your datasets, because the pre-trained model is workload-agnostic. However, if you still want to train the model from scratch, do the following:

Go to the model_training folder

cd model_training
Install requirements

pip install -r requirements.txt
Generate training data. (This uses a lot of memory.)

python training_data_generation.py
Train model

python model_training.py
Save the trained PyTorch model parameters in NumPy format. This generates the file model_paras.npy

python torch2npy.py
Test with your model parameters by specifying a path to your model_paras.npy

estimator = ndvEstimator(para_path="<your path to model_paras.npy>")

Citation

If you use our work or find it useful, please cite our paper:

@article{wu2022learning,
   author = {Wu, Renzhi and Ding, Bolin and Chu, Xu and Wei, Zhewei and Dai, Xiening and Guan, Tao and Zhou, Jingren},
   title = {Learning to Be a Statistician: Learned Estimator for Number of Distinct Values},
   year = {2021},
   issue_date = {October 2021},
   publisher = {VLDB Endowment},
   volume = {15},
   number = {2},
   issn = {2150-8097},
   url = {https://doi.org/10.14778/3489496.3489508},
   doi = {10.14778/3489496.3489508},
   journal = {Proc. VLDB Endow.},
   month = {oct},
   pages = {272–284},
   numpages = {13}
}
