Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences

Last update: Dec 20, 2022

Related tags

Data Analysis weather finance data-science machine-learning statistics climate dependency-analysis xarray data-generation data-generator copula principal-component-analysis data-augmentation augmentation oversampling synthetic-data fpca data-modelling dependency-modeling functional-data

Overview

Synthetic data need to preserve the statistical properties of real data in terms of their individual behavior and (inter-)dependences. Copulas and functional Principal Component Analysis (fPCA) are statistical models that allow these properties to be simulated (Joe 2014). As such, copula-generated data have shown potential to improve the generalization of machine learning (ML) emulators (Meyer et al. 2021) or to anonymize real datasets (Patki et al. 2016).

Synthia is an open source Python package to model univariate and multivariate data, parameterize data using empirical and parametric methods, and manipulate marginal distributions. It is designed to enable scientists and practitioners to handle labelled multivariate data typical of computational sciences. For example, given some vertical profiles of atmospheric temperature, we can use Synthia to generate new but statistically similar profiles in just three lines of code (Table 1).

Synthia supports three methods of multivariate data generation: (i) fPCA, (ii) parametric (Gaussian) copula, and (iii) vine copula models, for continuous (all), discrete (vine), and categorical (vine) variables. It has a simple and succinct API that natively handles xarray's labelled arrays and datasets. It uses a pure Python implementation for fPCA and the Gaussian copula, and relies on the well-tested C++ library vinecopulib, through pyvinecopulib's bindings, for fast and efficient computation of vines. For more information, please see the website at https://dmey.github.io/synthia.
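To give a feel for what a Gaussian copula with empirical marginals does, here is a minimal two-variable sketch that uses only the Python standard library. It is not Synthia's implementation and does not use Synthia's API: the function names (`normal_scores`, `fit_gaussian_copula`, `generate`) and the toy data are assumptions made for illustration. The idea is to rank-transform each variable to normal scores, estimate their correlation, draw correlated normal pairs, and map them back through the empirical quantiles of the real data.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)


def normal_scores(values):
    """Rank-transform values to (0, 1), then map them to standard normal scores."""
    n = len(values)
    order = sorted(range(n), key=values.__getitem__)
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = STD_NORMAL.inv_cdf(rank / (n + 1))
    return scores


def fit_gaussian_copula(x, y):
    """Estimate the copula correlation from the normal scores of x and y."""
    return pearson(normal_scores(x), normal_scores(y))


def empirical_quantile(sorted_values, u):
    """Invert the empirical CDF: return the order statistic at level u."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def generate(x, y, rho, n_samples, seed=None):
    """Draw synthetic (x, y) pairs with dependence rho and the empirical marginals."""
    rng = random.Random(seed)
    xs, ys = sorted(x), sorted(y)
    scale = math.sqrt(max(0.0, 1.0 - rho * rho))
    samples = []
    for _ in range(n_samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + scale * rng.gauss(0.0, 1.0)  # correlated with z1
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        samples.append((empirical_quantile(xs, u1), empirical_quantile(ys, u2)))
    return samples


# Toy "real" data: a skewed variable and a noisy copy of it.
toy_rng = random.Random(0)
real_x = [toy_rng.expovariate(1.0) for _ in range(500)]
real_y = [xi + toy_rng.gauss(0.0, 0.3) for xi in real_x]

rho = fit_gaussian_copula(real_x, real_y)
synthetic = generate(real_x, real_y, rho, n_samples=1000, seed=1)
```

Because the marginals are inverted empirically, every synthetic value in this sketch is one of the observed values; only the pairing of values is new. A parametric marginal would be needed to generate values outside the observed range.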

Table 1. Example application of Gaussian and fPCA classes in Synthia. These are used to generate random profiles of atmospheric temperature similar to those included in the source data. The xarray dataset structure is maintained and returned by Synthia.

Source	Synthetic with Gaussian Copula	Synthetic with fPCA
`ds = syn.util.load_dataset()`	`g = syn.CopulaDataGenerator()`	`g = syn.fPCADataGenerator()`
	`g.fit(ds, syn.GaussianCopula())`	`g.fit(ds)`
	`g.generate(n_samples=500)`	`g.generate(n_samples=500)`

Documentation

For installation instructions, getting started guides and tutorials, background information, and API reference summaries, please see the website.

How to cite

If you are using Synthia, please cite the following two papers using their respective Digital Object Identifiers (DOIs). Citations may be generated automatically using Crosscite's DOI Citation Formatter or from the BibTeX entries below.

Synthia Software	Software Application
DOI: 10.21105/joss.02863	DOI: 10.5194/gmd-14-5205-2021

@article{Meyer_and_Nagler_2021,
  doi = {10.21105/joss.02863},
  url = {https://doi.org/10.21105/joss.02863},
  year = {2021},
  publisher = {The Open Journal},
  volume = {6},
  number = {65},
  pages = {2863},
  author = {David Meyer and Thomas Nagler},
  title = {Synthia: multidimensional synthetic data generation in Python},
  journal = {Journal of Open Source Software}
}

@article{Meyer_and_Nagler_and_Hogan_2021,
  doi = {10.5194/gmd-14-5205-2021},
  url = {https://doi.org/10.5194/gmd-14-5205-2021},
  year = {2021},
  publisher = {Copernicus {GmbH}},
  volume = {14},
  number = {8},
  pages = {5205--5215},
  author = {David Meyer and Thomas Nagler and Robin J. Hogan},
  title = {Copula-based synthetic data augmentation for machine-learning emulators},
  journal = {Geoscientific Model Development}
}

If needed, you may also cite the specific software version with its corresponding Zenodo DOI.

Contributing

If you are looking to contribute, please read our Contributors' guide for details.

Development notes

If you would like to know more about specific development guidelines, testing and deployment, please refer to our development notes.

Copyright and license

Acknowledgements

Special thanks to @letmaik for his suggestions and contributions to the project.

Comments

Explain how to run the test suite
Describe the bug

There is a test suite, but the documentation does not explain how to run it.

Here is what works for me:

Install pytest.

Clone the source repository.

Run pytest in the root directory of the repository.
opened by khinsen 7
Review: Copula distribution usage and examples

Your package offers support for simulating vine copulas. However, I don't see examples demonstrating how to simulate data from a vine copula given desired conditional dependency requirements.

Is this possible with the current API? If not, how would I use the vine copula generator to achieve this?

Otherwise, can examples show the difference between simulating Gaussian and vine copulas? I only see examples for the Gaussian copula.

opened by mnarayan 5
fPCA documentation
Describe the bug

The documentation page on fPCA says:

PCA can be used to generate synthetic data for the high-dimensional vector $X$. For every instance $X_i$ in the data set, we compute the principal component scores $a_{i, 1}, \dots, a_{i, K}$. Because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated and we may treat them as independent.

The claim that "because the principal components $v_1, \dots, v_K$ are orthogonal, the scores are necessarily uncorrelated" looks wrong to me. These scores are projections of the $X_i$ onto the elements of an orthonormal basis. That doesn't make them uncorrelated. There are lots of orthonormal bases one can project on, and for most of them the projections are not uncorrelated. You need some property of the distribution of $X$ to derive a zero correlation, for example a Gaussian distribution, for which the PCA basis yields approximately uncorrelated projections.
opened by khinsen 3
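The point under discussion can be checked numerically. The sketch below is a two-dimensional toy example using only the standard library; it is not taken from Synthia, and the helper names and data are assumptions. It projects skewed, non-Gaussian data onto the PCA basis (the eigenvectors of the sample covariance matrix, obtained here in closed form for the 2x2 case) and onto a second orthonormal basis rotated by 30 degrees. In this sketch the sample covariance of the PCA scores vanishes up to rounding, while the projections on the rotated basis stay correlated, so orthogonality of the basis alone is not what removes the correlation.

```python
import math
import random


def covariance(a, b):
    """Unbiased sample covariance of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (n - 1)


def project(x, y, angle):
    """Scores of the centred points on the orthonormal basis rotated by `angle`."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    c, s = math.cos(angle), math.sin(angle)
    first = [(xi - mean_x) * c + (yi - mean_y) * s for xi, yi in zip(x, y)]
    second = [-(xi - mean_x) * s + (yi - mean_y) * c for xi, yi in zip(x, y)]
    return first, second


# Skewed, clearly non-Gaussian toy data with correlated components.
rng = random.Random(42)
x = [rng.expovariate(1.0) for _ in range(2000)]
y = [0.5 * xi + rng.expovariate(2.0) for xi in x]

# For a 2x2 covariance matrix, the principal axes are given by this angle.
sxx, syy, sxy = covariance(x, x), covariance(y, y), covariance(x, y)
pca_angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)

pca_cov = covariance(*project(x, y, pca_angle))
other_cov = covariance(*project(x, y, pca_angle + math.radians(30)))

print(f"covariance of PCA scores:         {pca_cov:.2e}")
print(f"covariance on a rotated basis:    {other_cov:.2e}")
```

Zero sample correlation does not imply independence, so treating the scores as independent remains a separate modelling assumption.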
Review: Clarify API

It would be helpful to explain what the different classes (Data Generators, Parametrizers, Transformers) do somewhere in the introduction or usage section of the documentation. Explain what each class is supposed to do. If the API is similar to or inspired by a well-known API of a different package, please point to it.

I think generators and transformers are obvious but I only sort of understand Parametrizers. It is also confusing in the sense that people might think this has something to do with parametric distributions when you mean it to be something different.

Is this API for Parametrizers inspired by some convention elsewhere? If so, it would be helpful to point to that. For instance, the generators are very similar to statsmodels generators.

opened by mnarayan 2
Small error in docs

Hi, just letting you know I noticed a small error in the documentation.

At the bottom of this page https://dmey.github.io/synthia/examples/fpca.html

The error is in line [6] of the code, under "Plot the results".

You have: plot_profiles(ds_true, 'temperature_fl')

But I believe it should be: plot_profiles(ds_synth, 'temperature_fl')

You want to plot the results, not the original, here.

Cheers & thanks for the cool project!

opened by BigTuna08 1
Review: Comparisons to other common packages

What are other packages people might use to simulate data (e.g. statsmodels comes to mind) and how is this package different? Your package supports generating data for multivariate copula distributions and via fPCA. I understand what this entails but I think this could use further elaboration.

This package supports nonparametric distributions much more than the typical parametric data generators found in common packages and it would be useful to highlight these explicitly.

opened by mnarayan 1
Support categorical data for pyvinecopulib

During fitting, category values are reindexed as integers starting from 0 and transformed to one-hot vectors. The opposite happens during generation. Any data type works for categories, including strings.

opened by letmaik 0
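As an illustration of the encoding described above, the following standard-library sketch reindexes category values as integers starting from 0, turns them into one-hot vectors, and reverses the process. It is not Synthia's code: the function names are hypothetical, and decoding by taking the position of the largest entry is an assumption about how generated vectors could be mapped back to categories.

```python
def fit_categories(values):
    """List the distinct categories in order of first appearance.

    The position of a category in the returned list is its integer index,
    starting from 0. Any hashable type works, including strings.
    """
    return list(dict.fromkeys(values))


def encode(values, categories):
    """Turn each value into a one-hot vector over the fitted categories."""
    index = {category: i for i, category in enumerate(categories)}
    width = len(categories)
    return [[1 if index[v] == j else 0 for j in range(width)] for v in values]


def decode(rows, categories):
    """Map each vector back to the category at the position of its largest entry."""
    return [categories[max(range(len(row)), key=row.__getitem__)] for row in rows]


weather = ["rain", "sun", "rain", "snow"]
categories = fit_categories(weather)
encoded = encode(weather, categories)
decoded = decode(encoded, categories)
```

One-hot encoding removes the artificial ordering that plain integer codes would otherwise impose on the categories.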

Add support for categorical data

We can treat categorical data as discrete, but first we need to pre-process categorical values by one-hot encoding to remove the order. Regarding the API, we can change the current version from

# Assuming an xarray dataset ds with X1 discrete and X2 categorical
generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})

to something like

# with X3 continuous
g.fit(ds, copula=syn.VineCopula(controls=ctrl), types={'X1': 'disc', 'X2': 'cat', 'X3': 'cont'})

opened by dmey 0

Add support for handling discrete quantities
Introduces the option to specify and model discrete quantities as follows:

# Assuming an xarray dataset ds with X1 discrete and X2 continuous
generator.fit(ds, copula=syn.VineCopula(controls=ctrl), is_discrete={'X1': True, 'X2': False})

This option is only supported for vine copulas.
opened by dmey 0

Releases(1.1.0)

1.1.0(Sep 1, 2021)
Pin pyvinecopulib version to avoid issues between versions.

Add CI tests for Python 3.9 (#17).

Minor doc improvements.

1.0.0(Apr 19, 2021)
Add JOSS summary paper (#26).

Improve docs and tutorials (#14, #13, #18, ...).

Enable CI on multiple OS and Python versions (#16).

0.3.0(Nov 12, 2020)
Add support for handling categorical quantities (#10, #13).

0.2.0(Nov 11, 2020)
Add support for handling discrete quantities (#9).

Add support for setting a seed when generating new samples (#11).

Drop support for Python 3.7.

0.1.1(Oct 24, 2020)
Fix qrng argument for pyvinecopulib due to vinecopulib/pyvinecopulib#68 and vinecopulib/pyvinecopulib#69.

0.1.0(Oct 22, 2020)
First public release.


Owner

GitHub https://dmey.github.io/synthia

