Metrics to evaluate quality and efficacy of synthetic datasets.

An Open Source Project from the Data to AI Lab, at MIT

Metrics for Synthetic Data Generation Projects

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

  • Single Columns: Compare 1-dimensional numpy arrays representing individual columns (see the example after this list).
  • Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
  • Single Table: Compare an entire table, represented as a pandas.DataFrame.
  • Multi Table: Compare multi-table and relational datasets represented as a python dict with multiple tables passed as pandas.DataFrames.
  • Time Series: Compare tables representing ordered sequences of events.
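
For example, a single-column metric can be applied directly to two individual columns. Below is a minimal sketch using BoundaryAdherence (the column values here are made up; exact metric availability depends on your SDMetrics version):

import pandas as pd

from sdmetrics.single_column import BoundaryAdherence

real_column = pd.Series([1.0, 2.5, 3.0, 4.2])
synthetic_column = pd.Series([1.5, 2.0, 3.5, 3.9])

# Fraction of synthetic values that fall within the min/max
# bounds observed in the real column.
score = BoundaryAdherence.compute(real_column, synthetic_column)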

It includes a variety of metrics such as:

  • Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
  • Detection metrics which use machine learning to try to distinguish between real and synthetic data (see the schematic sketch after this list).
  • Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
  • Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
  • Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
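
As a rough illustration of how the detection family works, here is a schematic sketch using scikit-learn. This is not SDMetrics' internal implementation, just the general idea: train a classifier to separate real rows from synthetic rows and score how close it stays to chance level. It assumes purely numeric DataFrames.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def detection_score(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Return 1.0 when a classifier cannot tell real rows from synthetic ones."""
    data = pd.concat([real, synthetic], ignore_index=True)
    labels = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    auc = cross_val_score(
        LogisticRegression(max_iter=1000), data, labels,
        cv=3, scoring='roc_auc',
    ).mean()
    # Map an AUC in [0.5, 1.0] onto a score in [0, 1], where 1 is best.
    return 1.0 - 2.0 * max(auc - 0.5, 0.0)

The same scheme works with any scikit-learn classifier, which is what the "stronger detection classifiers" request in the comments below is about.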

Install

SDMetrics is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, SDMetrics can also be installed as a standalone library using the following commands:

Using pip:

pip install sdmetrics

Using conda:

conda install -c sdv-dev -c conda-forge -c pytorch sdmetrics

For more installation options, please visit the SDMetrics Installation Guide.

Usage

SDMetrics is included as part of the framework offered by SDV to evaluate the quality of your synthetic dataset. For more details about how to use it, please visit the corresponding User Guide.

Standalone usage

SDMetrics can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its synthetic replica by running all the compatible multi-table metrics on it:

import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)

The output will be a table with all the details about the executed metrics and their score:

metric                         name                                      score     min_value  max_value  goal
CSTest                         Chi-Squared                               0.76651   0          1          MAXIMIZE
KSTest                         Inverted Kolmogorov-Smirnov D statistic   0.75      0          1          MAXIMIZE
KSTestExtended                 Inverted Kolmogorov-Smirnov D statistic   0.777778  0          1          MAXIMIZE
LogisticDetection              LogisticRegression Detection              0.882716  0          1          MAXIMIZE
SVCDetection                   SVC Detection                             0.833333  0          1          MAXIMIZE
BNLikelihood                   BayesianNetwork Likelihood                nan       0          1          MAXIMIZE
BNLogLikelihood                BayesianNetwork Log Likelihood            nan       -inf       0          MAXIMIZE
LogisticParentChildDetection   LogisticRegression Detection              0.619444  0          1          MAXIMIZE
SVCParentChildDetection        SVC Detection                             0.916667  0          1          MAXIMIZE

What's next?

If you want to read more about each individual metric, please visit the metric folders in the repository.

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project.

Comments
  • Gh stronger detection classifiers

    Add Random Forest and Gradient Boosting from sklearn to the single table detection tests. Being able to fool these classifiers would be a great improvement for generative models.

    opened by TanguyUrvoy 7
  • README.md example has a bug

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.7.0
    • Python version: 3.8
    • Operating System: Linux VM
    • pandas version: 1.2.4

    Error Description

    Following the README.md example of calculating the BoundaryAdherence:

    Running the code gives the following error: AttributeError: 'Series' object has no attribute 'columns'

    I would expect the code to generate the BoundaryAdherence score for the start_date column of the real_data and synthetic_data pandas DataFrames.

    Steps to reproduce

    Follow this snippet from the README.md that shows the usage of BoundaryAdherence

    # calculate whether the synthetic data respects the min/max bounds
    # set by the real data
    from sdmetrics.single_table import BoundaryAdherence
    
    BoundaryAdherence.compute(
        real_data['start_date'],
        synthetic_data['start_date']
    )
    

    Solution

    I will open a PR for this later today or tomorrow.

    type(real_data['start_date']) 
    

    Returns a pandas.core.series.Series

    type(real_data[['start_date']])
    

    Returns a pandas.core.frame.DataFrame

    BoundaryAdherence.compute() expects DataFrames for both the real_data and synthetic_data arguments.

    # calculate whether the synthetic data respects the min/max bounds
    # set by the real data
    from sdmetrics.single_table import BoundaryAdherence
    
    BoundaryAdherence.compute(
        real_data[['start_date']],
        synthetic_data[['start_date']]
    )
    
    documentation resolution:resolved 
    opened by Pverheijen 5
  • KeyError: 'fields' in reports.generate

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.7.0
    • Python version: 3.8.8
    • Operating System: Windows Server 2016 (data server)

    Error Description

    Steps to reproduce

    I tried to use sdmetrics to compare the synthetic data with the real data. Example of the dataset:

    Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size
    BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize
    QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsize

    I loaded the Gaussian copula pkl model to generate the synthetic data and created metadata on the original dataset to use in the quality report. The metadata created for the data above is: {'fields': {'Customer': {'type': 'id', 'subtype': 'string'}, 'State': {'type': 'categorical'}, 'Customer Lifetime Value': {'type': 'numerical', 'subtype': 'float'}, 'Response': {'type': 'categorical'}, 'Coverage': {'type': 'categorical'}, 'Education': {'type': 'categorical'}, 'Effective To Date': {'type': 'categorical'}, 'EmploymentStatus': {'type': 'categorical'}, 'Gender': {'type': 'categorical'}, 'Income': {'type': 'numerical', 'subtype': 'integer'}, 'Location Code': {'type': 'categorical'}, 'Marital Status': {'type': 'categorical'}, 'Monthly Premium Auto': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Last Claim': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Policy Inception': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Open Complaints': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Policies': {'type': 'numerical', 'subtype': 'integer'}, 'Policy Type': {'type': 'categorical'}, 'Policy': {'type': 'categorical'}, 'Renew Offer Type': {'type': 'categorical'}, 'Sales Channel': {'type': 'categorical'}, 'Total Claim Amount': {'type': 'numerical', 'subtype': 'float'}, 'Vehicle Class': {'type': 'categorical'}, 'Vehicle Size': {'type': 'categorical'}}, 'primary_key': 'Customer'}.

    # create the metadata
    metadata.add_table(
        name='INS',
        data=data2,
        primary_key='Customer'
    )
    
    
    from sdmetrics.reports.single_table import QualityReport
    
    report = QualityReport()
    report.generate(data2, synthetic_data, metadata)
    
    
    error:
    Creating report:   0%|          | 0/4 [00:00<?, ?it/s]
    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    Input In [54], in <cell line: 2>()
          1 report = QualityReport()
    ----> 2 report.generate(data2, synthetic_data,metadata)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\reports\single_table\quality_report.py:72, in QualityReport.generate(self, real_data, synthetic_data, metadata)
         70 for metric in tqdm.tqdm(metrics, desc='Creating report'):
         71     try:
    ---> 72         self._metric_results[metric.__name__] = metric.compute_breakdown(
         73             real_data, synthetic_data, metadata)
         74     except IncomputableMetricError:
         75         # Metric is not compatible with this dataset.
         76         self._metric_results[metric.__name__] = {}
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:147, in MultiSingleColumnMetric.compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
        123 @classmethod
        124 def compute_breakdown(cls, real_data, synthetic_data, metadata=None, **kwargs):
        125     """Compute this metric broken down by column.
        126 
        127     This is done by computing the underlying SingleColumn metric to all the
       (...)
        145             A mapping of column name to metric output.
        146     """
    --> 147     return cls._compute(
        148         cls, real_data, synthetic_data, metadata, store_errors=True, **kwargs)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:68, in MultiSingleColumnMetric._compute(self, real_data, synthetic_data, metadata, store_errors, **kwargs)
         43 def _compute(self, real_data, synthetic_data, metadata=None, store_errors=False, **kwargs):
         44     """Compute this metric for all columns.
         45 
         46     This is done by computing the underlying SingleColumn metric to all the
       (...)
         66             A mapping of column name to metric output.
         67     """
    ---> 68     real_data, synthetic_data, metadata = self._validate_inputs(
         69         real_data, synthetic_data, metadata)
         71     fields = self._select_fields(metadata, self.field_types)
         72     invalid_cols = set(metadata['fields'].keys()) - set(fields)
    
    File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\base.py:119, in SingleTableMetric._validate_inputs(cls, real_data, synthetic_data, metadata)
        116 if not isinstance(metadata, dict):
        117     metadata = metadata.to_dict()
    --> 119 fields = metadata['fields']
        120 for column in real_data.columns:
        121     if column not in fields:
    
    KeyError: 'fields'
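
    QualityReport is a single-table report, so it expects the table-level metadata dict (the one with a 'fields' key) rather than the dataset-level Metadata object. A possible fix, assuming the old SDV Metadata API's get_table_meta method:

    # Extract the table-level metadata dict, which contains 'fields'
    # (get_table_meta is assumed from the old SDV Metadata API).
    table_meta = metadata.get_table_meta('INS')
    
    report = QualityReport()
    report.generate(data2, synthetic_data, table_meta)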
    
    
    bug resolution:WAI 
    opened by ketandaryanani 3
  • reports.generate 'fields' index issue

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.7.0
    • Python version: 3.9.13
    • Operating System: Windows under WSL2 (Ubuntu)

    Error Description

    Attempting to generate a report with the following commands: report = QualityReport(); report.generate(x_y_df, synth_data, meta_data)

    Steps to reproduce

    import json
    
    from sdmetrics.reports.single_table import QualityReport
    
    x_y_df.to_json(reports_path + '/' + 'xy_df.json')
    
    with open(reports_path + '/' + 'xy_df.json') as f:
        meta_data = json.load(f)
    
    # Initialize report
    report = QualityReport()
    report.generate(x_y_df, synth_data, meta_data)
    

    Traceback:

    File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/reports/single_table/quality_report.py", line 72, in generate
        self._metric_results[metric.__name__] = metric.compute_breakdown(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 147, in compute_breakdown
        return cls._compute(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 68, in _compute
        real_data, synthetic_data, metadata = self._validate_inputs(
      File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/base.py", line 119, in _validate_inputs
        fields = metadata['fields']
    KeyError: 'fields'
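
    For reference, the report expects old-style single-table metadata shaped like the following (the column names here are placeholders), not the JSON dump of the data itself:

    meta_data = {
        'fields': {
            'numeric_col': {'type': 'numerical', 'subtype': 'float'},
            'category_col': {'type': 'categorical'},
        }
    }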
    
    

    Thanks, Ben

    bug resolution:WAI 
    opened by bdeck8317 3
  • NewRowSynthesis: ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval

    Environment Details

    • SDV version: sdv==0.17.1
    • Python version: Python 3.9.13
    • Operating System: Linux

    Error Description

    pandas==1.4.3

    ValueError when running NewRowSynthesis

    Steps to reproduce

    # demo loader and model imports added (SDV 0.17-era API)
    from sdv.demo import load_tabular_demo
    from sdv.tabular import GaussianCopula
    from sdmetrics.single_table import NewRowSynthesis
    
    metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)
    
    model = GaussianCopula(
        primary_key="student_id"
    )
    model.fit(real_data)
    synthetic_data = model.sample(250)
    
    new_row_synthesis_score = NewRowSynthesis.compute(
        real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
    )
    
    ValueError                                Traceback (most recent call last)
    Cell In [43], line 11
          8 model.fit(real_data)
          9 synthetic_data = model.sample(250)
    ---> 11 new_row_synthesis_score = NewRowSynthesis.compute(
         12     real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
         13 )
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_table/new_row_synthesis.py:104, in NewRowSynthesis.compute(cls, real_data, synthetic_data, metadata, numerical_match_tolerance, synthetic_sample_size)
        101     row_filter.append(field_filter)
        103 try:
    --> 104     matches = real_data.query(' and '.join(row_filter))
        105 except TypeError:
        106     if len(real_data) > 10000:
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4111, in DataFrame.query(self, expr, inplace, **kwargs)
       4109 kwargs["level"] = kwargs.pop("level", 0) + 1
       4110 kwargs["target"] = None
    -> 4111 res = self.eval(expr, **kwargs)
       4113 try:
       4114     result = self.loc[res]
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4240, in DataFrame.eval(self, expr, inplace, **kwargs)
       4237     kwargs["target"] = self
    ...
        328     )
        329 engine = _check_engine(engine)
        330 _check_parser(parser)
    
    ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval
    
    
    bug feature:metrics resolution:resolved 
    opened by darenr 2
  • BoundaryAdherence: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

    Environment Details

    • SDV version: sdv==0.17.1
    • Python version: Python 3.9.13
    • Operating System: Linux

    Error Description

    pandas==1.4.3

    ValueError when running BoundaryAdherence

    Steps to reproduce

    # demo loader and model imports added (SDV 0.17-era API)
    from sdv.demo import load_tabular_demo
    from sdv.tabular import GaussianCopula
    from sdmetrics.single_column import BoundaryAdherence
    
    metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)
    
    model = GaussianCopula(
        primary_key="student_id"
    )
    model.fit(real_data)
    synthetic_data = model.sample(250)
    
    BoundaryAdherence.compute(
        real_data=real_data, synthetic_data=synthetic_data
    )
    
    ValueError                                Traceback (most recent call last)
    Cell In [42], line 11
          8 model.fit(real_data)
          9 synthetic_data = model.sample(250)
    ---> 11 BoundaryAdherence.compute(
         12     real_data=real_data, synthetic_data=synthetic_data
         13 )
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_column/statistical/boundary_adherence.py:46, in BoundaryAdherence.compute(cls, real_data, synthetic_data)
         32 @classmethod
         33 def compute(cls, real_data, synthetic_data):
         34     """Compute the boundary adherence of two continuous columns.
         35 
         36     Args:
       (...)
         44             The boundary adherence of the two columns.
         45     """
    ---> 46     real_data = pd.Series(real_data).dropna()
         47     synthetic_data = pd.Series(synthetic_data).dropna()
         49     if is_datetime(real_data):
    
    File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/series.py:367, in __init__(self, data, index, dtype, name, copy, fastpath)
        364         self.name = name
        365     return
    ...
       1528         f"The truth value of a {type(self).__name__} is ambiguous. "
       1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
       1530     )
    
    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
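
    For reference, the single-column BoundaryAdherence metric expects one column at a time, not whole DataFrames. A sketch of the intended call (assuming 'salary' is a numerical column in this demo):

    BoundaryAdherence.compute(
        real_data=real_data['salary'],
        synthetic_data=synthetic_data['salary']
    )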
    
    
    bug resolution:WAI feature:metrics 
    opened by darenr 2
  • Update README.md to fix a bug

    BoundaryAdherence expects DataFrames for the real and synthetic dataframe arguments; currently it receives Series because date_range is accessed with single square brackets instead of double square brackets.

    opened by Pverheijen 2
  • Academic paper to cite?

    Is there any paper to cite related to this library? I am evaluating some synthetic data for a publication so I wanted to include your contribution among the citations.

    Thanks Andrea Galloni

    question resolution:resolved 
    opened by andreagalloni92 2
  • SDMetrics 0.4.2 has incompatible copula version with SDV

    Environment Details

    Please indicate the following details about the environment in which you found the bug:

    • SDMetrics version: 0.4.2
    • SDV version: 0.14.1
    • Python version: 3.8.10
    • Operating System: Ubuntu Server 20.04.4 LTS

    Error Description

    The latest SDMetrics version (which is installed by default when installing SDV) has incompatible copula requirements with downstream SDV.

    Steps to reproduce

    On a fresh virtual environment, install pip-tools.

    Place the following on a file named requirements.in

    sdv
    #sdmetrics==0.4.1
    

    Type the following commands

    pip install -r requirements.in
    pip-compile requirements.in
    

    pip-compile reports:

    Could not find a version that matches copulas<0.7,<0.8,>=0.6.1,>=0.7.0 (from sdv==0.14.1->-r requirements.txt (line 1))
    Tried: 0.0.0, 0.0.0, 0.1.0, 0.1.0, 0.1.1, 0.1.1, 0.2.0, 0.2.0, 0.2.1, 0.2.1, 0.2.3, 0.2.3, 0.2.4, 0.2.4, 0.2.5, 0.2.5, 0.3.0, 0.3.0, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.6.0, 0.6.0, 0.6.1, 0.6.1, 0.7.0, 0.7.0
    Skipped pre-versions: 0.3.0.dev0, 0.3.0.dev0, 0.3.2.dev1, 0.3.2.dev1, 0.3.3.dev0, 0.3.3.dev0, 0.4.0.dev0, 0.4.0.dev0, 0.5.0.dev0, 0.5.0.dev0, 0.5.0.dev1, 0.5.0.dev1, 0.5.1.dev0, 0.5.1.dev0, 0.5.1.dev1, 0.5.1.dev1, 0.5.2.dev0, 0.5.2.dev0, 0.5.2.dev1, 0.5.2.dev1, 0.6.0.dev0, 0.6.0.dev0, 0.6.1.dev0, 0.6.1.dev0, 0.7.0.dev0, 0.7.0.dev0
    There are incompatible versions in the resolved dependencies:
      copulas<0.8,>=0.7.0 (from sdmetrics==0.4.2->sdv==0.14.1->-r requirements.txt (line 1))
      copulas<0.7,>=0.6.1 (from sdv==0.14.1->-r requirements.txt (line 1))
    

    From the setup.py of both projects, we can verify the above requirements.

    pip install works correctly, but we get the following (snippet):

    Collecting llvmlite<0.39,>=0.38.0rc1
      Using cached llvmlite-0.38.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
    Collecting charset-normalizer~=2.0.0; python_version >= "3"
      Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
    Collecting idna<4,>=2.5; python_version >= "3"
      Using cached idna-3.3-py3-none-any.whl (61 kB)
    Collecting certifi>=2017.4.17
      Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
    Collecting urllib3<1.27,>=1.21.1
      Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
    ERROR: numba 0.55.1 has requirement numpy<1.22,>=1.18, but you'll have numpy 1.22.3 which is incompatible.
    ERROR: rdt 0.6.4 has requirement scipy<1.8,>=1.5.4, but you'll have scipy 1.8.0 which is incompatible.
    ERROR: sdmetrics 0.4.2 has requirement copulas<0.8,>=0.7.0, but you'll have copulas 0.6.1 which is incompatible.
    Installing collected packages: tqdm, typing-extensions, torch, numpy, six, python-dateutil, pytz, pandas, deepecho, scipy, threadpoolctl, joblib, scikit-learn, llvmlite, numba, pyts, pyyaml, psutil, rdt, fonttools, cycler, pyparsing, packaging, pillow, kiwisolver, matplotlib, copulas, sdmetrics, charset-normalizer, idna, certifi, urllib3, requests, torchvision, ctgan, graphviz, text-unidecode, Faker, sdv
    

    When uncommenting sdmetrics from requirements.in, both commands run "correctly".

    Furthermore, when pip-compile and pip have cached sdmetrics==0.4.1, they both select that version instead and no error is shown.

    The following file never compiles:

    sdv==0.14.1
    sdmetrics==0.4.2
    

    I don't know what the appropriate solution to something like this would be. I'm not a library developer.

    bug 
    opened by antheas 2
  • README doesn't accurately describe the output of `compute_metrics`

    The current README doesn't show the latest output. More specifically, the command sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata) no longer prints the same table as the README shows (e.g. the current code produces a column named error containing None values, which the README doesn't have, along with other changes).

    documentation resolution:obsolete 
    opened by fealho 2
  • Relational `KSTest` crashes with `IncomputableMetricError` if a table doesn't have numerical columns

    Environment Details

    • SDMetrics version: 0.4.1
    • Python version: 3.7

    Error Description

    The relational KSTest is supposed to run the KSTest on all numerical columns in all tables and return the average score.

    However, this test crashes if it encounters a table that has no numerical columns. I expect this test to succeed as long as there is at least 1 numerical column in any of the tables.

    Steps to reproduce

    Use the relational demo dataset and pass it in with the metadata.

    from sdv.metrics.demos import load_multi_table_demo
    from sdv.metrics.relational import KSTest
    
    real_data, synthetic_data, metadata = load_multi_table_demo()
    KSTest.compute(real_data, synthetic_data, metadata)
    

    Output:

    /usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/base.py in _select_fields(cls, metadata, types)
         78 
         79         if len(fields) == 0:
    ---> 80             raise IncomputableMetricError(f'Cannot find fields of types {types}')
         81 
         82         return fields
    
    IncomputableMetricError: Cannot find fields of types ('numerical',)
    

    I believe this is happening because the sessions table has no numerical columns. Interestingly, it does work if I exclude the metadata object -- because then it starts assuming that the id field is a numerical column.

    KSTest.compute(real_data, synthetic_data)
    
    0.8555555555555556
    
    bug 
    opened by npatki 2
  • Detection metrics should only use statistically modeled columns (filter out the rest)

    Problem Description

    The Detection metrics use machine learning to determine whether the real vs. synthetic data can be detected. For this to work, we should only be using columns that are statistically modeled.

    Expected behavior

    When running any of the detection metrics, the following columns should be ignored:

    • Primary keys
    • Foreign keys
    • Any other kinds of IDs
    • PII or sensitive data
    • Text data (or data created by RegEx)

    None of these columns provide any useful information for detection.

    The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), boolean
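
    A schematic of the desired filtering, based on the old-style 'fields' metadata used elsewhere in this document (the helper name and the 'pii' flag are assumptions):

    # Keep only the columns whose metadata type is statistically modeled.
    MODELED_TYPES = {'numerical', 'datetime', 'categorical', 'boolean'}
    
    def modeled_columns(metadata):
        return [
            name for name, field in metadata['fields'].items()
            if field['type'] in MODELED_TYPES and not field.get('pii', False)
        ]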

    Additional context

    We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.

    feature request 
    opened by npatki 0
  • Does removing foreign keys in detection metrics for multi-tables make sense?

    Environment details

    If you are already running SDMetrics, please indicate the following details about the environment in which you are running it:

    • SDMetrics version: 0.8.2
    • Python version:
    • Operating System:

    Problem description

    Nice correction for DetectionMetric (Solving primary_key use for detection metrics #251, https://github.com/sdv-dev/SDMetrics/pull/251). Removing the primary key from the table makes more sense for the evaluation. But what about foreign keys present in a table of a relational database? Doesn't the same problem also occur for foreign keys, so that they should be removed too? Also, for the parent-child logistic detection metric, the foreign keys (referencing parent rows) in the child tables are no longer required once the denormalized tables are used.

    question under discussion 
    opened by mohamedgy 1
  • Visualize cardinality of foreign key columns

    I'm filing this issue on behalf of a user request on our Slack.

    Problem Description

    Currently, the single column visualization function (utils.get_column_plot) only supports columns that are numerical, categorical, boolean or datetime. It would be nice to support foreign keys as well.

    Expected behavior

    If I provide a foreign key name into the utils.get_column_plot function, then I expect to see a plot of real vs. synthetic data:

    1. Compute the cardinality (# of children with the same parent) for the synthetic and real data. This will form 2 distributions for real & synthetic data
    2. Plot those distributions similar to how we plot numerical data. The x-axis label should read "Cardinality".

    This can be done from get_column_plot.

    from sdmetrics.reports import utils
    
    fig = utils.get_column_plot(
        real_data=real_table,
        synthetic_data=synthetic_table,
        column_name='foreign_key_column',
        metadata=my_table_metadata_dict
    )
    
    fig.show()
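
    A sketch of step 1, computing the cardinality distributions with pandas (reusing the hypothetical names from the snippet above):

    # Number of children per parent = size of each foreign-key group.
    real_cardinality = real_table['foreign_key_column'].value_counts()
    synthetic_cardinality = synthetic_table['foreign_key_column'].value_counts()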
    
    feature request 
    opened by npatki 1
  • Quality Report crashes when numerical column has only `NaN` values

    Environment Details

    • SDMetrics version: 0.8.0
    • Python version: 3.7
    • Operating System: Linux

    Error Description

    A numerical column in the real data may contain missing values. Sometimes, the synthetic data may only produce these missing values and fail to create any numerical values. In such cases, the software crashes when I try to produce a quality report.

    Expected Behavior: Certain metrics may not be computable if there are only NaN values. But instead of crashing the report, the error should be noted in the detailed breakdowns, and the report should still produce a score while ignoring the values (along with details, visualizations, etc.)

    Steps to reproduce

    import numpy as np
    import pandas as pd
    from sdmetrics.reports.single_table import QualityReport
    
    real_data = pd.DataFrame(data={
        'col1': [1, 2, 1, 3, 4],
        'col2': [2, 4, 1, 7, 1]
    })
    
    # 'col2' contains only NaN values
    synthetic_data = pd.DataFrame(data={
        'col1': [1, 3, 2, 2, 1],
        'col2': [np.nan]*5
    })
    
    metadata = {
        'fields': {
            'col1': { 'type': 'numerical', 'subtype': 'integer' },
            'col2': { 'type': 'numerical', 'subtype': 'integer' }
        }
    }
    
    report = QualityReport()
    report.generate(real_data, synthetic_data, metadata)
    

    Output

    Creating report:  50%|█████     | 2/4 [00:00<00:00, 106.06it/s]
    ---------------------------------------------------------------------------
    ValueError                                Traceback (most recent call last)
    <ipython-input-55-a6be2bd142ce> in <module>
         17 
         18 report = QualityReport()
    ---> 19 report.generate(real_data, synthetic_data, metadata)
    
    3 frames
    /usr/local/lib/python3.7/dist-packages/sdmetrics/reports/single_table/quality_report.py in generate(self, real_data, synthetic_data, metadata)
         71             try:
         72                 self._metric_results[metric.__name__] = metric.compute_breakdown(
    ---> 73                     real_data, synthetic_data, metadata)
         74             except IncomputableMetricError:
         75                 # Metric is not compatible with this dataset.
    
    /usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/multi_column_pairs.py in compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
        128             synthetic = synthetic_data[list(sorted_columns)]
        129             breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
    --> 130                 real, synthetic, **kwargs)
        131 
        132         return breakdown
    
    /usr/local/lib/python3.7/dist-packages/sdmetrics/column_pairs/statistical/correlation_similarity.py in compute_breakdown(cls, real_data, synthetic_data, coefficient)
         83 
         84         correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
    ---> 85         correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
         86 
         87         if np.isnan(correlation_real) or np.isnan(correlation_synthetic):
    
    /usr/local/lib/python3.7/dist-packages/scipy/stats/stats.py in pearsonr(x, y)
       4014 
       4015     if n < 2:
    -> 4016         raise ValueError('x and y must have length at least 2.')
       4017 
       4018     x = np.asarray(x)
    
    ValueError: x and y must have length at least 2.
    

    Note: It is OK that the correlation metric is crashing (correlation is undefined if there are no values). But the report should not crash.

    bug feature:reports 
    opened by npatki 0
Releases
  • v0.8.1 (Dec 10, 2022)

    This release fixes bugs in the existing metrics and reports. We also make the reports compatible with future SDV versions.

    New Features

    • Filter out additional sdtypes that will be available in future versions of SDV - Issue #265 by @katxiao
    • NewRowSynthesis should ignore PrimaryKey column - Issue #260 by @katxiao

    Bug Fixes

    • Visualization crashes if there are metric errors - Issue #272 by @katxiao
    • Score for TVComplement if synthetic data only has missing values - Issue #271 by @katxiao
    • Fix 'timestamp' column metadata in the multi table demo - Issue #267 by @katxiao
    • Fix 'duration' column in the single table demo - Issue #266 by @katxiao
    • README.md example has a bug - Issue #262 by @katxiao
    • Update README.md to fix a bug - Issue #263 by @katxiao
    • Visualization get_column_pair_plot: update parameter name to column_names - Issue #258 by @katxiao
    • "Column Shapes" and "Column Pair Trends" Calculation Inconsistency - Issue #254 by @katxiao
    • Diagnostic Report missing RangeCoverage for numerical columns - Issue #255 by @katxiao

  • v0.8.0 (Nov 16, 2022)

    This release introduces the DiagnosticReport, which helps a user verify – at a quick glance – that their data is valid. We also fix an existing bug with detection metrics.

    New Features

    • Fixes for new metadata - Issue #253 by @katxiao
    • Add default synthetic sample size to DiagnosticReport - Issue #248 by @katxiao
    • Exclude pii columns from single table metrics - Issue #245 by @katxiao
    • Accept both old and new metadata - Issue #244 by @katxiao
    • Address Diagnostic Report and metric edge cases - Issue #243 by @katxiao
    • Update visualization average per table - Issue #242 by @katxiao
    • Add save and load functionality to multi-table DiagnosticReport - Issue #218 by @katxiao
    • Visualization methods for the multi-table DiagnosticReport - Issue #217 by @katxiao
    • Add getter methods to multi-table DiagnosticReport - Issue #216 by @katxiao
    • Create multi-table DiagnosticReport - Issue #215 by @katxiao
    • Visualization methods for the single-table DiagnosticReport - Issue #211 by @katxiao
    • Add getter methods to single-table DiagnosticReport - Issue #210 by @katxiao
    • Create single-table DiagnosticReport - Issue #209 by @katxiao
    • Add save and load functionality to single-table DiagnosticReport - Issue #212 by @katxiao
    • Add single table diagnostic report - Issue #237 by @katxiao

    Bug Fixes

    • Detection test doesn't look at metadata when determining which columns to use - Issue #119 by @R-Palazzo

    Internal Improvements

    • Remove torch dependency - Issue #233 by @katxiao
    • Update README - Issue #250 by @katxiao
  • v0.7.0 (Sep 27, 2022)

    This release introduces the QualityReport, which evaluates how well synthetic data captures mathematical properties from the real data. The QualityReport incorporates the new metrics introduced in the previous release, and allows users to get detailed results, visualize the scores, and save the report for future viewing. We also add utility methods for visualizing columns and pairs of columns.
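
    A minimal sketch of the resulting workflow (API names taken from the issues listed below; the save path is hypothetical):

    from sdmetrics.reports.single_table import QualityReport
    
    report = QualityReport()
    report.generate(real_data, synthetic_data, metadata)
    report.save('quality_report.pkl')  # save/load behavior from issue #187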

    New Features

    • Catch typeerror in new row synthesis query - Issue #234 by @katxiao
    • Add NewRowSynthesis Metric - Issue #207 by @katxiao
    • Update plot utilities API - Issue #228 by @katxiao
    • Fix column pairs visualization bug - Issue #230 by @katxiao
    • Save version - Issue #229 by @katxiao
    • Update efficacy metrics API - Issue #227 by @katxiao
    • Add RangeCoverage Metric - Issue #208 by @katxiao
    • Add get_column_pairs_plot utility method - Issue #223 by @katxiao
    • Parse date as datetime - Issue #222 by @katxiao
    • Update error handling for reports - Issue #221 by @katxiao
    • Visualization API update - Issue #220 by @katxiao
    • Bug fixes for QualityReport - Issue #219 by @katxiao
    • Update column pair metric calculation - Issue #214 by @katxiao
    • Add get score methods for multi table QualityReport - Issue #190 by @katxiao
    • Add multi table QualityReport visualization methods - Issue #192 by @katxiao
    • Add plot_column visualization utility method - Issue #193 by @katxiao
    • Add save and load behavior to multi table QualityReport - Issue #188 by @katxiao
    • Create multi-table QualityReport - Issue #186 by @katxiao
    • Add single table QualityReport visualization methods - Issue #191 by @katxiao
    • Add save and load behavior to single table QualityReport - Issue #187 by @katxiao
    • Add get score methods for single table Quality Report - Issue #189 by @katxiao
    • Create single-table QualityReport - Issue #185 by @katxiao

    Internal Improvements

    • Auto apply "new" label instead of "pending review" - Issue #164 by @katxiao
    • fix typo - Issue #195 by @fealho
  • v0.6.0 (Aug 12, 2022)

    This release removes SDMetrics' dependency on the RDT library and introduces new quality and diagnostic metrics. Additionally, we introduce a new compute_breakdown method that returns a breakdown of metric results.
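
    A minimal sketch of compute_breakdown (the signature follows the tracebacks quoted above; the single-table KSComplement import path is an assumption):

    from sdmetrics.single_table import KSComplement
    
    # Returns a per-column mapping of scores instead of one aggregate number.
    breakdown = KSComplement.compute_breakdown(real_data, synthetic_data, metadata)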

    New Features

    • Handle null values correctly - Issue #194 by @katxiao
    • Add wrapper classes for new single and multi table metrics - Issue #169 by @katxiao
    • Add CorrelationSimilarity metric - Issue #143 by @katxiao
    • Add CardinalityShapeSimilarity metric - Issue #160 by @katxiao
    • Add CardinalityStatisticSimilarity metric - Issue #145 by @katxiao
    • Add ContingencySimilarity Metric - Issue #159 by @katxiao
    • Add TVComplement metric - Issue #142 by @katxiao
    • Add MissingValueSimilarity metric - Issue #139 by @katxiao
    • Add CategoryCoverage metric - Issue #140 by @katxiao
    • Add compute breakdown column for single column - Issue #152 by @katxiao
    • Add BoundaryAdherence metric - Issue #138 by @katxiao
    • Get KSComplement Score Breakdown - Issue #130 by @katxiao
    • Add StatisticSimilarity Metric - Issue #137 by @katxiao
    • New features for KSTest.compute - Issue #129 by @amontanez24

    Internal Improvements

    • Add integration tests and fixes - Issue #183 by @katxiao
    • Remove rdt hypertransformer dependency in timeseries metrics - Issue #176 by @katxiao
    • Replace rdt LabelEncoder with sklearn - Issue #178 by @katxiao
    • Remove rdt as a dependency - Issue #182 by @katxiao
    • Use sklearn's OneHotEncoder instead of rdt - Issue #170 by @katxiao
    • Remove KSTestExtended - Issue #180 by @katxiao
    • Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue #171 by @katxiao
    • Update the default tags for a feature request - Issue #172 by @katxiao
    • Bump github macos version - Issue #174 by @katxiao
    • Fix pydocstyle to check sdmetrics - Issue #153 by @pvk-developer
    • Update the RDT version to 1.0 - Issue #150 by @pvk-developer
    • Update slack invite link - Issue #132 by @pvk-developer
  • v0.5.0 (May 10, 2022)

    This release fixes an error where the relational KSTest crashes if a table doesn't have numerical columns. It also includes some housekeeping, updating the pomegranate and copulas version requirements.

    Issues closed

    • Cap pomegranate to <0.14.7 - Issue #116 by @csala
    • Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue #109 by @katxiao
  • v0.4.1 (Dec 9, 2021)

    This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.

    Issues closed

    • Report metric errors from compute_metrics - Issue #107 by @katxiao
    • Specify default categorical transformers - Issue #105 by @katxiao
  • v0.4.0 (Nov 16, 2021)

    This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.

    Issues closed

    • Replace sktime for pyts - Issue #103 by @pvk-developer
    • Add support for Python 3.9 - Issue #102 by @pvk-developer
    • Increase code style lint - Issue #80 by @fealho
    • Add pip check to CI workflows - Issue #79 by @pvk-developer
    • Upgrade dependency ranges - Issue #69 by @katxiao
  • v0.3.2 (Aug 17, 2021)

  • v0.3.1 (Jul 12, 2021)

    This release fixes a bug to make the privacy metrics available in the API docs. It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.

    Issues closed

    • CategoricalSVM not being imported - Issue #65 by @csala
  • v0.3.0 (Mar 31, 2021)

    This release includes privacy metrics to evaluate whether the real data could be obtained or deduced from the synthetic samples. Additionally, all metrics now have a normalize method, which takes the raw_score generated by the metric and returns a value between 0 and 1.
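
    A minimal sketch of the new method, using CSTest for illustration:

    # normalize maps the raw_score produced by compute onto the [0, 1] range.
    raw_score = CSTest.compute(real_data, synthetic_data)
    score = CSTest.normalize(raw_score)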

    Issues closed

    • Add normalize method to metrics - Issue #51 by @csala and @fealho
    • Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho
  • v0.2.0 (Feb 24, 2021)

  • v0.1.3 (Feb 15, 2021)

  • v0.1.2 (Jan 27, 2021)

    Bug-fixing release that addresses several minor errors.

    Issues closed

    • More splits than classes - Issue #46 by @fealho
    • Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho
    • Time series metrics fails with variable length timeseries - Issue #42 by @fealho
    • ParentChildDetection metrics KeyError - Issue #39 by @csala
  • v0.1.1 (Dec 30, 2020)

    This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work well with binary targets which are not boolean.

    Issues closed

    • Timeseries efficacy metrics - Issue #35 by @csala
    • Timeseries detection metrics - Issue #34 by @csala
    • Ensure binary classification targets are bool - Issue #33 by @csala
  • v0.1.0 (Dec 18, 2020)

    This release introduces a new project organization and API, with metrics grouped by data modality, each sharing a common API:

    • Single Column
    • Column Pair
    • Single Table
    • Multi Table
    • Time Series

    Within each data modality, different families of metrics have been implemented:

    • Statistical
    • Detection
    • Bayesian Network and Gaussian Mixture Likelihood
    • Machine Learning Efficacy
  • v0.0.4 (Nov 27, 2020)

  • v0.0.3 (Nov 20, 2020)

    Fix error on detection metrics when input data contains infinity or NaN values.

    Issues closed

    • ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
  • v0.0.2 (Aug 8, 2020)

  • v0.0.1 (Jun 26, 2020)

Owner
The Synthetic Data Vault Project