Metrics to evaluate quality and efficacy of synthetic datasets.

The Synthetic Data Vault Project

Last update: Jan 3, 2023


An Open Source Project from the Data to AI Lab, at MIT

Metrics for Synthetic Data Generation Projects

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV
Repository: https://github.com/sdv-dev/SDMetrics
License: MIT
Development Status: Pre-Alpha

Overview

The SDMetrics library provides a set of dataset-agnostic tools for evaluating the quality of a synthetic database by comparing it to the real database that it is modeled after.

It supports multiple data modalities:

Single Columns: Compare one-dimensional numpy arrays representing individual columns.
Column Pairs: Compare how columns in a pandas.DataFrame relate to each other, in groups of 2.
Single Table: Compare an entire table, represented as a pandas.DataFrame.
Multi Table: Compare multi-table and relational datasets represented as a python dict with multiple tables passed as pandas.DataFrames.
Time Series: Compare tables representing ordered sequences of events.

It includes a variety of metrics such as:

Statistical metrics which use statistical tests to compare the distributions of the real and synthetic data.
Detection metrics which use machine learning to try to distinguish between real and synthetic data.
Efficacy metrics which compare the performance of machine learning models when run on the synthetic and real data.
Bayesian Network and Gaussian Mixture metrics which learn the distribution of the real data and evaluate the likelihood of the synthetic data belonging to the learned distribution.
Privacy metrics which evaluate whether the synthetic data is leaking information about the real data.
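As a rough illustration of the statistical family, the sketch below computes an inverted two-sample Kolmogorov-Smirnov D statistic (1 minus the largest gap between the two empirical CDFs), so that 1 is the best score and 0 the worst. The function name `inverted_ks_statistic` and the sample columns are hypothetical, and this is not the library's own implementation.

```python
from bisect import bisect_right


def inverted_ks_statistic(real, synthetic):
    """Return 1 - D, where D is the two-sample Kolmogorov-Smirnov statistic."""
    real_sorted = sorted(real)
    synthetic_sorted = sorted(synthetic)

    def ecdf(sorted_values, x):
        # Fraction of observations less than or equal to x.
        return bisect_right(sorted_values, x) / len(sorted_values)

    # The largest gap between two empirical CDFs occurs at an observed value.
    points = set(real_sorted) | set(synthetic_sorted)
    d_statistic = max(
        abs(ecdf(real_sorted, x) - ecdf(synthetic_sorted, x)) for x in points
    )
    return 1.0 - d_statistic


# Hypothetical toy columns, used only to exercise the function.
real_column = [1, 2, 2, 3, 4, 5]
synthetic_column = [1, 2, 3, 3, 4, 6]
print(inverted_ks_statistic(real_column, synthetic_column))
```

Identical columns score 1.0 and columns with no overlap in range score 0.0.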

Install

SDMetrics is part of the SDV project and is automatically installed alongside it. For details about this process, please visit the SDV Installation Guide.

Optionally, SDMetrics can also be installed as a standalone library using the following commands:

Using pip:

pip install sdmetrics

Using conda:

conda install -c sdv-dev -c conda-forge -c pytorch sdmetrics

For more installation options, please visit the SDMetrics Installation Guide.

Usage

SDMetrics is included as part of the framework offered by SDV to evaluate the quality of your synthetic dataset. For more details about how to use it please visit the corresponding User Guide:

Evaluating Synthetic Data

Standalone usage

SDMetrics can also be used as a standalone library to run metrics individually.

In this short example we show how to use it to evaluate a toy multi-table dataset and its synthetic replica by running all the compatible multi-table metrics on it:

import sdmetrics

# Load the demo data, which includes:
# - A dict containing the real tables as pandas.DataFrames.
# - A dict containing the synthetic clones of the real data.
# - A dict containing metadata about the tables.
real_data, synthetic_data, metadata = sdmetrics.load_demo()

# Obtain the list of multi table metrics, which is returned as a dict
# containing the metric names and the corresponding metric classes.
metrics = sdmetrics.multi_table.MultiTableMetric.get_subclasses()

# Run all the compatible metrics and get a report
sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata)

The output will be a table with all the details about the executed metrics and their score:

metric	name	score	min_value	max_value	goal
CSTest	Chi-Squared	0.76651	0	1	MAXIMIZE
KSTest	Inverted Kolmogorov-Smirnov D statistic	0.75	0	1	MAXIMIZE
KSTestExtended	Inverted Kolmogorov-Smirnov D statistic	0.777778	0	1	MAXIMIZE
LogisticDetection	LogisticRegression Detection	0.882716	0	1	MAXIMIZE
SVCDetection	SVC Detection	0.833333	0	1	MAXIMIZE
BNLikelihood	BayesianNetwork Likelihood	nan	0	1	MAXIMIZE
BNLogLikelihood	BayesianNetwork Log Likelihood	nan	-inf	0	MAXIMIZE
LogisticParentChildDetection	LogisticRegression Detection	0.619444	0	1	MAXIMIZE
SVCParentChildDetection	SVC Detection	0.916667	0	1	MAXIMIZE

What's next?

If you want to read more about each individual metric, please visit the following folders:

Single Column Metrics: sdmetrics/single_column
Single Table Metrics: sdmetrics/single_table
Multi Table Metrics: sdmetrics/multi_table
Time Series Metrics: sdmetrics/timeseries

The Synthetic Data Vault

This repository is part of The Synthetic Data Vault Project

Website: https://sdv.dev
Documentation: https://sdv.dev/SDV

Comments

Gh stronger detection classifiers

Add Random Forest and Gradient Boosting from sklearn to the single table detection tests. Being able to fool these classifiers would be a great improvement for generative models.

opened by TanguyUrvoy 7

README.md example has a bug

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDMetrics version: 0.7.0

Python version: 3.8

Operating System: Linux VM

pandas version: 1.2.4

Error Description

Following the README.md example of calculating the BoundaryAdherence:

Running the code gives the following error: AttributeError: 'Series' object has no attribute 'columns'

I would expect the code to compute the BoundaryAdherence for the start_date column of the real_data and synthetic_data pandas DataFrames.

Steps to reproduce

Follow this snippet from the README.md that shows the usage of BoundaryAdherence

# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_table import BoundaryAdherence

BoundaryAdherence.compute(
    real_data['start_date'],
    synthetic_data['start_date']
)

Solution

I will open a PR for this later today or tomorrow.

type(real_data['start_date'])

Returns a pandas.core.series.Series

type(real_data[['start_date']])

Returns a pandas.core.frame.DataFrame

BoundaryAdherence.compute() expects DataFrames for both the real_data and synthetic_data arguments.

# calculate whether the synthetic data respects the min/max bounds
# set by the real data
from sdmetrics.single_table import BoundaryAdherence

BoundaryAdherence.compute(
    real_data[['start_date']],
    synthetic_data[['start_date']]
)

documentation resolution:resolved

opened by Pverheijen 5

KeyError: 'fields' , --> reports.generate 'fields'

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDMetrics version:'0.7.0'
Python version: 3.8.8
Operating System: Windows server 2016 dataserver

Error Description

Steps to reproduce

I tried to use sdmetrics to evaluate the synthetic data. Example of the dataset:

Customer | State | Customer Lifetime Value | Response | Coverage | Education | Effective To Date | EmploymentStatus | Gender | Income | ... | Months Since Policy Inception | Number of Open Complaints | Number of Policies | Policy Type | Policy | Renew Offer Type | Sales Channel | Total Claim Amount | Vehicle Class | Vehicle Size
-- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --
BU79786 | Washington | 2763.519279 | No | Basic | Bachelor | 2/24/11 | Employed | F | 56274 | ... | 5 | 0 | 1 | Corporate Auto | Corporate L3 | Offer1 | Agent | 384.811147 | Two-Door Car | Medsize
QZ44356 | Arizona | 6979.535903 | No | Extended | Bachelor | 1/31/11 | Unemployed | F | 0 | ... | 42 | 0 | 8 | Personal Auto | Personal L3 | Offer3 | Agent | 1131.464935 | Four-Door Car | Medsiz

I loaded the Gaussian copula pkl model to generate the synthetic data and created metadata for the original dataset to use in the quality report. The metadata created for the data above is:

{'fields': {'Customer': {'type': 'id', 'subtype': 'string'}, 'State': {'type': 'categorical'}, 'Customer Lifetime Value': {'type': 'numerical', 'subtype': 'float'}, 'Response': {'type': 'categorical'}, 'Coverage': {'type': 'categorical'}, 'Education': {'type': 'categorical'}, 'Effective To Date': {'type': 'categorical'}, 'EmploymentStatus': {'type': 'categorical'}, 'Gender': {'type': 'categorical'}, 'Income': {'type': 'numerical', 'subtype': 'integer'}, 'Location Code': {'type': 'categorical'}, 'Marital Status': {'type': 'categorical'}, 'Monthly Premium Auto': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Last Claim': {'type': 'numerical', 'subtype': 'integer'}, 'Months Since Policy Inception': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Open Complaints': {'type': 'numerical', 'subtype': 'integer'}, 'Number of Policies': {'type': 'numerical', 'subtype': 'integer'}, 'Policy Type': {'type': 'categorical'}, 'Policy': {'type': 'categorical'}, 'Renew Offer Type': {'type': 'categorical'}, 'Sales Channel': {'type': 'categorical'}, 'Total Claim Amount': {'type': 'numerical', 'subtype': 'float'}, 'Vehicle Class': {'type': 'categorical'}, 'Vehicle Size': {'type': 'categorical'}}, 'primary_key': 'Customer'}

#creating metadata
metadata.add_table(
    name='INS',
    data=data2,
    primary_key='Customer'
     )


report = QualityReport()
report.generate(data2, synthetic_data,metadata)


error:
Creating report:   0%|          | 0/4 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [54], in <cell line: 2>()
      1 report = QualityReport()
----> 2 report.generate(data2, synthetic_data,metadata)

File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\reports\single_table\quality_report.py:72, in QualityReport.generate(self, real_data, synthetic_data, metadata)
     70 for metric in tqdm.tqdm(metrics, desc='Creating report'):
     71     try:
---> 72         self._metric_results[metric.__name__] = metric.compute_breakdown(
     73             real_data, synthetic_data, metadata)
     74     except IncomputableMetricError:
     75         # Metric is not compatible with this dataset.
     76         self._metric_results[metric.__name__] = {}

File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:147, in MultiSingleColumnMetric.compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
    123 @classmethod
    124 def compute_breakdown(cls, real_data, synthetic_data, metadata=None, **kwargs):
    125     """Compute this metric broken down by column.
    126 
    127     This is done by computing the underlying SingleColumn metric to all the
   (...)
    145             A mapping of column name to metric output.
    146     """
--> 147     return cls._compute(
    148         cls, real_data, synthetic_data, metadata, store_errors=True, **kwargs)

File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\multi_single_column.py:68, in MultiSingleColumnMetric._compute(self, real_data, synthetic_data, metadata, store_errors, **kwargs)
     43 def _compute(self, real_data, synthetic_data, metadata=None, store_errors=False, **kwargs):
     44     """Compute this metric for all columns.
     45 
     46     This is done by computing the underlying SingleColumn metric to all the
   (...)
     66             A mapping of column name to metric output.
     67     """
---> 68     real_data, synthetic_data, metadata = self._validate_inputs(
     69         real_data, synthetic_data, metadata)
     71     fields = self._select_fields(metadata, self.field_types)
     72     invalid_cols = set(metadata['fields'].keys()) - set(fields)

File ~\.conda\envs\SDVENV\lib\site-packages\sdmetrics\single_table\base.py:119, in SingleTableMetric._validate_inputs(cls, real_data, synthetic_data, metadata)
    116 if not isinstance(metadata, dict):
    117     metadata = metadata.to_dict()
--> 119 fields = metadata['fields']
    120 for column in real_data.columns:
    121     if column not in fields:

KeyError: 'fields'
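The traceback shows the validation step converting non-dict metadata with `to_dict()` and then reading `metadata['fields']`, so the report needs a single-table dict with a top-level `fields` key. A minimal sketch of a pre-flight check follows. The helper `check_single_table_metadata` is hypothetical, and the `{'tables': ...}` layout in the usage note is only an assumed example of a dict without that key, not something the thread confirms.

```python
def check_single_table_metadata(metadata, columns):
    """Raise a descriptive error if metadata lacks the single-table layout."""
    # Objects that are not plain dicts are converted first, as in the traceback.
    if not isinstance(metadata, dict):
        metadata = metadata.to_dict()

    if 'fields' not in metadata:
        raise ValueError(
            "Expected a single-table metadata dict with a top-level 'fields' key, "
            f"but found keys: {sorted(metadata)}"
        )

    missing = [column for column in columns if column not in metadata['fields']]
    if missing:
        raise ValueError(f'Columns missing from metadata: {missing}')

    return metadata


# A trimmed version of the metadata quoted in this report.
single_table = {
    'fields': {
        'Customer': {'type': 'id', 'subtype': 'string'},
        'State': {'type': 'categorical'},
    },
    'primary_key': 'Customer',
}
check_single_table_metadata(single_table, ['Customer', 'State'])
```

Calling the helper with a dict such as `{'tables': {'INS': single_table}}` raises a `ValueError` that lists the keys actually found, which is easier to act on than a bare `KeyError`.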

bug resolution:WAI

opened by ketandaryanani 3

reports.generate 'fields' index issue

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDMetrics version: '0.7.0'
Python version: 3.9.13
Operating System: Windows under WSL2--Ubuntu

Error Description

Attempting to generate a report with the following commands:

report = QualityReport()
report.generate(x_y_df, synth_data, meta_data)

Steps to reproduce

x_y_df.to_json(reports_path + '/' + 'xy_df.json')

with open(reports_path + '/' + 'xy_df.json') as f:
    meta_data = json.load(f)

# Initialize report
report = QualityReport()
report.generate(x_y_df, synth_data, meta_data)

Traceback:

File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/reports/single_table/quality_report.py", line 72, in generate
    self._metric_results[metric.__name__] = metric.compute_breakdown(
  File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 147, in compute_breakdown
    return cls._compute(
  File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/multi_single_column.py", line 68, in _compute
    real_data, synthetic_data, metadata = self._validate_inputs(
  File "/home/bdeck8317/miniconda3/lib/python3.9/site-packages/sdmetrics/single_table/base.py", line 119, in _validate_inputs
    fields = metadata['fields']
KeyError: 'fields'
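In the snippet above, `meta_data` is the JSON dump of the data itself, which by default is keyed by column name and has no `fields` entry. The sketch below shows one way to build a dict of the expected shape from plain column values. The helper `infer_fields_metadata` is hypothetical, and the type names are taken from the metadata examples quoted elsewhere on this page. Real datasets would still need manual review for ids, datetimes and similar columns.

```python
def infer_fields_metadata(columns):
    """Build a {'fields': ...} dict from a mapping of column name to values."""

    def is_int(value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)

    def is_number(value):
        return isinstance(value, (int, float)) and not isinstance(value, bool)

    fields = {}
    for name, values in columns.items():
        present = [value for value in values if value is not None]
        if present and all(is_int(value) for value in present):
            fields[name] = {'type': 'numerical', 'subtype': 'integer'}
        elif present and all(is_number(value) for value in present):
            fields[name] = {'type': 'numerical', 'subtype': 'float'}
        else:
            # Anything else is treated as categorical in this sketch.
            fields[name] = {'type': 'categorical'}
    return {'fields': fields}


# Hypothetical toy columns.
toy_columns = {'x': [1, 2, 3], 'y': [0.5, 1.5, None], 'label': ['a', 'b', 'a']}
meta_data = infer_fields_metadata(toy_columns)
```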

Thanks, Ben

bug resolution:WAI

opened by bdeck8317 3

NewRowSynthesis: ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval

Environment Details

SDV version: sdv==0.17.1
Python version: Python 3.9.13
Operating System: Linux

Error Description

pandas==1.4.3

ValueError when running NewRowSynthesis

Steps to reproduce

from sdmetrics.single_table import NewRowSynthesis

metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)

model = GaussianCopula(
    primary_key="student_id"
)
model.fit(real_data)
synthetic_data = model.sample(250)

new_row_synthesis_score = NewRowSynthesis.compute(
    real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
)

ValueError                                Traceback (most recent call last)
Cell In [43], line 11
      8 model.fit(real_data)
      9 synthetic_data = model.sample(250)
---> 11 new_row_synthesis_score = NewRowSynthesis.compute(
     12     real_data=real_data, synthetic_data=synthetic_data, metadata=metadata_obj.to_dict()
     13 )

File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_table/new_row_synthesis.py:104, in NewRowSynthesis.compute(cls, real_data, synthetic_data, metadata, numerical_match_tolerance, synthetic_sample_size)
    101     row_filter.append(field_filter)
    103 try:
--> 104     matches = real_data.query(' and '.join(row_filter))
    105 except TypeError:
    106     if len(real_data) > 10000:

File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4111, in DataFrame.query(self, expr, inplace, **kwargs)
   4109 kwargs["level"] = kwargs.pop("level", 0) + 1
   4110 kwargs["target"] = None
-> 4111 res = self.eval(expr, **kwargs)
   4113 try:
   4114     result = self.loc[res]

File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/frame.py:4240, in DataFrame.eval(self, expr, inplace, **kwargs)
   4237     kwargs["target"] = self
...
    328     )
    329 engine = _check_engine(engine)
    330 _check_parser(parser)

ValueError: multi-line expressions are only valid in the context of data, use DataFrame.eval
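The traceback shows the metric joining per-field filters into a single `DataFrame.query` expression. One way such an expression can become multi-line is a string value that contains a line break, although the thread does not confirm the cause. As a hedged illustration of the idea behind the metric, the sketch below compares rows directly without building an expression string. The function `new_row_fraction` is hypothetical, and treating the tolerance as relative to the synthetic value is an assumption.

```python
def new_row_fraction(real_rows, synthetic_rows, numerical_match_tolerance=0.01):
    """Fraction of synthetic rows that match no real row."""
    numeric = (int, float)

    def values_match(real_value, synthetic_value):
        if isinstance(real_value, numeric) and isinstance(synthetic_value, numeric):
            # Assumption: the tolerance is relative to the synthetic value.
            allowed = abs(synthetic_value) * numerical_match_tolerance
            return abs(real_value - synthetic_value) <= allowed
        return real_value == synthetic_value

    def rows_match(real_row, synthetic_row):
        return all(
            values_match(real_row[key], synthetic_row[key]) for key in synthetic_row
        )

    new_rows = sum(
        1
        for synthetic_row in synthetic_rows
        if not any(rows_match(real_row, synthetic_row) for real_row in real_rows)
    )
    return new_rows / len(synthetic_rows)


# Hypothetical rows; note the line break inside one string value.
real_rows = [{'age': 30, 'city': 'Boston\nMA'}, {'age': 41, 'city': 'Austin'}]
synthetic_rows = [{'age': 30, 'city': 'Boston\nMA'}, {'age': 55, 'city': 'Austin'}]
score = new_row_fraction(real_rows, synthetic_rows)
```

Here the first synthetic row copies a real row and the second does not, so half of the synthetic rows are new.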

bug feature:metrics resolution:resolved

opened by darenr 2

BoundaryAdherence: ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Environment Details

SDV version: sdv==0.17.1
Python version: Python 3.9.13
Operating System: Linux

Error Description

pandas==1.4.3

ValueError when running BoundaryAdherence

Steps to reproduce

from sdmetrics.single_column import BoundaryAdherence

metadata_obj, real_data = load_tabular_demo("student_placements_pii", metadata=True)

model = GaussianCopula(
    primary_key="student_id"
)
model.fit(real_data)
synthetic_data = model.sample(250)

BoundaryAdherence.compute(
    real_data=real_data, synthetic_data=synthetic_data
)

ValueError                                Traceback (most recent call last)
Cell In [42], line 11
      8 model.fit(real_data)
      9 synthetic_data = model.sample(250)
---> 11 BoundaryAdherence.compute(
     12     real_data=real_data, synthetic_data=synthetic_data
     13 )

File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/sdmetrics/single_column/statistical/boundary_adherence.py:46, in BoundaryAdherence.compute(cls, real_data, synthetic_data)
     32 @classmethod
     33 def compute(cls, real_data, synthetic_data):
     34     """Compute the boundary adherence of two continuous columns.
     35 
     36     Args:
   (...)
     44             The boundary adherence of the two columns.
     45     """
---> 46     real_data = pd.Series(real_data).dropna()
     47     synthetic_data = pd.Series(synthetic_data).dropna()
     49     if is_datetime(real_data):

File ~/miniconda3/envs/mloperator/lib/python3.9/site-packages/pandas/core/series.py:367, in __init__(self, data, index, dtype, name, copy, fastpath)
    364         self.name = name
    365     return
...
   1528         f"The truth value of a {type(self).__name__} is ambiguous. "
   1529         "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1530     )

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
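The traceback shows `compute` wrapping each argument in `pd.Series`, so the single-column version of the metric expects one column per argument rather than whole tables. A minimal sketch of the underlying calculation on single columns follows. The function `boundary_adherence` and the toy values are hypothetical, and this is not the library's implementation.

```python
def boundary_adherence(real_column, synthetic_column):
    """Fraction of synthetic values inside the real column's [min, max] range."""
    real_values = [value for value in real_column if value is not None]
    synthetic_values = [value for value in synthetic_column if value is not None]

    lower, upper = min(real_values), max(real_values)
    in_bounds = sum(1 for value in synthetic_values if lower <= value <= upper)
    return in_bounds / len(synthetic_values)


# Two of the four synthetic values fall inside the real range [1, 10].
adherence = boundary_adherence([1, 5, 10], [0, 2, 9, 11])
```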

bug resolution:WAI feature:metrics

opened by darenr 2

Update README.md to fix a bug

BoundaryAdherence expects DataFrames for the real and synthetic dataframe arguments, currently it is receiving series due to accessing date_range with single square brackets instead of double square brackets.

opened by Pverheijen 2

Academic Paper to cite?

Is there any paper to cite related to this library? I am evaluating some synthetic data for a publication so I wanted to include your contribution among the citations.

Thanks, Andrea Galloni

question resolution:resolved

opened by andreagalloni92 2

SDMetrics 0.4.2 has incompatible copula version with SDV

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDMetrics version: 0.4.2
SDV version: 0.14.1
Python version: 3.8.10
Operating System: ubuntu server lts 20.4.04

Error Description

The latest SDMetrics version (which is installed by default when installing SDV) has incompatible copula requirements with downstream SDV.

Steps to reproduce

On a fresh virtual environment, install pip-tools.

Place the following on a file named requirements.in

sdv
#sdmetrics==0.4.1

Type the following commands

pip install -r requirements.in
pip-compile requirements.in

pip-compile reports:

Could not find a version that matches copulas<0.7,<0.8,>=0.6.1,>=0.7.0 (from sdv==0.14.1->-r requirements.txt (line 1))
Tried: 0.0.0, 0.0.0, 0.1.0, 0.1.0, 0.1.1, 0.1.1, 0.2.0, 0.2.0, 0.2.1, 0.2.1, 0.2.3, 0.2.3, 0.2.4, 0.2.4, 0.2.5, 0.2.5, 0.3.0, 0.3.0, 0.3.2, 0.3.2, 0.3.3, 0.3.3, 0.4.0, 0.4.0, 0.5.0, 0.5.0, 0.5.1, 0.5.1, 0.6.0, 0.6.0, 0.6.1, 0.6.1, 0.7.0, 0.7.0
Skipped pre-versions: 0.3.0.dev0, 0.3.0.dev0, 0.3.2.dev1, 0.3.2.dev1, 0.3.3.dev0, 0.3.3.dev0, 0.4.0.dev0, 0.4.0.dev0, 0.5.0.dev0, 0.5.0.dev0, 0.5.0.dev1, 0.5.0.dev1, 0.5.1.dev0, 0.5.1.dev0, 0.5.1.dev1, 0.5.1.dev1, 0.5.2.dev0, 0.5.2.dev0, 0.5.2.dev1, 0.5.2.dev1, 0.6.0.dev0, 0.6.0.dev0, 0.6.1.dev0, 0.6.1.dev0, 0.7.0.dev0, 0.7.0.dev0
There are incompatible versions in the resolved dependencies:
  copulas<0.8,>=0.7.0 (from sdmetrics==0.4.2->sdv==0.14.1->-r requirements.txt (line 1))
  copulas<0.7,>=0.6.1 (from sdv==0.14.1->-r requirements.txt (line 1))

From the setup.py of both projects, we can verify the above requirements.
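The conflict can be checked by hand: the two requirements describe version ranges that do not overlap. The sketch below makes that explicit with a small, hypothetical helper that treats each requirement as an inclusive lower bound and an exclusive upper bound. It handles only plain numeric versions and is not a replacement for a real resolver.

```python
def parse_version(text):
    """Turn '0.7' or '0.7.0' into a comparable 3-part tuple."""
    parts = [int(part) for part in text.split('.')]
    return tuple(parts + [0] * (3 - len(parts)))


def intersect_ranges(first, second):
    """Intersect two (inclusive lower, exclusive upper) version ranges."""
    lower = max(parse_version(first[0]), parse_version(second[0]))
    upper = min(parse_version(first[1]), parse_version(second[1]))
    # An empty intersection means no version can satisfy both requirements.
    return (lower, upper) if lower < upper else None


sdmetrics_needs = ('0.7.0', '0.8')  # copulas<0.8,>=0.7.0
sdv_needs = ('0.6.1', '0.7')        # copulas<0.7,>=0.6.1
print(intersect_ranges(sdmetrics_needs, sdv_needs))  # prints None
```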

pip install works correctly, but we get the following (snippet):

Collecting llvmlite<0.39,>=0.38.0rc1
  Using cached llvmlite-0.38.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.5 MB)
Collecting charset-normalizer~=2.0.0; python_version >= "3"
  Using cached charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
Collecting idna<4,>=2.5; python_version >= "3"
  Using cached idna-3.3-py3-none-any.whl (61 kB)
Collecting certifi>=2017.4.17
  Using cached certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
Collecting urllib3<1.27,>=1.21.1
  Using cached urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
ERROR: numba 0.55.1 has requirement numpy<1.22,>=1.18, but you'll have numpy 1.22.3 which is incompatible.
ERROR: rdt 0.6.4 has requirement scipy<1.8,>=1.5.4, but you'll have scipy 1.8.0 which is incompatible.
ERROR: sdmetrics 0.4.2 has requirement copulas<0.8,>=0.7.0, but you'll have copulas 0.6.1 which is incompatible.
Installing collected packages: tqdm, typing-extensions, torch, numpy, six, python-dateutil, pytz, pandas, deepecho, scipy, threadpoolctl, joblib, scikit-learn, llvmlite, numba, pyts, pyyaml, psutil, rdt, fonttools, cycler, pyparsing, packaging, pillow, kiwisolver, matplotlib, copulas, sdmetrics, charset-normalizer, idna, certifi, urllib3, requests, torchvision, ctgan, graphviz, text-unidecode, Faker, sdv

When uncommenting sdmetrics from requirements.in, both commands run "correctly".

Furthermore, when pip-compile and pip have cached sdmetrics==0.4.1, they both select that version instead and no error is shown.

The following file never compiles:

sdv==0.14.1
sdmetrics==0.4.2

I don't know what the appropriate solution to something like this would be. I'm not a library developer.

bug

opened by antheas 2

README doesn't accurately describe the output of `compute_metrics`

The current README doesn't print the latest output. More specifically, the command sdmetrics.compute_metrics(metrics, real_data, synthetic_data, metadata=metadata) currently doesn't print the same as what the README prints (e.g. the current code produces a column named error containing None values which the README doesn't have, as well as other changes).
documentation resolution:obsolete

opened by fealho 2

Relational `KSTest` crashes with `IncomputableMetricError` if a table doesn't have numerical columns

Environment Details

SDMetrics version: 0.4.1

Python version: 3.7

Error Description

The relational KSTest is supposed to run the KSTest on all numerical columns in all tables and return the average score.

However, this test crashes if it encounters a table that has no numerical columns. I expect this test to succeed as long as there is at least 1 numerical column in any of the tables.
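The expected behavior can be sketched as an average that skips tables which cannot be scored and fails only when no table can be scored. The exception class, the helper `average_over_tables`, the toy scores and the table names other than sessions are all hypothetical stand-ins, not the library's code.

```python
class IncomputableTableError(Exception):
    """Hypothetical stand-in for a per-table 'no usable columns' error."""


def average_over_tables(table_names, score_table):
    """Average per-table scores, skipping tables that cannot be scored."""
    scores = []
    for name in table_names:
        try:
            scores.append(score_table(name))
        except IncomputableTableError:
            # This table has no numerical columns; leave it out of the average.
            continue

    if not scores:
        raise IncomputableTableError('No table could be scored')
    return sum(scores) / len(scores)


def toy_score(name):
    # Illustrative per-table scores; 'sessions' has no numerical columns.
    scores = {'users': 0.75, 'transactions': 1.0}
    if name not in scores:
        raise IncomputableTableError(f'No numerical columns in {name}')
    return scores[name]


average = average_over_tables(['users', 'sessions', 'transactions'], toy_score)
```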

Steps to reproduce

Use the relational demo dataset and pass it in with the metadata.

from sdv.metrics.demos import load_multi_table_demo
from sdv.metrics.relational import KSTest

real_data, synthetic_data, metadata = load_multi_table_demo()
KSTest.compute(real_data, synthetic_data, metadata)

Output:

/usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/base.py in _select_fields(cls, metadata, types)
     78 
     79         if len(fields) == 0:
---> 80             raise IncomputableMetricError(f'Cannot find fields of types {types}')
     81 
     82         return fields

IncomputableMetricError: Cannot find fields of types ('numerical',)

I believe this is happening because the sessions table has no numerical columns. Interestingly, it does work if I exclude the metadata object, because then it starts assuming that the id field is a numerical column.

KSTest.compute(real_data, synthetic_data)
0.8555555555555556

bug

opened by npatki 2

Detection metrics should only use statistically modeled columns (filter out the rest)

Problem Description

The Detection metrics use machine learning to determine whether the real vs. synthetic data can be detected. For this to work, we should only be using columns that are statistically modeled.

Expected behavior

When running any of the detection metrics, the following columns should be ignored:

Primary keys

Foreign keys

Any other kinds of IDs

PII or sensitive data

Text data (or data created by RegEx)

None of these columns provide any useful information for detection.

The remaining data types are statistically modeled and should be included: numerical, datetime, categorical (non-PII), boolean
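A minimal sketch of the requested filtering, assuming a single-table metadata dict with a `fields` mapping like the ones quoted elsewhere on this page. The helper `select_modeled_columns`, the `pii` flag name, the `email` column and the `foreign_keys` argument are assumptions made for illustration only.

```python
MODELED_TYPES = {'numerical', 'datetime', 'categorical', 'boolean'}


def select_modeled_columns(metadata, foreign_keys=()):
    """Return the column names that a detection metric would keep."""
    primary_key = metadata.get('primary_key')
    kept = []
    for name, field in metadata['fields'].items():
        if name == primary_key or name in foreign_keys:
            continue
        if field.get('pii', False):  # assumed flag name for PII columns
            continue
        # Ids, text and any other unmodeled types are dropped here.
        if field.get('type') in MODELED_TYPES:
            kept.append(name)
    return kept


example_metadata = {
    'fields': {
        'Customer': {'type': 'id', 'subtype': 'string'},
        'State': {'type': 'categorical'},
        'Income': {'type': 'numerical', 'subtype': 'integer'},
        'email': {'type': 'categorical', 'pii': True},  # hypothetical column
    },
    'primary_key': 'Customer',
}
columns_for_detection = select_modeled_columns(example_metadata)
```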

Additional context

We already filtered out primary keys in #119. The issue of foreign keys is discussed in #285.
feature request

opened by npatki 0

Does removing foreign keys in detection metrics for multi-tables make sense?

Environment details

If you are already running SDMetrics, please indicate the following details about the environment in which you are running it:

SDMetrics version: 0.8.2

Python version:

Operating System:

Problem description

Nice correction for DetectionMetric (Solving primary_key use for detection metrics #251, https://github.com/sdv-dev/SDMetrics/pull/251). Removing the primary key from the table makes more sense for the evaluation. But what about the foreign keys present in a table of a relational database? Doesn't the same problem occur for foreign keys, so that they also need to be removed? Also, for the parent-child logistic detection metric, the foreign keys (referencing parent rows) in the child tables are no longer required when the denormalized tables are used.
question under discussion

opened by mohamedgy 1

Visualize cardinality of foreign key columns

I'm filing this issue on behalf of a user request on our Slack.

Problem Description

Currently, the single column visualization function (utils.get_column_plot) only supports columns that are numerical, categorical, boolean or datetime. It would be nice to support foreign keys as well.

Expected behavior

If I provide a foreign key name into the utils.get_column_plot function, then I expect to see a plot of real vs. synthetic data:

Compute the cardinality (# of children with the same parent) for the synthetic and real data. This will form 2 distributions for real & synthetic data

Plot those distributions similar to how we plot numerical data. The x-axis label should read "Cardinality".

This can be done from get_column_plot.

from sdmetrics.reports import utils

fig = utils.get_column_plot(
    real_data=real_table,
    synthetic_data=synthetic_table,
    column_name='foreign_key_column',
    metadata=my_table_metadata_dict
)
fig.show()
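The cardinality step described above can be sketched in a few lines. The helper `foreign_key_cardinality` is hypothetical, and counting parents with no children as 0 is an assumption about the desired behavior rather than something the request specifies.

```python
from collections import Counter


def foreign_key_cardinality(parent_ids, child_foreign_keys):
    """Count how many child rows reference each parent row."""
    counts = Counter(child_foreign_keys)
    # Assumption: parents with no children get a cardinality of 0.
    return [counts.get(parent_id, 0) for parent_id in parent_ids]


# Hypothetical parent ids and child foreign key values.
real_cardinality = foreign_key_cardinality([1, 2, 3], [1, 1, 2])
synthetic_cardinality = foreign_key_cardinality([1, 2, 3], [1, 2, 3])
```

The two resulting lists are the real and synthetic distributions that would then be plotted against a "Cardinality" x-axis.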
feature request
opened by npatki 1

Quality Report crashes when numerical column has only `NaN` values

Environment Details

SDMetrics version: 0.8.0
Python version: 3.7
Operating System: Linux

Error Description

A numerical column in the real data may contain missing values. Sometimes, the synthetic data may only produce these missing values and fail to create any numerical values. In such cases, the software crashes when I try to produce a quality report.

Expected Behavior: Certain metrics may not be computable if there are only NaN values. But instead of crashing the report, the error should be noted in the detailed breakdowns, and the report should still produce a score while ignoring the values (along with details, visualizations, etc.)
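The expected behavior amounts to catching per-metric failures, recording them in the breakdown, and averaging only the scores that could be computed. A minimal sketch follows. The helper `safe_breakdown` and the toy scorer are hypothetical and do not reflect how the report is implemented.

```python
def safe_breakdown(column_pairs, compute_pair_score):
    """Score each column pair, recording errors instead of raising them."""
    breakdown = {}
    for pair in column_pairs:
        try:
            breakdown[pair] = {'score': compute_pair_score(pair)}
        except Exception as error:
            # Keep going and note the failure in the detailed results.
            breakdown[pair] = {'error': f'{type(error).__name__}: {error}'}

    scores = [result['score'] for result in breakdown.values() if 'score' in result]
    overall = sum(scores) / len(scores) if scores else float('nan')
    return overall, breakdown


def toy_pair_score(pair):
    # Hypothetical scorer: fails for any pair that involves 'col2'.
    if 'col2' in pair:
        raise ValueError('x and y must have length at least 2.')
    return 0.5


overall, details = safe_breakdown(
    [('col1', 'col2'), ('col1', 'col3')], toy_pair_score
)
```

With this shape, the overall score is still produced from the pairs that succeeded, and the failing pair carries its error message in the details.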

Steps to reproduce

import numpy as np
import pandas as pd
from sdmetrics.reports.single_table import QualityReport

real_data = pd.DataFrame(data={
    'col1': [1, 2, 1, 3, 4],
    'col2': [2, 4, 1, 7, 1]
})

# 'col2' contains only NaN values
synthetic_data = pd.DataFrame(data={
    'col1': [1, 3, 2, 2, 1],
    'col2': [np.nan]*5
})

metadata = {
    'fields': {
        'col1': { 'type': 'numerical', 'subtype': 'integer' },
        'col2': { 'type': 'numerical', 'subtype': 'integer' }
    }
}

report = QualityReport()
report.generate(real_data, synthetic_data, metadata)

Output

Creating report:  50%|█████     | 2/4 [00:00<00:00, 106.06it/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-55-a6be2bd142ce> in <module>
     17 
     18 report = QualityReport()
---> 19 report.generate(real_data, synthetic_data, metadata)

3 frames
/usr/local/lib/python3.7/dist-packages/sdmetrics/reports/single_table/quality_report.py in generate(self, real_data, synthetic_data, metadata)
     71             try:
     72                 self._metric_results[metric.__name__] = metric.compute_breakdown(
---> 73                     real_data, synthetic_data, metadata)
     74             except IncomputableMetricError:
     75                 # Metric is not compatible with this dataset.

/usr/local/lib/python3.7/dist-packages/sdmetrics/single_table/multi_column_pairs.py in compute_breakdown(cls, real_data, synthetic_data, metadata, **kwargs)
    128             synthetic = synthetic_data[list(sorted_columns)]
    129             breakdown[sorted_columns] = cls.column_pairs_metric.compute_breakdown(
--> 130                 real, synthetic, **kwargs)
    131 
    132         return breakdown

/usr/local/lib/python3.7/dist-packages/sdmetrics/column_pairs/statistical/correlation_similarity.py in compute_breakdown(cls, real_data, synthetic_data, coefficient)
     83 
     84         correlation_real, _ = correlation_fn(real_data[column1], real_data[column2])
---> 85         correlation_synthetic, _ = correlation_fn(synthetic_data[column1], synthetic_data[column2])
     86 
     87         if np.isnan(correlation_real) or np.isnan(correlation_synthetic):

/usr/local/lib/python3.7/dist-packages/scipy/stats/stats.py in pearsonr(x, y)
   4014 
   4015     if n < 2:
-> 4016         raise ValueError('x and y must have length at least 2.')
   4017 
   4018     x = np.asarray(x)

ValueError: x and y must have length at least 2.

Note: It is OK that the correlation metric is crashing (correlation is undefined if there are no values). But the report should not crash.

bug feature:reports

opened by npatki 0

Releases(v0.8.1)

v0.8.1(Dec 10, 2022)
This release fixes bugs in the existing metrics and reports. We also make the reports compatible with future SDV versions.

New Features

Filter out additional sdtypes that will be available in future versions of SDV - Issue #265 by @katxiao

NewRowSynthesis should ignore PrimaryKey column - Issue #260 by @katxiao

Bug Fixes

Visualization crashes if there are metric errors - Issue #272 by @katxiao

Score for TVComplement if synthetic data only has missing values - Issue #271 by @katxiao

Fix 'timestamp' column metadata in the multi table demo - Issue #267 by @katxiao

Fix 'duration' column in the single table demo - Issue #266 by @katxiao

README.md example has a bug - Issue #262 by @katxiao

Update README.md to fix a bug - Issue #263 by @katxiao

Visualization get_column_pair_plot: update parameter name to column_names - Issue #258 by @katxiao

"Column Shapes" and "Column Pair Trends" Calculation Inconsistency - Issue #254 by @katxiao

Diagnostic Report missing RangeCoverage for numerical columns - Issue #255 by @katxiao

v0.8.0(Nov 16, 2022)
This release introduces the DiagnosticReport, which helps a user verify – at a quick glance – that their data is valid. We also fix an existing bug with detection metrics.

New Features

Fixes for new metadata - Issue #253 by @katxiao

Add default synthetic sample size to DiagnosticReport - Issue #248 by @katxiao

Exclude pii columns from single table metrics - Issue #245 by @katxiao

Accept both old and new metadata - Issue #244 by @katxiao

Address Diagnostic Report and metric edge cases - Issue #243 by @katxiao

Update visualization average per table - Issue #242 by @katxiao

Add save and load functionality to multi-table DiagnosticReport - Issue #218 by @katxiao

Visualization methods for the multi-table DiagnosticReport - Issue #217 by @katxiao

Add getter methods to multi-table DiagnosticReport - Issue #216 by @katxiao

Create multi-table DiagnosticReport - Issue #215 by @katxiao

Visualization methods for the single-table DiagnosticReport - Issue #211 by @katxiao

Add getter methods to single-table DiagnosticReport - Issue #210 by @katxiao

Create single-table DiagnosticReport - Issue #209 by @katxiao

Add save and load functionality to single-table DiagnosticReport - Issue #212 by @katxiao

Add single table diagnostic report - Issue #237 by @katxiao

Bug Fixes

Detection test doesn't look at metadata when determining which columns to use - Issue #119 by @R-Palazzo

Internal Improvements

Remove torch dependency - Issue #233 by @katxiao

Update README - Issue #250 by @katxiao

v0.7.0(Sep 27, 2022)
This release introduces the QualityReport, which evaluates how well synthetic data captures mathematical properties from the real data. The QualityReport incorporates the new metrics introduced in the previous release, and allows users to get detailed results, visualize the scores, and save the report for future viewing. We also add utility methods for visualizing columns and pairs of columns.

New Features

Catch TypeError in new row synthesis query - Issue #234 by @katxiao

Add NewRowSynthesis Metric - Issue #207 by @katxiao

Update plot utilities API - Issue #228 by @katxiao

Fix column pairs visualization bug - Issue #230 by @katxiao

Save version - Issue #229 by @katxiao

Update efficacy metrics API - Issue #227 by @katxiao

Add RangeCoverage Metric - Issue #208 by @katxiao

Add get_column_pairs_plot utility method - Issue #223 by @katxiao

Parse date as datetime - Issue #222 by @katxiao

Update error handling for reports - Issue #221 by @katxiao

Visualization API update - Issue #220 by @katxiao

Bug fixes for QualityReport - Issue #219 by @katxiao

Update column pair metric calculation - Issue #214 by @katxiao

Add get score methods for multi table QualityReport - Issue #190 by @katxiao

Add multi table QualityReport visualization methods - Issue #192 by @katxiao

Add plot_column visualization utility method - Issue #193 by @katxiao

Add save and load behavior to multi table QualityReport - Issue #188 by @katxiao

Create multi-table QualityReport - Issue #186 by @katxiao

Add single table QualityReport visualization methods - Issue #191 by @katxiao

Add save and load behavior to single table QualityReport - Issue #187 by @katxiao

Add get score methods for single table Quality Report - Issue #189 by @katxiao

Create single-table QualityReport - Issue #185 by @katxiao
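
The RangeCoverage and NewRowSynthesis metrics listed above can be illustrated with a short sketch in plain Python. It assumes that RangeCoverage measures how much of the real column's [min, max] range the synthetic column spans, and that NewRowSynthesis is the fraction of synthetic rows that do not appear in the real data. The function names are hypothetical, rows are matched exactly (the library may compare numerical values with a tolerance), and this is not the library's implementation.

```python
import math


def range_coverage(real, synthetic):
    """Share of the real column's [min, max] range spanned by the synthetic column."""
    real_min, real_max = min(real), max(real)
    width = real_max - real_min
    if width == 0:
        # Assumption: coverage is undefined when the real column is constant.
        return math.nan
    # Portion of the real range missing at the low end and at the high end.
    low_gap = max((min(synthetic) - real_min) / width, 0.0)
    high_gap = max((real_max - max(synthetic)) / width, 0.0)
    return max(1.0 - (low_gap + high_gap), 0.0)


def new_row_synthesis(real_rows, synthetic_rows):
    """Fraction of synthetic rows that have no exact match among the real rows."""
    real_set = set(real_rows)
    new_rows = sum(1 for row in synthetic_rows if row not in real_set)
    return new_rows / len(synthetic_rows)


real_ages = [0, 4, 7, 10]
synthetic_ages = [2, 5, 8]
print(range_coverage(real_ages, synthetic_ages))

real_rows = [(1, "a"), (2, "b")]
synthetic_rows = [(1, "a"), (3, "c"), (4, "d"), (2, "b")]
print(new_row_synthesis(real_rows, synthetic_rows))
```

In both cases a score of 1 is the best outcome: the synthetic column covers the whole real range, or every synthetic row is new.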

Internal Improvements

Auto apply "new" label instead of "pending review" - Issue #164 by @katxiao

fix typo - Issue #195 by @fealho

v0.6.0(Aug 12, 2022)
This release removes SDMetrics' dependency on the RDT library, and also introduces new quality and diagnostic metrics. Additionally, we introduce a new compute_breakdown method that returns a breakdown of metric results.

New Features

Handle null values correctly - Issue #194 by @katxiao

Add wrapper classes for new single and multi table metrics - Issue #169 by @katxiao

Add CorrelationSimilarity metric - Issue #143 by @katxiao

Add CardinalityShapeSimilarity metric - Issue #160 by @katxiao

Add CardinalityStatisticSimilarity metric - Issue #145 by @katxiao

Add ContingencySimilarity Metric - Issue #159 by @katxiao

Add TVComplement metric - Issue #142 by @katxiao

Add MissingValueSimilarity metric - Issue #139 by @katxiao

Add CategoryCoverage metric - Issue #140 by @katxiao

Add compute breakdown column for single column - Issue #152 by @katxiao

Add BoundaryAdherence metric - Issue #138 by @katxiao

Get KSComplement Score Breakdown - Issue #130 by @katxiao

Add StatisticSimilarity Metric - Issue #137 by @katxiao

New features for KSTest.compute - Issue #129 by @amontanez24
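
To make two of the metrics listed above concrete, here is a minimal sketch in plain Python. It assumes that TVComplement is one minus the total variation distance between category frequencies, and that KSComplement is one minus the two-sample Kolmogorov-Smirnov statistic. The names `tv_complement`, `ks_complement`, and `column_breakdown` are hypothetical helpers, not the SDMetrics API, and the per-column dictionary only illustrates the idea of a breakdown; it is not guaranteed to match what compute_breakdown returns.

```python
from bisect import bisect_right
from collections import Counter


def tv_complement(real, synthetic):
    """1 - total variation distance between two categorical columns."""
    real_counts, synth_counts = Counter(real), Counter(synthetic)
    categories = set(real_counts) | set(synth_counts)
    distance = 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synthetic))
        for c in categories
    )
    return 1.0 - distance


def ks_complement(real, synthetic):
    """1 - two-sample Kolmogorov-Smirnov statistic for numerical columns."""
    real_sorted, synth_sorted = sorted(real), sorted(synthetic)
    # The largest gap between the two empirical CDFs occurs at an observed value.
    statistic = max(
        abs(
            bisect_right(real_sorted, x) / len(real_sorted)
            - bisect_right(synth_sorted, x) / len(synth_sorted)
        )
        for x in real_sorted + synth_sorted
    )
    return 1.0 - statistic


def column_breakdown(real_table, synthetic_table, metrics):
    """Hypothetical breakdown: one score per column, keyed by column name."""
    return {
        column: {"score": metric(real_table[column], synthetic_table[column])}
        for column, metric in metrics.items()
    }


# Toy tables stored as dicts of columns (made-up values for illustration).
real = {"color": ["red", "red", "blue", "blue"], "age": [21, 35, 35, 60]}
synthetic = {"color": ["red", "red", "red", "blue"], "age": [21, 35, 35, 60]}
breakdown = column_breakdown(
    real, synthetic, {"color": tv_complement, "age": ks_complement}
)
print(breakdown)
```

Both scores lie between 0 and 1, and values close to 1 indicate that the synthetic column's distribution is close to the real one.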

Internal Improvements

Add integration tests and fixes - Issue #183 by @katxiao

Remove rdt hypertransformer dependency in timeseries metrics - Issue #176 by @katxiao

Replace rdt LabelEncoder with sklearn - Issue #178 by @katxiao

Remove rdt as a dependency - Issue #182 by @katxiao

Use sklearn's OneHotEncoder instead of rdt - Issue #170 by @katxiao

Remove KSTestExtended - Issue #180 by @katxiao

Remove TSFClassifierEfficacy and TSFCDetection metrics - Issue #171 by @katxiao

Update the default tags for a feature request - Issue #172 by @katxiao

Bump github macos version - Issue #174 by @katxiao

Fix pydocstyle to check sdmetrics - Issue #153 by @pvk-developer

Update the RDT version to 1.0 - Issue #150 by @pvk-developer

Update slack invite link - Issue #132 by @pvk-developer

v0.5.0(May 10, 2022)
This release fixes an error where the relational KSTest crashes if a table doesn't have numerical columns. It also includes some housekeeping, updating the pomegranate and copulas version requirements.

Issues closed

Cap pomegranate to <0.14.7 - Issue #116 by @csala

Relational KSTest crashes with IncomputableMetricError if a table doesn't have numerical columns - Issue #109 by @katxiao

v0.4.1(Dec 9, 2021)
This release improves the handling of metric errors, and updates the default transformer behavior used in SDMetrics.

Issues closed

Report metric errors from compute_metrics - Issue #107 by @katxiao

Specify default categorical transformers - Issue #105 by @katxiao

v0.4.0(Nov 16, 2021)
This release adds support for Python 3.9, updates dependencies to ensure compatibility with the rest of the SDV ecosystem, and upgrades to the latest RDT release.

Issues closed

Replace sktime for pyts - Issue #103 by @pvk-developer

Add support for Python 3.9 - Issue #102 by @pvk-developer

Increase code style lint - Issue #80 by @fealho

Add pip check to CI workflows - Issue #79 by @pvk-developer

Upgrade dependency ranges - Issue #69 by @katxiao

v0.3.2(Aug 17, 2021)
This release makes pomegranate an optional dependency.

Issues closed

Make pomegranate an optional dependency - Issue #63 by @fealho

v0.3.1(Jul 12, 2021)
This release fixes a bug to make the privacy metrics available in the API docs. It also updates dependencies to ensure compatibility with the rest of the SDV ecosystem.

Issues closed

CategoricalSVM not being imported - Issue #65 by @csala

v0.3.0(Mar 31, 2021)
This release includes privacy metrics to evaluate whether the real data could be obtained or deduced from the synthetic samples. Additionally, all the metrics have a normalize method, which takes the raw_score generated by the metric and returns a value between 0 and 1.

Issues closed

Add normalize method to metrics - Issue #51 by @csala and @fealho

Implement privacy metrics - Issue #36 by @ZhuofanXie and @fealho
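
As an illustration of what normalizing a raw score can look like, the sketch below maps a score onto the 0 to 1 range. The release note does not describe the formula, so everything here is an assumption: the function name `normalize_score` and its parameters `min_value`, `max_value`, and `maximize` are hypothetical, bounded scores use min-max scaling, and unbounded scores are squashed with tanh. It should not be read as the library's normalize implementation.

```python
import math


def normalize_score(raw_score, min_value, max_value, maximize=True):
    """Map a raw metric score into [0, 1], where 1 is always the best score."""
    if not min_value <= raw_score <= max_value:
        raise ValueError("raw_score must lie between min_value and max_value")

    if math.isfinite(min_value) and math.isfinite(max_value):
        # Bounded range: plain min-max scaling (assumes min_value < max_value).
        score = (raw_score - min_value) / (max_value - min_value)
    elif math.isfinite(min_value):
        # Bounded below only: squash [min_value, inf) into [0, 1].
        score = math.tanh(raw_score - min_value)
    elif math.isfinite(max_value):
        # Bounded above only: squash (-inf, max_value] into [0, 1].
        score = 1.0 + math.tanh(raw_score - max_value)
    else:
        # Unbounded: logistic function, written with tanh to avoid overflow.
        score = 0.5 * (1.0 + math.tanh(raw_score / 2.0))

    # For metrics where lower raw scores are better, flip the scale.
    return score if maximize else 1.0 - score


print(normalize_score(5, 0, 10))
print(normalize_score(2, 0, 10, maximize=False))
print(normalize_score(0.0, -math.inf, math.inf))
```

The `maximize` flag lets metrics whose raw score should be minimized report on the same scale as metrics whose raw score should be maximized.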

v0.2.0(Feb 24, 2021)

Dependency upgrades to ensure compatibility with the rest of the SDV ecosystem.
v0.1.3(Feb 15, 2021)
Updates the required dependencies to facilitate a conda release.

Issues closed

Upgrade sktime - Issue #49 by @fealho

v0.1.2(Jan 27, 2021)
Bug-fixing release that addresses several minor errors.

Issues closed

More splits than classes - Issue #46 by @fealho

Scipy 1.6.0 causes an AttributeError - Issue #44 by @fealho

Time series metrics fails with variable length timeseries - Issue #42 by @fealho

ParentChildDetection metrics KeyError - Issue #39 by @csala

v0.1.1(Dec 30, 2020)
This version adds Time Series Detection and Efficacy metrics, as well as a fix to ensure that Single Table binary classification efficacy metrics work correctly with binary targets that are not boolean.

Issues closed

Timeseries efficacy metrics - Issue #35 by @csala

Timeseries detection metrics - Issue #34 by @csala

Ensure binary classification targets are bool - Issue #33 by @csala

v0.1.0(Dec 18, 2020)
This release introduces a new project organization in which metrics are grouped by data modality and share a common API:

Single Column

Column Pair

Single Table

Multi Table

Time Series

Within each data modality, different families of metrics have been implemented:

Statistical

Detection

Bayesian Network and Gaussian Mixture Likelihood

Machine Learning Efficacy

v0.0.4(Nov 27, 2020)

Patch release to relax dependencies and avoid conflicts when using the latest SDV version.
v0.0.3(Nov 20, 2020)
Fixes an error in detection metrics when the input data contains infinity or NaN values.

Issues closed

ValueError: Input contains infinity or a value too large for dtype('float64') - Issue #11 by @csala
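
The release note does not say how the fix works. As a general illustration of the problem, the sketch below shows one common way to make a numerical column safe for a classifier that rejects infinity and NaN: replace those entries with a fill value. The helper `replace_non_finite` and its mean-based default are assumptions made for this example, not the approach used in the library.

```python
import math


def replace_non_finite(values, fill_value=None):
    """Return a copy of `values` with NaN and +/-infinity replaced by a fill value."""
    finite = [v for v in values if math.isfinite(v)]
    if fill_value is None:
        # Assumption: default to the mean of the finite entries (0.0 if there are none).
        fill_value = sum(finite) / len(finite) if finite else 0.0
    return [v if math.isfinite(v) else fill_value for v in values]


column = [1.0, math.inf, 3.0, math.nan]
print(replace_non_finite(column))
```

Applying the same replacement to both the real and the synthetic table keeps the two inputs comparable before they are passed to a detection model.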

v0.0.2(Aug 8, 2020)

Add support for Python 3.8 and a broader range of dependencies.
v0.0.1(Jun 26, 2020)

First release to PyPI