Official git for "CTAB-GAN: Effective Table Data Synthesizing"

Overview

CTAB-GAN

This is the official git repository for the paper CTAB-GAN: Effective Table Data Synthesizing. The paper was published at the Asian Conference on Machine Learning (ACML 2021). Please check the PDF on the PMLR website for the newest version of the paper, which adds more content on the time consumption analysis of training CTAB-GAN. If you have any questions, please contact [email protected] for more information.

Example

Experiment_Script_Adult.ipynb is an example notebook for training CTAB-GAN with the Adult dataset. The dataset is already under the Real_Datasets folder. The evaluation code is also provided.

For large datasets

If your dataset has a large number of columns, the current code may be unable to encode all of your data, because CTAB-GAN wraps the encoded data into an image-like format. To work around this, change lines 341 and 348 in model/synthesizer/ctabgan_synthesizer.py. Each number in the sides list

sides = [4, 8, 16, 24, 32]

is a side size of the image. You can extend the list to [4, 8, 16, 24, 32, 64] or [4, 8, 16, 24, 32, 64, 128] to accept larger datasets.
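As a rough illustration of why the list matters, the sketch below picks the smallest side whose square grid can hold every encoded column. The helper name `pick_side` and the selection rule are assumptions made for illustration, not a copy of the code in ctabgan_synthesizer.py.

```python
def pick_side(n_encoded_columns, sides=(4, 8, 16, 24, 32)):
    """Return the smallest side whose side*side grid fits all encoded columns.

    Hypothetical helper: the name and the selection rule are assumptions.
    """
    for side in sides:
        if side * side >= n_encoded_columns:
            return side
    raise ValueError(
        f"{n_encoded_columns} encoded columns do not fit in a "
        f"{sides[-1]}x{sides[-1]} grid; extend the sides list"
    )


# With the default list, 32 * 32 = 1024 encoded columns is the upper limit.
print(pick_side(150))                                   # 16, since 16 * 16 = 256
print(pick_side(1500, sides=(4, 8, 16, 24, 32, 64)))    # 64, since 64 * 64 = 4096
```

Under this assumption, a dataset whose encoded width exceeds 1024 columns would need at least the 64 entry added to the list.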

Bibtex

To cite this paper, you can use this BibTeX entry:

@InProceedings{zhao21,
  title = 	 {CTAB-GAN: Effective Table Data Synthesizing},
  author =       {Zhao, Zilong and Kunar, Aditya and Birke, Robert and Chen, Lydia Y.},
  booktitle = 	 {Proceedings of The 13th Asian Conference on Machine Learning},
  pages = 	 {97--112},
  year = 	 {2021},
  editor = 	 {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume = 	 {157},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {https://proceedings.mlr.press/v157/zhao21a/zhao21a.pdf},
  url = 	 {https://proceedings.mlr.press/v157/zhao21a.html}
}


Comments
  • Not working with Tabular Dataset

    Hi,

    Good day to you Zhao Zilong.

    I tried using CTAB_GAN to generate some fake data but I couldn't have a smooth generation. I used this:

    synthesizer = CTABGAN(raw_csv_path = real_path, test_ratio = 0.20,
    categorical_columns = ['Target'], log_columns = [], mixed_columns= {}, integer_columns = ['Sport', 'TotPkts','TotBytes', 'SrcPkts','DstPkts','SrcBytes','Target'], problem_type= {"Classification": 'Target'}, epochs = 10)

    I have the following errors:

    AttributeError Traceback (most recent call last)
    AttributeError: 'str' object has no attribute 'rint'

    TypeError: loop of ufunc does not support argument 0 of type str which has no callable rint method

    I used the exact numpy version specified (1.21.0).

    Please help me check what I am missing. It worked fine with the Adult data.

    opened by vicjoy 4
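One possible cause, stated here as an assumption and not as a diagnosis, is that a column listed in integer_columns holds strings (for example text labels in Target, or a port stored as text), so NumPy's rounding fails on it. A minimal pandas sketch for spotting such columns before training follows; the helper name `non_numeric_columns` and the toy data are hypothetical.

```python
import pandas as pd


def non_numeric_columns(df, columns):
    """Map each column to the values that cannot be parsed as numbers."""
    bad = {}
    for col in columns:
        parsed = pd.to_numeric(df[col], errors="coerce")
        # Values that were present but became NaN after parsing are non-numeric.
        mask = parsed.isna() & df[col].notna()
        if mask.any():
            bad[col] = sorted(df.loc[mask, col].astype(str).unique())
    return bad


# Toy data standing in for the real CSV.
df = pd.DataFrame({
    "Sport": ["80", "443", "http"],
    "TotPkts": [10, 12, 7],
    "Target": ["Botnet", "Normal", "Normal"],
})
report = non_numeric_columns(df, ["Sport", "TotPkts", "Target"])
print(report)
```

Any column that shows up in the report would need to be cleaned, label-encoded, or declared as categorical only before it is passed as an integer column.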
  • Generating data without problem type

    Hi there,

    I wanted to use this repo to generate fake census data for upsampling my microdata. However, I am confused by the "problem_type" part. I checked the repo and it seems it does not work without specifying an ML problem, which I do not have yet. Still, I tried this code:

    synthesizer = CTABGAN('/Users/erensmacbook/Desktop/hay.csv', test_ratio = 0.20,
    categorical_columns = ['INDP','AGEP','INCP'], epochs = 1, problem_type= {"REGRESSION": 'SEXP'})

    Here the categorical variables are income, age, job and sex groups. It seemed to work, but then I got this error when I tried to generate sample data:

    ~/ctab/model/ctabgan.py in generate_samples(self)
         66
         67         sample = self.synthesizer.sample(len(self.raw_df))
    ---> 68         sample_df = self.data_prep.inverse_prep(sample)
         69
         70         return sample_df

    KeyError: 'age'

    However, I don't have an age column in my dataset. Given this, is it not possible to use this repo with other datasets?

    Thanks

    opened by erenarkangel 4
  • TypeError: __init__() takes 1 positional argument but 2 positional arguments (and 5 keyword-only arguments) were given

    Hi,

    thanks for publishing the code! I somehow have a problem with the jupyter notebook example. I receive this error: TypeError: __init__() takes 1 positional argument but 2 positional arguments (and 5 keyword-only arguments) were given in the third cell. This is the complete traceback:

    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_52046/237735526.py in <module>
          9 
         10 for i in range(num_exp):
    ---> 11     synthesizer.fit()
         12     syn = synthesizer.generate_samples()
         13     syn.to_csv(fake_file_root+"/"+dataset+"/"+ dataset+"_fake_{exp}.csv".format(exp=i), index= False)
    
    /Josef/CTAB-GAN/model/ctabgan.py in fit(self)
         39         start_time = time.time()
         40         self.data_prep = DataPrep(self.raw_df,self.categorical_columns,self.log_columns,self.mixed_columns,self.integer_columns,self.problem_type,self.test_ratio)
    ---> 41         self.synthesizer.fit(train_data=self.data_prep.df, categorical = self.data_prep.column_types["categorical"], 
         42         mixed = self.data_prep.column_types["mixed"],type=self.problem_type)
         43         end_time = time.time()
    
    /Josef/CTAB-GAN/model/synthesizer/ctabgan_synthesizer.py in fit(self, train_data, categorical, mixed, type)
        331 
        332         self.transformer = DataTransformer(train_data=train_data, categorical_list=categorical, mixed_dict=mixed)
    --> 333         self.transformer.fit()
        334 
        335         train_data = self.transformer.transform(train_data.values)
    
    /Josef/CTAB-GAN/model/synthesizer/transformer.py in fit(self)
         57         for id_, info in enumerate(self.meta):
         58             if info['type'] == "continuous":
    ---> 59                 gm = BayesianGaussianMixture(
         60                     self.n_clusters,
         61                     weight_concentration_prior_type='dirichlet_process',
    

    I could not figure out what this error is due to, do you know why this happens?

    opened by Zepp3 3
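A likely explanation, offered as an assumption since the thread does not show the installed versions, is that recent scikit-learn releases accept the constructor arguments of BayesianGaussianMixture only as keywords, while transformer.py passes self.n_clusters positionally. A minimal sketch of the keyword form on toy data follows; n_clusters = 10 and the other settings are illustrative values, not necessarily those used in the repository.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

n_clusters = 10  # illustrative value

# Passing n_components by keyword works on both older and newer scikit-learn.
gm = BayesianGaussianMixture(
    n_components=n_clusters,
    weight_concentration_prior_type="dirichlet_process",
    weight_concentration_prior=0.001,
    max_iter=100,
    n_init=1,
    random_state=42,
)

rng = np.random.default_rng(0)
column = rng.normal(size=(200, 1))  # toy continuous column
gm.fit(column)
print(gm.weights_.shape)
```

The other option would be to install the scikit-learn version the repository was developed against.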
  • Datetime Object

    Hi Zilong,

    I'm using the CTAB-GAN to generate synthetic data. My dataset contains a datetime column with no missing values and I encounter the error message: could not convert string to float: '2010-04-01'. Could you please advise if the CTAB-GAN can handle datetime columns?

    Thank you for your time.

    opened by amieelxy 2
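The error suggests the column reaches the model as strings. One workaround, sketched here as an assumption and not as a feature of CTAB-GAN, is to convert the dates to a numeric column (days since a reference date) before training and convert the synthetic values back afterwards. The column name `date` and the reference date are hypothetical.

```python
import pandas as pd

REFERENCE = pd.Timestamp("1970-01-01")  # arbitrary reference date


def dates_to_days(series):
    """Convert date strings to integer days since REFERENCE."""
    return (pd.to_datetime(series) - REFERENCE).dt.days


def days_to_dates(series):
    """Convert (possibly fractional) day counts back to dates."""
    whole_days = series.round().astype("int64")
    return REFERENCE + pd.to_timedelta(whole_days, unit="D")


df = pd.DataFrame({"date": ["2010-04-01", "2010-04-15"]})
df["date_days"] = dates_to_days(df["date"])
restored = days_to_dates(df["date_days"]).dt.strftime("%Y-%m-%d")
print(df["date_days"].tolist())
print(restored.tolist())
```

The numeric column could then be treated like any other integer column, with the original date column dropped before training.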
  • Treating Continuous Variable with Missing Values as Mix Variable

    Hi,

    I would like to generate a synthetic dataset with your repo, and my continuous column contains missing values. Following your article, I treat the continuous column with missing values as a mixed variable, but the "mixed_columns" parameter of the CTABGAN object requires a "dictionary of column name and categorical modes used for "mix" of numeric and categorical distribution". I understand that for the "mortgage" case we should specify mode 0.0 to capture the special meaning of 0, but what mode should I specify for the missing values?

    Thank you!

    opened by amieelxy 2
  • Any way to generate multiple datasets with same learned model?

    Hi,

    So when running the CTAB-GAN code as in the example on this repo, CTAB-GAN learns the inputted dataset and generates a new fake dataset.

    Is there any way to run it so it learns the input dataset once, and then generates multiple fake datasets?

    The alternative is to run the example code every time, which means the code has to learn the input dataset every time.

    Kind regards,

    Fergal

    opened by femurray 2
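Judging from the tracebacks elsewhere in these threads, training happens in fit() and sampling in generate_samples(), so one approach is to call fit() once and generate_samples() several times. The sketch below uses a stand-in class so that it runs without the repository; `DummySynthesizer` and `generate_many` are hypothetical names, and whether repeated sampling behaves this way in CTAB-GAN itself is an assumption.

```python
import random


class DummySynthesizer:
    """Stand-in exposing the same two methods as the notebook's synthesizer."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.fit_calls = 0

    def fit(self):
        self.fit_calls += 1  # the expensive training step would happen here

    def generate_samples(self):
        return [random.random() for _ in range(self.n_rows)]


def generate_many(synthesizer, n_datasets):
    """Train once, then draw n_datasets synthetic datasets from the same model."""
    synthesizer.fit()
    return [synthesizer.generate_samples() for _ in range(n_datasets)]


synth = DummySynthesizer(n_rows=5)
datasets = generate_many(synth, n_datasets=3)
print(len(datasets), synth.fit_calls)  # 3 1
```

In the example notebook this would correspond to moving the fit() call out of the experiment loop and keeping only generate_samples() inside it.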
  • TypeError: __init__() takes 1 positional argument but 3 positional arguments (and 5 keyword-only arguments) were given

    For my project I remote control a PhD student's PC on campus as it is quicker than my own laptop. I do this so my code can run quicker.

    One issue is that CTAB-GAN runs fine (but slow) on my laptop, but can't get going on the PhD PC. I've made sure the python versions are the same, but the same error keeps getting thrown. The details are below.

    Traceback (most recent call last):

    Input In [6] in <cell line: 37> synthesizer.fit()

    File ~\Desktop\python\CTAB-GAN\model\ctabgan.py:43 in fit self.synthesizer.fit(train_data=self.data_prep.df, categorical = self.data_prep.column_types["categorical"],

    File ~\Desktop\python\CTAB-GAN\model\synthesizer\ctabgan_synthesizer.py:333 in fit self.transformer.fit()

    File ~\Desktop\python\CTAB-GAN\model\synthesizer\transformer.py:59 in fit gm = BayesianGaussianMixture(

    TypeError: __init__() takes 1 positional argument but 3 positional arguments (and 5 keyword-only arguments) were given

    When I google the error, it says to add the argument "self" to the function __init__(); however, all instances of __init__() in the transformer.py file already have "self" as an argument. Do you have any insight into this issue?

    opened by femurray 2
  • Standardizing data prior to use

    The first question I have is, would it be worth standardizing my datasets before using CTAB-GAN on them?

    I haven't done this so far as the CTAB-GAN tutorial didn't include standardization, but realized after running cWGAN code (which does use standardizing) that it could potentially speed up the time it takes to run CTAB-GAN on each dataset, among other benefits.

    Or would doing so mess up the running of CTAB-GAN?

    opened by femurray 2
  • TypeError: loop of ufunc does not support argument 0 of type float which has no callable log method

    Hi!

    I have another question about CTAB-GAN. Did you also encounter the error above while developing the model? I guess it's because some values in the dataset (I try to apply my own dataset to CTAB-GAN) are non-float types. But I transformed every numeric column to float now by hand to make sure this is not the case but the error still occurs. Maybe you had the same problem and know a solution from your experience?

    opened by Zepp3 2
  • Saving fake data to .csv in jupyter notebook

    Hi,

    The line syn.to_csv(fake_file_root+"/"+dataset+"/"+dataset+"_fake_{exp}.csv".format(exp=i), index= False) in the third cell of the Jupyter notebook example won't work if the directory Fake_Datasets/Adult has not been created yet (as is the case in the repo), because pandas does not create missing directories. Maybe create the directory in the repo so one can just clone it, or do something like

    import os

    outdir = 'Fake_Datasets/Adult'
    # Create the directory (and any missing parents) if it does not exist yet.
    os.makedirs(outdir, exist_ok=True)
    syn.to_csv(outdir+"/"+dataset+"_fake_{exp}.csv".format(exp=i), index= False)
    

    if I'm not wrong.

    Kind regards!

    opened by Zepp3 2
  • ImportError: cannot import name 'compute_associations' from 'dython.nominal'

    Hi, I found this error in evaluation.py. It turns out dython was updated, and compute_associations was removed.

    According to http://shakedzy.xyz/dython/modules/nominal/#compute_associations, the compute_associations is replaced by associations(compute_only=True)['corr'].

    I edited the code in evaluation.py locally, editing line 10 to import associations instead of compute_associations and replacing "real_corr = compute_associations(real, nominal_columns=cat_cols)" on line 110 with "real_corr = associations(real, nominal_columns=cat_cols, compute_only=True)['corr']".

    Line 112 was updated accordingly too.

    I think this fixes the issue, just wanted to flag it here in case this affects the code in ways the developers might notice but that I wouldn't.

    Thanks

    opened by femurray 0
  • How to specify decimal places in generated data

    My input data has integers and decimals with two places, but the generated data has eight decimal places, even when the input is an integer. Is there a way to specify the number of decimal places or the data type (decimal/integer), or do I need to do that manually after the data generation?

    opened by pgschr 3
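If the integer_columns argument of the constructor does not cover the case, rounding after generation is a straightforward fallback. A minimal pandas sketch follows; the helper name `round_columns`, the column names, and the `decimals` mapping are hypothetical.

```python
import pandas as pd


def round_columns(df, decimals):
    """Round each listed column; 0 decimal places also casts to integer."""
    out = df.copy()
    for col, places in decimals.items():
        out[col] = out[col].round(places)
        if places == 0:
            out[col] = out[col].astype("int64")
    return out


# Toy frame standing in for the generated data.
syn = pd.DataFrame({
    "price": [12.34567891, 7.00000012],
    "count": [3.00000004, 8.99999997],
})
rounded = round_columns(syn, {"price": 2, "count": 0})
print(rounded)
```

Columns that are not listed in the mapping are left unchanged.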
  • Add features

    I added two features. On one hand, one can now choose the number of samples he or she wants to generate. On the other hand, all hyperparameters can be set according to one's application and/or preferences.

    opened by Zepp3 0