Official git for "CTAB-GAN: Effective Table Data Synthesizing"



This is the official repository for the paper CTAB-GAN: Effective Table Data Synthesizing. The paper was published at the Asian Conference on Machine Learning (ACML 2021). Please check the PDF on the PMLR website for the newest version of the paper, which adds a time-consumption analysis of training CTAB-GAN. If you have any questions, please contact [email protected] for more information.


Experiment_Script_Adult.ipynb is an example notebook for training CTAB-GAN with the Adult dataset. The dataset is already in the Real_Datasets folder. The evaluation code is also provided.

For large datasets

If your dataset has a large number of columns, you may find that the current code cannot encode all of your data, because CTAB-GAN wraps the encoded data into an image-like format. To work around this, change lines 341 and 348 in model/synthesizer/. Each number in the sides list

sides = [4, 8, 16, 24, 32]

is the side length of the image. You can extend the list to [4, 8, 16, 24, 32, 64] or [4, 8, 16, 24, 32, 64, 128] to accept larger datasets.
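The relationship between the sides list and the width of a dataset can be sketched as follows. This is a minimal illustration, assuming the code picks the smallest side whose square is at least the number of encoded features. `pick_side` and the feature counts are hypothetical names and values, not part of the repository.

```python
from typing import Optional, Sequence


def pick_side(n_features: int, sides: Sequence[int]) -> Optional[int]:
    """Return the smallest side whose square image can hold n_features values.

    Assumption: the encoded row is padded into a side x side grid, so a side
    fits when side * side >= n_features. Returns None when no side is large enough.
    """
    for side in sorted(sides):
        if side * side >= n_features:
            return side
    return None


default_sides = [4, 8, 16, 24, 32]
print(pick_side(100, default_sides))          # 16, since 16 * 16 = 256 >= 100
print(pick_side(1500, default_sides))         # None, since 32 * 32 = 1024 < 1500
print(pick_side(1500, default_sides + [64]))  # 64, since 64 * 64 = 4096 >= 1500
```

Under this assumption, the default list covers up to 1024 encoded features, and adding 64 raises the limit to 4096.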


To cite this paper, you can use the following BibTeX entry:

  title = 	 {CTAB-GAN: Effective Table Data Synthesizing},
  author =       {Zhao, Zilong and Kunar, Aditya and Birke, Robert and Chen, Lydia Y.},
  booktitle = 	 {Proceedings of The 13th Asian Conference on Machine Learning},
  pages = 	 {97--112},
  year = 	 {2021},
  editor = 	 {Balasubramanian, Vineeth N. and Tsang, Ivor},
  volume = 	 {157},
  series = 	 {Proceedings of Machine Learning Research},
  month = 	 {17--19 Nov},
  publisher =    {PMLR},
  pdf = 	 {},
  url = 	 {}

  • Not working with Tabular Dataset

    Good day to you Zhao Zilong.

    I tried using CTAB_GAN to generate some fake data but I couldn't have a smooth generation. I used this:

    synthesizer = CTABGAN(raw_csv_path=real_path, test_ratio=0.20,
        categorical_columns=['Target'], log_columns=[], mixed_columns={},
        integer_columns=['Sport', 'TotPkts', 'TotBytes', 'SrcPkts', 'DstPkts', 'SrcBytes', 'Target'],
        problem_type={"Classification": 'Target'}, epochs=10)

    I have the following errors:

    AttributeError                            Traceback (most recent call last)
    AttributeError: 'str' object has no attribute 'rint'

    TypeError: loop of ufunc does not support argument 0 of type str which has no callable rint method

    I used the exact numpy 1.21.0 specified.

    Please help me check what I am missing. It worked fine with the Adult data.

    opened by vicjoy 4
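One possible cause of a `rint` error like the one above is a column listed in `integer_columns` that actually holds text, since rounding cannot be applied to strings. The check below is a minimal sketch under that assumption. `non_numeric_columns` and the sample rows are hypothetical and only illustrate how such columns could be found before training.

```python
from typing import Dict, List, Sequence


def non_numeric_columns(rows: Sequence[Dict[str, object]], columns: Sequence[str]) -> List[str]:
    """Return the columns (in the given order) that contain a value float() cannot parse."""
    bad = []
    for col in columns:
        for row in rows:
            try:
                float(row[col])
            except (TypeError, ValueError):
                bad.append(col)
                break
    return bad


# Hypothetical rows; the label values are made up for illustration.
rows = [
    {"Sport": "80", "TotPkts": "12", "Target": "Botnet"},
    {"Sport": "443", "TotPkts": "7", "Target": "Normal"},
]
print(non_numeric_columns(rows, ["Sport", "TotPkts", "Target"]))  # ['Target']
```

If a column is flagged, removing it from `integer_columns` (and keeping it only in `categorical_columns`) may resolve the error.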
  • Generating data without problem type

    Hi there,

    I wanted to use this repo to generate fake census data for upsampling my microdata. However, I am confused by the "problem_type" part. I checked the repo, and it seems it does not work without specifying an ML problem, which I do not have yet. Still, I tried this code:

    synthesizer = CTABGAN('/Users/erensmacbook/Desktop/hay.csv', test_ratio = 0.20,
    categorical_columns = ['INDP','AGEP','INCP'], epochs = 1, problem_type= {"REGRESSION": 'SEXP'})

    The categorical variables are income, age, job and sex groups. It seemed to work, but then I got this error when I tried to generate sample data:

    ~/ctab/model/ in generate_samples(self)
         66
         67         sample = self.synthesizer.sample(len(self.raw_df))
    ---> 68         sample_df = self.data_prep.inverse_prep(sample)
         69
         70         return sample_df

    KeyError: 'age'

    However, I don't have an 'age' column in my dataset. Given this, is it not possible to use this repo with other datasets?


    opened by erenarkangel 4
  • TypeError: __init__() takes 1 positional argument but 2 positional arguments (and 5 keyword-only arguments) were given

    Thanks for publishing the code! I have a problem with the Jupyter notebook example. In the third cell I receive this error: TypeError: __init__() takes 1 positional argument but 2 positional arguments (and 5 keyword-only arguments) were given. This is the complete traceback:

    TypeError                                 Traceback (most recent call last)
    /tmp/ipykernel_52046/ in <module>
         10 for i in range(num_exp):
    ---> 11
         12     syn = synthesizer.generate_samples()
         13     syn.to_csv(fake_file_root+"/"+dataset+"/"+ dataset+"_fake_{exp}.csv".format(exp=i), index= False)
    /Josef/CTAB-GAN/model/ in fit(self)
         39         start_time = time.time()
         40         self.data_prep = DataPrep(self.raw_df,self.categorical_columns,self.log_columns,self.mixed_columns,self.integer_columns,self.problem_type,self.test_ratio)
    ---> 41, categorical = self.data_prep.column_types["categorical"], 
         42         mixed = self.data_prep.column_types["mixed"],type=self.problem_type)
         43         end_time = time.time()
    /Josef/CTAB-GAN/model/synthesizer/ in fit(self, train_data, categorical, mixed, type)
        332         self.transformer = DataTransformer(train_data=train_data, categorical_list=categorical, mixed_dict=mixed)
    --> 333
        335         train_data = self.transformer.transform(train_data.values)
    /Josef/CTAB-GAN/model/synthesizer/ in fit(self)
         57         for id_, info in enumerate(self.meta):
         58             if info['type'] == "continuous":
    ---> 59                 gm = BayesianGaussianMixture(
         60                     self.n_clusters,
         61                     weight_concentration_prior_type='dirichlet_process',

    I could not figure out what causes this error. Do you know why this happens?

    opened by Zepp3 3
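An error of this shape is what Python raises when a constructor whose parameters are keyword-only is called with positional arguments. Recent scikit-learn releases declare the parameters of `BayesianGaussianMixture` as keyword-only, so passing `self.n_clusters` positionally, as in the traceback above, fails there. The sketch below reproduces the behavior with a stand-in class (`KeywordOnlyMixture` is hypothetical, not the scikit-learn class) and shows the keyword form that avoids it.

```python
class KeywordOnlyMixture:
    """Stand-in for an estimator whose constructor accepts keyword arguments only."""

    def __init__(self, *, n_components=1, weight_concentration_prior_type="dirichlet_process"):
        self.n_components = n_components
        self.weight_concentration_prior_type = weight_concentration_prior_type


try:
    KeywordOnlyMixture(10)  # positional call, mirrors the failing line
    positional_ok = True
except TypeError as err:
    positional_ok = False
    print(err)

# Passing the value by name works.
gm = KeywordOnlyMixture(n_components=10)
print(gm.n_components)  # 10
```

Under this assumption, one fix would be to write `n_components=self.n_clusters` in the `BayesianGaussianMixture(...)` call. Another would be to install the scikit-learn version the code was developed against.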
  • Datetime Object

    Hi Zilong,

    I'm using the CTAB-GAN to generate synthetic data. My dataset contains a datetime column with no missing values and I encounter the error message: could not convert string to float: '2010-04-01'. Could you please advise if the CTAB-GAN can handle datetime columns?

    Thank you for your time.


    opened by amieelxy 2
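The error suggests that date strings reach a step that expects numbers. One workaround, sketched below, is to convert the dates to integers before training and convert the generated numbers back afterwards. This assumes ISO-formatted dates such as '2010-04-01'. The helper names are hypothetical, and the approach is not something the repository documents.

```python
from datetime import date


def date_to_ordinal(text: str) -> int:
    """Convert an ISO date string (YYYY-MM-DD) to a day count (proleptic Gregorian ordinal)."""
    return date.fromisoformat(text).toordinal()


def ordinal_to_date(value: float) -> str:
    """Convert a (possibly non-integer) generated day count back to an ISO date string."""
    return date.fromordinal(int(round(value))).isoformat()


encoded = date_to_ordinal("2010-04-01")
print(ordinal_to_date(encoded))        # 2010-04-01
print(ordinal_to_date(encoded + 1.2))  # 2010-04-02
```

The encoded column could then be treated like any other numeric column, for example by listing it in `integer_columns`.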
  • Treating Continuous Variable with Missing Values as Mix Variable

    I would like to generate a synthetic dataset with your repo, and my continuous column contains missing values. Following your article, I treat the continuous column with missing values as a mixed variable, but the "mixed_columns" parameter of the CTABGAN object requires a "dictionary of column name and categorical modes used for "mix" of numeric and categorical distribution". I understand that for the "mortgage" case we should put down mode 0.0 to capture the special meaning of 0, but what mode should I put down for the missing values?

    Thank you!

    opened by amieelxy 2
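One way to apply the mixed-variable idea to missing values is to replace each missing entry with a sentinel number that cannot occur in the real data, then list that sentinel as a categorical mode of the column. The sketch below assumes this approach. The sentinel value, the column contents, and the helper are illustrative choices, not values confirmed by the repository.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

SENTINEL = -9999999.0  # assumed placeholder; must lie outside the real value range


def encode_missing(values: Sequence[Optional[float]]) -> Tuple[List[float], bool]:
    """Replace None/NaN with SENTINEL and report whether anything was replaced."""
    encoded, had_missing = [], False
    for v in values:
        if v is None or (isinstance(v, float) and math.isnan(v)):
            encoded.append(SENTINEL)
            had_missing = True
        else:
            encoded.append(float(v))
    return encoded, had_missing


# Hypothetical column with a special value 0.0 and two missing entries.
mortgage = [0.0, 120.5, None, 98.0, float("nan")]
encoded, had_missing = encode_missing(mortgage)

# Modes for the mixed column: the special value 0.0 plus the missing-value sentinel.
mixed_columns: Dict[str, List[float]] = {"mortgage": [0.0] + ([SENTINEL] if had_missing else [])}
print(encoded)        # [0.0, 120.5, -9999999.0, 98.0, -9999999.0]
print(mixed_columns)  # {'mortgage': [0.0, -9999999.0]}
```

After generation, rows where the column equals the sentinel can be mapped back to missing values.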
  • Any way to generate multiple datasets with same learned model?

    So when running the CTAB-GAN code as in the example on this repo, CTAB-GAN learns the inputted dataset and generates a new fake dataset.

    Is there any way to run it so it learns the input dataset once, and then generates multiple fake datasets?

    The alternative is to run the example code every time, which means the code has to learn the input dataset every time.

    Kind regards,


    opened by femurray 2
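Since training and sampling are separate calls in the example notebook (`fit()` and `generate_samples()`), one approach is to call `fit()` once and then call `generate_samples()` repeatedly. The sketch below illustrates the pattern with a stand-in object. `DummySynthesizer` and `generate_many` are hypothetical, and whether repeated sampling behaves as desired with the real `CTABGAN` class is an assumption to verify.

```python
import random
from typing import Any, List


def generate_many(synthesizer: Any, n_datasets: int) -> List[Any]:
    """Fit once, then draw n_datasets samples from the fitted model."""
    synthesizer.fit()
    return [synthesizer.generate_samples() for _ in range(n_datasets)]


class DummySynthesizer:
    """Stand-in with the same two method names as the notebook's synthesizer."""

    def __init__(self) -> None:
        self.fit_calls = 0

    def fit(self) -> None:
        self.fit_calls += 1  # the expensive training step would happen here

    def generate_samples(self) -> List[float]:
        return [random.random() for _ in range(3)]


synth = DummySynthesizer()
datasets = generate_many(synth, 5)
print(len(datasets), synth.fit_calls)  # 5 1
```

Each returned dataset could then be written to its own CSV file, as the notebook does inside its experiment loop.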
  • TypeError: __init__() takes 1 positional argument but 3 positional arguments (and 5 keyword-only arguments) were given

    For my project I remote control a PhD student's PC on campus as it is quicker than my own laptop. I do this so my code can run quicker.

    One issue is that CTAB-GAN runs fine (but slow) on my laptop, but can't get going on the PhD PC. I've made sure the python versions are the same, but the same error keeps getting thrown. The details are below.

    Traceback (most recent call last):

    Input In [6] in <cell line: 37>

    File ~\Desktop\python\CTAB-GAN\model\ in fit, categorical = self.data_prep.column_types["categorical"],

    File ~\Desktop\python\CTAB-GAN\model\synthesizer\ in fit

    File ~\Desktop\python\CTAB-GAN\model\synthesizer\ in fit gm = BayesianGaussianMixture(

    TypeError: __init__() takes 1 positional argument but 3 positional arguments (and 5 keyword-only arguments) were given

    When I google the error, it says to add the argument "self" to the function __init__(). However, all instances of __init__() in the file already have "self" as an argument. Do you have any insight into this issue?

    opened by femurray 2
  • Standardizing data prior to use

    The first question I have is, would it be worth standardizing my datasets before using CTAB-GAN on them?

    I haven't done this so far as the CTAB-GAN tutorial didn't include standardization, but realized after running cWGAN code (which does use standardizing) that it could potentially speed up the time it takes to run CTAB-GAN on each dataset, among other benefits.

    Or would doing so mess up the running of CTAB-GAN?

    opened by femurray 2
  • TypeError: loop of ufunc does not support argument 0 of type float which has no callable log method

    I have another question about CTAB-GAN. Did you also encounter the error above while developing the model? I guess it's because some values in the dataset (I try to apply my own dataset to CTAB-GAN) are non-float types. But I transformed every numeric column to float now by hand to make sure this is not the case but the error still occurs. Maybe you had the same problem and know a solution from your experience?

    opened by Zepp3 2
  • Saving fake data to .csv in jupyter notebook

    The line syn.to_csv(fake_file_root+"/"+dataset+"/"+dataset+"_fake_{exp}.csv".format(exp=i), index= False) in the third cell of the Jupyter notebook example won't work if the directory Fake_Datasets/Adult does not exist yet (as is the case in the repo), because pandas does not create missing directories. Maybe create the directory in the repo so one can just clone it, or do something like

    import os

    outdir = 'Fake_Datasets/Adult'
    if not os.path.exists(outdir):
        os.makedirs(outdir)  # create the output directory, including parent folders
    syn.to_csv(outdir+"/"+dataset+"_fake_{exp}.csv".format(exp=i), index= False)

    if I'm not wrong.

    Kind regards!

    opened by Zepp3 2
  • ImportError: cannot import name 'compute_associations' from 'dython.nominal'

    Hi, I found this error in It turns out dython was updated, and compute_associations was removed.

    According to, the compute_associations is replaced by associations(compute_only=True)['corr'].

    I edited the code in locally, editing line 10 to import associations instead of compute_associations and replacing "real_corr = compute_associations(real, nominal_columns=cat_cols)" on line 110 with "real_corr = associations(real, nominal_columns=cat_cols, compute_only=True)['corr']".

    Line 112 was updated accordingly too.

    I think this fixes the issue, just wanted to flag it here in case this affects the code in ways the developers might notice but that I wouldn't.


    opened by femurray 0
  • How to specify decimal places in generated data

    My input data has integers and decimals with two places, but the generated data has eight decimal places, even when the input is an integer. Is there a way to specify the number of decimal places or the data type (decimal/integer), or do I need to do that manually after data generation?

    opened by pgschr 3
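If the generated values need a fixed precision, rounding can be done as a post-processing step. The sketch below assumes the generated rows are available as dictionaries. The column names and the `round_columns` helper are hypothetical and not part of the repository.

```python
from typing import Dict, List, Sequence


def round_columns(rows: Sequence[Dict[str, float]], decimals: Dict[str, int]) -> List[Dict[str, float]]:
    """Round each listed column to its number of decimal places; 0 places yields an int."""
    result = []
    for row in rows:
        new_row = dict(row)
        for col, places in decimals.items():
            value = round(row[col], places)
            new_row[col] = int(value) if places == 0 else value
        result.append(new_row)
    return result


# Hypothetical generated rows with eight decimal places.
generated = [{"price": 12.34567891, "count": 3.00000012}]
print(round_columns(generated, {"price": 2, "count": 0}))  # [{'price': 12.35, 'count': 3}]
```

Columns not listed in `decimals` are left unchanged.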
  • Add features

    I added two features. On one hand, one can now choose the number of samples he or she wants to generate. On the other hand, all hyperparameters can be set according to one's application and/or preferences.

    opened by Zepp3 0
