{ "info": { "author": "MIT Data To AI Lab", "author_email": "dailabmit@gmail.com", "bugtrack_url": null, "classifiers": [ "Development Status :: 2 - Pre-Alpha", "Intended Audience :: Developers", "License :: OSI Approved :: MIT License", "Natural Language :: English", "Programming Language :: Python :: 3", "Programming Language :: Python :: 3.5", "Programming Language :: Python :: 3.6", "Programming Language :: Python :: 3.7" ], "description": "

\n\"DAI-Lab\"\nAn open source project from Data to AI Lab at MIT.\n

\n\n[![PyPi Shield](https://img.shields.io/pypi/v/TGAN.svg)](https://pypi.python.org/pypi/TGAN)\n[![Travis CI Shield](https://travis-ci.org/DAI-Lab/TGAN.svg?branch=master)](https://travis-ci.org/DAI-Lab/TGAN)\n\n# TGAN\n\nGenerative adversarial training for synthesizing tabular data.\n\nTGAN is a tabular data synthesizer. It can generate fully synthetic data from real data. Currently, TGAN can\ngenerate numerical columns and categorical columns.\n\n* Free software: MIT license\n* Documentation: https://DAI-Lab.github.io/TGAN\n* Homepage: https://github.com/DAI-Lab/TGAN\n\n# Requirements\n\n## Python\n\n**TGAN** has been developed and runs on Python [3.5](https://www.python.org/downloads/release/python-356/),\n[3.6](https://www.python.org/downloads/release/python-360/) and\n[3.7](https://www.python.org/downloads/release/python-370/).\n\nAlso, although it is not strictly required, the usage of a [virtualenv](https://virtualenv.pypa.io/en/latest/)\nis highly recommended in order to avoid interfering with other software installed in the system where **TGAN**\nis run.\n\n# Installation\n\nThe simplest and recommended way to install TGAN is using `pip`:\n\n```\npip install tgan\n```\n\nAlternatively, you can also clone the repository and install it from sources\n\n```\ngit clone git@github.com:DAI-Lab/TGAN.git\ncd TGAN\nmake install\n```\n\nFor development, you can use `make install-develop` instead in order to install all the required\ndependencies for testing and code linting.\n\n# Data Format\n\n## Input Format\n\nIn order to be able to sample new synthetic data, **TGAN** first needs to be *fitted* to\nexisting data.\n\nThe input data for this *fitting* process has to be a single table that satisfies the following\nrules:\n\n* Has no missing values.\n* Has columns of types `int`, `float`, `str` or `bool`.\n* Each column contains data of only one type.\n\nAn example of such a tables would be:\n\n| str_column | float_column | int_column | bool_column |\n|------------|--------------|------------|-------------|\n| 'green' | 0.15 | 10 | True |\n| 'blue' | 7.25 | 23 | False |\n| 'red' | 10.00 | 1 | False |\n| 'yellow' | 5.50 | 17 | True |\n\nAs you can see, this table contains 4 columns: `str_column`, `float_column`, `int_column` and\n`bool_column`, each one being an example of the supported value types. Notice aswell that there is\nno missing values for any of the rows.\n\n**NOTE**: It's important to have properly identifed which of the columns are numerical, which means\nthat they represent a magnitude, and which ones are categorical, as during the preprocessing of\nthe data, numerical and categorical columns will be processed differently.\n\n## Output Format\n\nThe output of **TGAN** is a table of sampled data with the same columns as the input table and as\nmany rows as requested.\n\n## Demo Datasets\n\n**TGAN** includes a few datasets to use for development or demonstration purposes. These datasets\ncome from the [UCI Machine Learning repository](http://archive.ics.uci.edu/ml), and have been\npreprocessed to be ready to use with **TGAN**, following the requirements specified in the\n[Input Format](#input-format) section.\n\nThese datasets can be browsed and directly downloaded from the\n[hdi-project-tgan AWS S3 Bucket](http://hdi-project-tgan.s3.amazonaws.com/index.html)\n\n### Census dataset\n\nThis dataset contains a single table, with information from the census, labeled with information of\nwheter or not the income of is greater than 50.000 $/year. It's a single csv file, containing\n199522 rows and 41 columns. From these 41 columns, only 7 are identified as continuous. In\n**TGAN** this dataset is called `census`.\n\n### Cover type\n\nThis dataset contains a single table with cartographic information labeled with the different\nforrest cover types. It's a single csv file, containing 465588 rows and 55 columns. From these\n55 columns, 10 are identified as continuous. In **TGAN** this dataset is called `covertype`.\n\n# Quickstart\n\nIn this short tutorial we will guide you through a series of steps that will help you getting\nstarted with the most basic usage of **TGAN** in order to generate samples from a given dataset.\n\n**NOTE**: The following examples are also covered in a [Jupyter](https://jupyter.org/) notebook,\nwhich you can execute by running the following commands inside your *virtualenv*:\n\n```\npip install jupyter\njupyter notebook examples/Usage_Example.ipynb\n```\n\n## 1. Load the data\n\nThe first step is to load the data wich we will use to fit TGAN. In order to do so, we will first\nimport the function `tgan.data.load_data` and call it with the name of the dataset that we want to\nload.\n\nIn this case, we will load the `census` dataset, which we will use during the subsequent steps,\nand obtain two objects:\n\n1. `data`, that will contain a `pandas.DataFrame` with the table of data from the `census`\ndataset ready to be used to fit the model.\n\n2. `continuous_columns`, that will contain a `list` with the indices of continuous columns.\n\n```\n>>> from tgan.data import load_demo_data\n>>> data, continuous_columns = load_demo_data('census')\n>>> data.head(3).T[:10]\n 0 1 2\n0 73 58 18\n1 Not in universe Self-employed-not incorporated Not in universe\n2 0 4 0\n3 0 34 0\n4 High school graduate Some college but no degree 10th grade\n5 0 0 0\n6 Not in universe Not in universe High school\n7 Widowed Divorced Never married\n8 Not in universe or children Construction Not in universe or children\n9 Not in universe Precision production craft & repair Not in universe\n\n>>> continuous_columns\n[0, 5, 16, 17, 18, 29, 38]\n\n```\n\n## 2. Create a TGAN instance\n\nThe next step is to import TGAN and create an instance of the model.\n\nTo do so, we need to import the `tgan.model.TGANModel` class and call it with the\n`continuous_columns` as unique argument.\n\nThis will create a TGAN instance with the default parameters:\n\n```\n>>> from tgan.model import TGANModel\n>>> tgan = TGANModel(continuous_columns)\n```\n\n## 3. Fit the model\n\nOnce you have a **TGAN** instance, you can proceed to call it's `fit` method passing the `data` that\nyou loaded before in order to start the fitting process:\n\n```\n>>> tgan.fit(data)\n```\n\nThis process will not return anything, however, the progress of the fitting will be printed in the\nscreen.\n\n**NOTE** Depending on the performance of the system you are running, and the parameters selected\nfor the model, this step can take up to a few hours.\n\n## 4. Sample new data\n\nAfter the model has been fitted, you are ready to generate new samples by calling the `sample`\nmethod of the `TGAN` instance passing it the desired amount of samples:\n\n```\n>>> num_samples = 1000\n>>> samples = tgan.sample(num_samples)\n>>> samples.head(3).T[:10]\n 0 1 2\n0 12 27 56\n\n0 12 27 56\n1 Not in universe Self-employed-not incorporated Private\n2 0 4 35\n3 0 34 22\n4 Children Some college but no degree Some college but no degree\n5 0 0 500\n6 Not in universe Not in universe Not in universe\n7 Never married Married-civilian spouse present Married-civilian spouse present\n8 Not in universe or children Construction Finance insurance and real estate\n9 Not in universe Precision production craft & repair Adm support including clerical\n\n```\n\nThe returned object, `samples`, is a `pandas.DataFrame` containing a table of synthetic data with\nthe same format as the input data and 1000 rows as we requested.\n\n## 5. Save and Load a model\n\nIn the steps above we saw that the fitting process can take a lot of time, so we probably would\nlike to avoid having to fit every we want to generate samples. Instead we can fit a model once,\nsave it, and load it every time we want to sample new data.\n\nIf we have a fitted model, we can save it by calling it's `save` method, that only takes\nas argument the path where the model will be stored. Similarly, the `TGANModel.load` allows to load\na model stored on disk by passing as argument the path where the model is stored.\n\n```\n>>> model_path = 'models/mymodel.pkl'\n>>> tgan.save(model_path)\nModel saved successfully.\n```\n\nBear in mind that in case the file already exists, **TGAN** will avoid overwritting it unless the\n`force=True` argument is passed:\n\n```\n>>> tgan.save(model_path)\nThe indicated path already exists. Use `force=True` to overwrite.\n```\n\nIn order to do so:\n\n```\n>>> tgan.save(model_path, force=True)\nModel saved successfully.\n```\n\nOnce the model is saved, it can be loaded back as a **TGAN** instance by using the `TGANModel.load`\nmethod:\n\n```\n>>> new_tgan = TGANModel.load(model_path)\n>>> new_samples = new_tgan.sample(num_samples)\n>>> new_samples.head(3).T[:10]\n\n 0 1 2\n0 12 27 56\n\n0 12 27 56\n1 Not in universe Self-employed-not incorporated Private\n2 0 4 35\n3 0 34 22\n4 Children Some college but no degree Some college but no degree\n5 0 0 500\n6 Not in universe Not in universe Not in universe\n7 Never married Married-civilian spouse present Married-civilian spouse present\n8 Not in universe or children Construction Finance insurance and real estate\n9 Not in universe Precision production craft & repair Adm support including clerical\n```\n\nAt this point we could use this model instance to generate more samples.\n\n# Loading custom datasets\n\nIn the previous steps we used some demonstration data but we did not show you how to load your own\ndataset.\n\nIn order to do so you will need to generate a `pandas.DataFrame` object from your dataset. If your\ndataset is in a `csv` format you can do so by using `pandas.read_csv` and passing to it the path to\nthe CSV file that you want to load.\n\nAdditionally, you will need to create 0-indexed list of columns indices to be considered continuous.\n\nFor example, if we want to load a local CSV file, `path/to/my.csv`, that has as continuous columns\ntheir first 4 columns, that is, indices `[0, 1, 2, 3]`, we would do it like this:\n\n```\n>>> import pandas as pd\n>>> data = pd.read_csv('data/census.csv')\n>>> continuous_columns = [0, 1, 2, 3]\n```\n\nNow you can use the `continuous_columns` to create a **TGAN** instance and use the `data` to `fit`\nit, like we did before:\n\n```\n>>> from tgan.model import TGANModel\n>>> tgan = TGANModel(continuous_columns)\n>>> tgan.fit(data)\n```\n\n# Model Parameters\n\nIf you want to change the default behavior of `TGANModel`, such as as different `batch_size` or\n`num_epochs`, you can do so by passing different arguments when creating the instance.\n\n## Model general behavior\n\n* continous_columns (`list[int]`, required): List of columns indices to be considered continuous.\n* output (`str`, default=`output`): Path to store the model and its artifacts.\n\n## Neural network definition and fitting\n\n* max_epoch (`int`, default=`100`): Number of epochs to use during training.\n* steps_per_epoch (`int`, default=`10000`): Number of steps to run on each epoch.\n* save_checkpoints(`bool`, default=True): Whether or not to store checkpoints of the model after each training epoch.\n* restore_session(`bool`, default=True): Whether or not continue training from the last checkpoint.\n* batch_size (`int`, default=`200`): Size of the batch to feed the model at each step.\n* z_dim (`int`, default=`100`): Number of dimensions in the noise input for the generator.\n* noise (`float`, default=`0.2`): Upper bound to the gaussian noise added to categorical columns.\n* l2norm (`float`, default=`0.00001`): L2 reguralization coefficient when computing losses.\n* learning_rate (`float`, default=`0.001`): Learning rate for the optimizer.\n* num_gen_rnn (`int`, default=`400`): Number of units in rnn cell in generator.\n* num_gen_feature (`int`, default=`100`): Number of units in fully connected layer in generator.\n* num_dis_layers (`int`, default=`2`): Number of layers in discriminator.\n* num_dis_hidden (`int`, default=`200`): Number of units per layer in discriminator.\n* optimizer (`str`, default=`AdamOptimizer`): Name of the optimizer to use during `fit`, possible\n values are: [`GradientDescentOptimizer`, `AdamOptimizer`, `AdadeltaOptimizer`].\n\nIf you wanted to create an identical instance to the one created on step 2, but passing the\narguments in a explicit way, this can be achieved with the following lines:\n\n```\n>>> from tgan.model import TGANModel\n>>> tgan = TGANModel(\n ...: continuous_columns,\n ...: output='output',\n ...: max_epoch=5,\n ...: steps_per_epoch=10000,\n ...: save_checkpoints=True,\n ...: restore_session=True,\n ...: batch_size=200,\n ...: z_dim=200,\n ...: noise=0.2,\n ...: l2norm=0.00001,\n ...: learning_rate=0.001,\n ...: num_gen_rnn=100,\n ...: num_gen_feature=100,\n ...: num_dis_layers=1,\n ...: num_dis_hidden=100,\n ...: optimizer='AdamOptimizer'\n ...: )\n```\n\n# Command-line interface\n\nWe include a command-line interface that allows users to access TGAN functionality. Currently only\none action is supported.\n\n## Random hyperparameter search\n\n### Input\n\nTo run random searchs for the best model hyperparameters for a given dataset, we will need:\n\n* A dataset, in a csv file, without any missing value, only columns of type `bool`, `str`, `int` or\n `float` and only one type for column, as specified in the [Input Format](#input-format).\n\n* A JSON file containing the configuration for the search. This configuration shall contain:\n\n * `name`: Name of the experiment. A folder with this name will be created.\n * `num_random_search`: Number of iterations in hyper parameter search.\n * `train_csv`: Path to the csv file containing the dataset.\n * `continuous_cols`: List of column indices, starting at 0, to be considered continuous.\n * `epoch`: Number of epoches to train the model.\n * `steps_per_epoch`: Number of optimization steps in each epoch.\n * `sample_rows`: Number of rows to sample when evaluating the model.\n\nYou can see an example of such a json file in [examples/config.json](examples/config.json), which you\ncan download and use as a template.\n\n### Execution\n\nOnce we have prepared everything we can launch the random hyperparameter search with this command:\n\n``` bash\ntgan experiments config.json results.json\n```\n\nWhere the first argument, `config.json`, is the path to your configuration JSON, and the second,\n`results.json`, is the path to store the summary of the execution.\n\nThis will run the random search, wich basically consist of the folling steps:\n\n1. We fetch and split our data between test and train.\n2. We randomly select the hyperparameters to test.\n3. Then, for each hyperparameter combination, we train a TGAN model using the real training data T\n and generate a synthetic training dataset Tsynth.\n4. We then train machine learning models on both the real and synthetic datasets.\n5. We use these trained models on real test data and see how well they perform.\n\n### Output\n\nAfter the experiment has finished, the following can be found:\n\n* A JSON file, in the example above called `results.json`, containing a summary of the experiments.\n This JSON will contain a key for each experiment `name`, and on it, an array of length\n `num_random_search`, with the selected parameters and its evaluation score. For a configuration\n like the example, the summary will look like this:\n\n``` python\n{\n 'census': [\n {\n \"steps_per_epoch\" : 10000,\n \"num_gen_feature\" : 300,\n \"num_dis_hidden\" : 300,\n \"batch_size\" : 100,\n \"num_gen_rnn\" : 400,\n \"score\" : 0.937802280415988,\n \"max_epoch\" : 5,\n \"num_dis_layers\" : 4,\n \"learning_rate\" : 0.0002,\n \"z_dim\" : 100,\n \"noise\" : 0.2\n },\n ... # 9 more nodes\n ]\n}\n```\n\n* A set of folders, each one names after the `name` specified in the JSON configuration, contained\nin the `experiments` folder. In each folder, sampled data and the models can be found. For a configuration\nlike the example, this will look like this:\n\n```\nexperiments/\n census/\n data/ # Sampled data with each of the models in the random search.\n model_0/\n logs/ # Training logs\n model/ # Tensorflow model checkpoints\n model_1/ # 9 more folders, one for each model in the random search\n ...\n```\n\n# Research\n\nThe first **TAGN** version was built as the supporting software for the [Synthesizing Tabular Data using Generative Adversarial Networks](https://arxiv.org/pdf/1811.11264.pdf) paper by Lei Xu and Kalyan Veeramachaneni.\n\nThe exact version of software mentioned in the paper can be found in the releases section as the [research pre-release](https://github.com/DAI-Lab/TGAN/releases/tag/research)\n\n# Citation\n\nIf you use TGAN, please cite the following work:\n\n> Lei Xu, Kalyan Veeramachaneni. 2018. Synthesizing Tabular Data using Generative Adversarial Networks.\n\n```LaTeX\n@article{xu2018synthesizing,\n title={Synthesizing Tabular Data using Generative Adversarial Networks},\n author={Xu, Lei and Veeramachaneni, Kalyan},\n journal={arXiv preprint arXiv:1811.11264},\n year={2018}\n}\n```\nYou can find the original paper [here](https://arxiv.org/pdf/1811.11264.pdf)\n\n\n# History\n\n## 0.1.0\n\n* First release on PyPI.\n\n\n", "description_content_type": "text/markdown", "docs_url": null, "download_url": "", "downloads": { "last_day": -1, "last_month": -1, "last_week": -1 }, "home_page": "https://github.com/DAI-Lab/TGAN", "keywords": "tgan", "license": "MIT license", "maintainer": "", "maintainer_email": "", "name": "tgan", "package_url": "https://pypi.org/project/tgan/", "platform": "", "project_url": "https://pypi.org/project/tgan/", "project_urls": { "Homepage": "https://github.com/DAI-Lab/TGAN" }, "release_url": "https://pypi.org/project/tgan/0.1.0/", "requires_dist": [ "numpy (>=1.16.0)", "pandas (>=0.23.4)", "scikit-learn (>=0.20.2)", "tensorflow (<2.0,>=1.13.0)", "tensorpack (==0.9.4)", "bumpversion (>=0.5.3) ; extra == 'dev'", "pip (>=9.0.1) ; extra == 'dev'", "watchdog (>=0.8.3) ; extra == 'dev'", "m2r (>=0.2.0) ; extra == 'dev'", "Sphinx (>=1.7.1) ; extra == 'dev'", "sphinx-rtd-theme (>=0.2.4) ; extra == 'dev'", "flake8 (>=3.5.0) ; extra == 'dev'", "isort (>=4.3.4) ; extra == 'dev'", "autoflake (>=1.1) ; extra == 'dev'", "autopep8 (>=1.3.5) ; extra == 'dev'", "twine (>=1.10.0) ; extra == 'dev'", "wheel (>=0.30.0) ; extra == 'dev'", "coverage (>=4.5.1) ; extra == 'dev'", "tox (>=2.9.1) ; extra == 'dev'", "pytest (>=3.4.2) ; extra == 'dev'", "pytest-cov (>=2.6.0) ; extra == 'dev'", "pytest (>=3.4.2) ; extra == 'test'", "pytest-cov (>=2.6.0) ; extra == 'test'" ], "requires_python": ">=3.5", "summary": "Generative adversarial training for synthesizing tabular data", "version": "0.1.0" }, "last_serial": 5231846, "releases": { "0.0.0": [ { "comment_text": "", "digests": { "md5": "9ca140361555d57f90759fd3782c2d08", "sha256": "cbbe0c96865dbc60f942b6b650276e9de1ecfa76a3d74a0ed385967e54fd9218" }, "downloads": -1, "filename": "tgan-0.0.0-py3-none-any.whl", "has_sig": false, "md5_digest": "9ca140361555d57f90759fd3782c2d08", "packagetype": "bdist_wheel", "python_version": "py3", "requires_python": null, "size": 1137, "upload_time": "2019-05-01T13:39:11", "url": "https://files.pythonhosted.org/packages/cd/bc/372b708211907a9902a332239c48ace427dba920f22dd9160f982cf9ed46/tgan-0.0.0-py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "f0479e0b93249fd714f5ef834b66bd7e", "sha256": "baca16ac96a8334bc0382825fb092688f0f54add07d9033f14a0579bbbed499c" }, "downloads": -1, "filename": "tgan-0.0.0.tar.gz", "has_sig": false, "md5_digest": "f0479e0b93249fd714f5ef834b66bd7e", "packagetype": "sdist", "python_version": "source", "requires_python": null, "size": 961, "upload_time": "2019-05-01T13:39:13", "url": "https://files.pythonhosted.org/packages/f0/0a/6ed3103c7a6ce899d43be407c634ff8e9ebc6569de6f7ec103c58eb6fc4d/tgan-0.0.0.tar.gz" } ], "0.1.0": [ { "comment_text": "", "digests": { "md5": "36431430dc6ca7f699cab0301d587c7b", "sha256": "76a42879e0969826e57c9cbcc5eb8b1d5c3c3c37624a91b4faa45e9c78f56fcd" }, "downloads": -1, "filename": "tgan-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "36431430dc6ca7f699cab0301d587c7b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.5", "size": 27237, "upload_time": "2019-05-06T09:52:59", "url": "https://files.pythonhosted.org/packages/90/e1/3174c7fb5e2bc6b28771797e74af060f05cb7275aa222e40be56108c303e/tgan-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8b0b9b0c912edcf94d5c0e082571669c", "sha256": "20f5335815cf587f7ef9ac81a784dccecfab77a64d99319f9776070efa302e84" }, "downloads": -1, "filename": "tgan-0.1.0.tar.gz", "has_sig": false, "md5_digest": "8b0b9b0c912edcf94d5c0e082571669c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 83895, "upload_time": "2019-05-06T09:53:03", "url": "https://files.pythonhosted.org/packages/95/40/c25a8a9663c9b7d52fd531207347ef8f2dc2b4fa00cd9ae15299fb3647ff/tgan-0.1.0.tar.gz" } ] }, "urls": [ { "comment_text": "", "digests": { "md5": "36431430dc6ca7f699cab0301d587c7b", "sha256": "76a42879e0969826e57c9cbcc5eb8b1d5c3c3c37624a91b4faa45e9c78f56fcd" }, "downloads": -1, "filename": "tgan-0.1.0-py2.py3-none-any.whl", "has_sig": false, "md5_digest": "36431430dc6ca7f699cab0301d587c7b", "packagetype": "bdist_wheel", "python_version": "py2.py3", "requires_python": ">=3.5", "size": 27237, "upload_time": "2019-05-06T09:52:59", "url": "https://files.pythonhosted.org/packages/90/e1/3174c7fb5e2bc6b28771797e74af060f05cb7275aa222e40be56108c303e/tgan-0.1.0-py2.py3-none-any.whl" }, { "comment_text": "", "digests": { "md5": "8b0b9b0c912edcf94d5c0e082571669c", "sha256": "20f5335815cf587f7ef9ac81a784dccecfab77a64d99319f9776070efa302e84" }, "downloads": -1, "filename": "tgan-0.1.0.tar.gz", "has_sig": false, "md5_digest": "8b0b9b0c912edcf94d5c0e082571669c", "packagetype": "sdist", "python_version": "source", "requires_python": ">=3.5", "size": 83895, "upload_time": "2019-05-06T09:53:03", "url": "https://files.pythonhosted.org/packages/95/40/c25a8a9663c9b7d52fd531207347ef8f2dc2b4fa00cd9ae15299fb3647ff/tgan-0.1.0.tar.gz" } ] }